LLM Course Session 3: Follow-Along Practice and Walkthrough Video #88
Reference: HswOAuth/llm_course#88
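Prepare the API Key
First, apply for an API key according to your needs.
Create the Development Environment
Training will run on 2 machines with 8 GPUs in total, so first create 2 notebooks on the platform.
Open each notebook and enter VS Code.
Open a folder and enter /code/ to go to the code directory.
Open a terminal from the top-right corner.
Set Environment Variables
Network configuration
On both servers, configure the HuggingFace mirror and the network proxy: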
export HF_HOME=/code/huggingface-cache/
export HF_ENDPOINT=https://hf-mirror.com
export http_proxy=http://10.10.9.50:3000
export https_proxy=http://10.10.9.50:3000
export no_proxy=localhost,127.0.0.1
Set the corresponding API key:
export ZHIPUAI_API_KEY=*****
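Because the same variables must be exported on both servers, a quick check before training can catch one that was missed. The following is a minimal sketch, not part of the original steps; `check_env` is a hypothetical helper name, and it only checks that each variable is non-empty.

```shell
# Hypothetical helper: print every listed variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup through eval keeps this usable in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Check the variables exported in the steps above.
check_env HF_HOME HF_ENDPOINT http_proxy https_proxy no_proxy ZHIPUAI_API_KEY \
    || echo "export the variables listed above before training"
```

Run it in the terminal of each notebook; no output means every variable is set in that shell.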
Configure the InfiniBand (IB) NIC
On each of the two servers, run:
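export NCCL_DEBUG=INFO            # log NCCL initialization and transport details
export NCCL_IB_DISABLE=0          # 0 keeps the InfiniBand transport enabled
export NCCL_IB_HCA=mlx5           # use IB adapters whose names start with mlx5
export NCCL_SOCKET_IFNAME=eth0    # network interface for NCCL socket traffic
export GLOO_SOCKET_IFNAME=eth0    # network interface for Gloo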
Start Training
Get the master IP address
Choose either server (notebook) as the master, and use ifconfig to get its IP address.
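To avoid copying the address by hand, the IPv4 address can be pulled out of the ifconfig output. This is a sketch under assumptions: `first_ipv4` is a hypothetical helper, and `eth0` is assumed to be the right interface only because it is the one named in the NCCL settings above.

```shell
# Hypothetical helper: read `ifconfig <iface>` output on stdin and print
# the first IPv4 address. Handles both "inet 10.0.0.1" and the older
# "inet addr:10.0.0.1" layouts; inet6 lines are ignored.
first_ipv4() {
    awk '$1 == "inet" { sub(/^addr:/, "", $2); print $2; exit }'
}

# Usage on the master server (uncomment to run):
# ifconfig eth0 | first_ipv4
```

The printed address is the value to pass as ADDR when launching training.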
Launch training
Using the master IP address just obtained, launch training with the commands below. The two commands differ only in NODE_RANK.
On the first server, run:
NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.132.114 NODE_RANK=0 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
On the second server, run:
NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.132.114 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
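Since only the node rank changes between the two servers, the command can be generated from one template, which avoids typos when retyping it. The sketch below only prints the command for review; `build_train_cmd` is a hypothetical helper, and the default address is the example IP used above, which should be replaced with your own master IP.

```shell
# Hypothetical helper: print the launch command for one node.
#   $1 = node rank (0 for the master, 1 for the second server)
#   $2 = master IP address (defaults to the example address used above)
build_train_cmd() {
    node_rank=$1
    master_addr=${2:-10.244.132.114}
    echo "NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=$master_addr NODE_RANK=$node_rank" \
         "xtuner train llama2_7b_chat_qlora_sql_e3_copy.py" \
         "--work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload"
}

# Dry run: show the command for each server.
build_train_cmd 0
build_train_cmd 1
```

Each printed line is the command for the server with that rank; paste it into that server's terminal once it looks right.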
After Training
Model conversion
Convert the checkpoint from LoRA training into a HuggingFace-format model:
xtuner convert pth_to_hf /code/llama2_7b_chat_qlora_sql_e3_copy.py /code/xtuner-workdir/iter_500.pth/ /code/iter_500_hf/
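The command above hard-codes iter_500.pth. If your run saved checkpoints at other iterations, the newest one can be located first. This sketch assumes checkpoints follow the iter_N.pth naming seen above and that GNU `sort -V` is available; `latest_checkpoint` is a hypothetical helper, and the script only prints the conversion command.

```shell
# Hypothetical helper: print the highest-numbered iter_*.pth entry in a
# work directory. Version sort (GNU sort -V) orders iter_500 before iter_1000.
latest_checkpoint() {
    ls -d "$1"/iter_*.pth 2>/dev/null | sort -V | tail -n 1
}

# Dry run: print the conversion command for the newest checkpoint, if any.
ckpt=$(latest_checkpoint /code/xtuner-workdir)
if [ -n "$ckpt" ]; then
    echo "xtuner convert pth_to_hf /code/llama2_7b_chat_qlora_sql_e3_copy.py $ckpt /code/iter_500_hf/"
fi
```

Adjust the output directory name to match the checkpoint you convert.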
Run the test
Test the fine-tuned result with the following command:
python final_test.py
The results after fine-tuning are as follows: