Error in the multi-node multi-GPU training experiment #313

Open
opened 2024-10-28 17:32:57 +08:00 by 11293382865cs · 4 comments

In the multi-node multi-GPU model debugging chapter, after looking up the task0 machine, I ran the following command on task0:
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
and the following command on task1:
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=1 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
After that, the following error occurred.
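As a reading aid for the two commands in this report, here is a minimal sketch of the process layout their settings imply. The values of `NPROC_PER_NODE` and `NNODES` come from the commands; the rank formula is an assumption based on the usual torchrun-style layout, and the helper names `first_rank` and `last_rank` are made up for illustration.

```shell
#!/usr/bin/env bash
# Values taken from the commands in this issue.
NPROC_PER_NODE=4   # processes (one per GPU) started on each machine
NNODES=2           # number of machines taking part

# Total number of training processes across all machines.
WORLD_SIZE=$((NPROC_PER_NODE * NNODES))
echo "world size: $WORLD_SIZE"

# Assumption: global rank = NODE_RANK * NPROC_PER_NODE + local rank,
# which is the usual layout for torchrun-style launchers.
first_rank() { echo $(( $1 * NPROC_PER_NODE )); }
last_rank()  { echo $(( ($1 + 1) * NPROC_PER_NODE - 1 )); }

for NODE_RANK in 0 1; do
    echo "NODE_RANK=$NODE_RANK -> global ranks $(first_rank "$NODE_RANK")-$(last_rank "$NODE_RANK")"
done
```

The only setting that differs between the two machines is `NODE_RANK`; `ADDR` and `PORT` point both launches at the same rendezvous address on task0.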


This is probably a path problem. In the command
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
try replacing llama2_7b_chat_lora_lawyer_e3_copy.py with the full path to llama2_7b_chat_lora_lawyer_e3_copy.py and see whether it works. The same applies to the second command.
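A minimal sketch of that suggestion, to be run on each node before launching. `CONFIG_DIR` is a hypothetical placeholder (the thread does not say where the config lives), and `check_config` is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# CONFIG_DIR is a placeholder: set it to the directory that actually holds
# the config on this machine. It defaults to the current directory.
CONFIG_DIR="${CONFIG_DIR:-$PWD}"
CONFIG="$CONFIG_DIR/llama2_7b_chat_lora_lawyer_e3_copy.py"

check_config() {
    # Succeeds only if the path is an existing, readable regular file.
    [ -f "$1" ] && [ -r "$1" ]
}

if check_config "$CONFIG"; then
    echo "config found: $CONFIG"
else
    echo "config not found: $CONFIG (check the path on this node)" >&2
fi
```

If the check passes, `"$CONFIG"` can be passed to `xtuner train` in place of the bare file name, so the launch no longer depends on the current working directory.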


Please make sure you have run `cd code`; these commands should be run from the code directory.

Author

> This is probably a path problem. In the command
> NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
> try replacing llama2_7b_chat_lora_lawyer_e3_copy.py with the full path to llama2_7b_chat_lora_lawyer_e3_copy.py and see whether it works. The same applies to the second command.

That problem is solved, but now running the training command reports insufficient permissions.
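The report does not say which file or directory triggered the permission error. One thing that could be checked, purely as an assumption, is whether the directory passed to `--work-dir` is writable by the current user on each node. The helper name `dir_is_writable` is made up for this sketch.

```shell
#!/usr/bin/env bash
# Work directory taken from --work-dir in the commands above.
WORK_DIR="${WORK_DIR:-/userhome/xtuner-workdir3}"

dir_is_writable() {
    # True if the path is an existing directory that the current user
    # can both enter and write to.
    [ -d "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}

if dir_is_writable "$WORK_DIR"; then
    echo "ok: $(id -un) can write to $WORK_DIR"
else
    echo "cannot write to $WORK_DIR as $(id -un)" >&2
    # Show owner and mode if the directory exists, to help diagnose.
    ls -ld "$WORK_DIR" 2>/dev/null || echo "$WORK_DIR does not exist" >&2
fi
```

If this check passes on both nodes, the permission error most likely comes from somewhere else, and the full error message would be needed to narrow it down.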


There is a problem here, and the `run` variable above has the same problem.
[Screenshot attached]


Reference: HswOAuth/llm_course#313