Error in multi-node multi-GPU training experiment #313
4 Participants
Reference: HswOAuth/llm_course#313
In the multi-node multi-GPU model debugging chapter, after locating the task0 machine, I ran the following command on task0:
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
and the following command on task1:
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=1 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
After that, the following error occurred:
This is probably a path problem. In the command
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 -- deepspeed deepspeed_zero3_offload
try replacing llama2_7b_chat_lora_lawyer_e3_copy.py with the full path to llama2_7b_chat_lora_lawyer_e3_copy.py and see whether it succeeds. The same applies to the second command.
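One way to apply the full-path suggestion is to resolve the config file to an absolute path before launching, and to stop early if it is not found from the current directory. The sketch below is only an illustration: the helper name resolve_config is hypothetical and is not part of xtuner.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of xtuner): print the absolute path of a
# config file, or fail with a message if it cannot be found.
resolve_config() {
    local cfg="$1"
    if [ ! -f "$cfg" ]; then
        echo "config not found: $cfg (current directory: $PWD)" >&2
        return 1
    fi
    # Build the absolute path from the file's directory and its base name.
    echo "$(cd "$(dirname "$cfg")" && pwd)/$(basename "$cfg")"
}

# Usage sketch (commented out, so nothing is launched here):
# CONFIG=$(resolve_config llama2_7b_chat_lora_lawyer_e3_copy.py) || exit 1
# ...then pass "$CONFIG" to xtuner train in place of the bare file name,
# keeping the remaining arguments as in the commands above.
```

If the helper prints "config not found", the command is being run from a directory that does not contain the config file, which matches the path problem suspected here.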
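Please make sure you have run
cd code
first; these commands should be run from the code directory.
That problem is solved. Later, when I run the training command, it reports insufficient permissions.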
There is a problem at this spot, and the run variable above has the same problem.
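The thread does not show the full permission error, so its cause is not known from the text alone. One thing that can be ruled out quickly, purely as an assumption about a possible cause, is whether the directory passed to --work-dir can be written by the current user. The helper name check_workdir below is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a work directory is writable by the
# current user. This only tests one assumed cause of a permission error.
check_workdir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        # A directory that does not exist yet needs a writable parent.
        dir=$(dirname "$dir")
    fi
    if [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "not writable: $dir (user: $(id -un))" >&2
        return 1
    fi
}

# Usage sketch (commented out): run on both task0 and task1 before training.
# check_workdir /userhome/xtuner-workdir3 || exit 1
```

If the directory is writable on both machines, the permission error comes from somewhere else, for example the files the reply above refers to.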
