Multi-node multi-GPU training fails #314
Reference: HswOAuth/llm_course#314
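Training fails in the multi-node multi-GPU environment. I replaced the master's IP address (as shown in the screenshot), but it still fails!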
Please make sure you have run cd code first; the commands here should be run from inside the code directory.
It is probably a path problem. In the command
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 --deepspeed deepspeed_zero3_offload
try replacing llama2_7b_chat_lora_lawyer_e3_copy.py with the full path to llama2_7b_chat_lora_lawyer_e3_copy.py and see whether it succeeds. The same applies to the later command.
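To make the full-path suggestion concrete, here is a minimal Python sketch that builds the launch command for each node with the config file resolved to an absolute path. The helper name `build_launch` is hypothetical and not part of xtuner. The default address, port, process counts, and work directory are copied from the command in this thread, and the sketch assumes the second node uses NODE_RANK=1. It only constructs and prints the command; it does not start training.

```python
import shlex
from pathlib import Path


def build_launch(config, node_rank, addr="10.244.209.56", port=22222,
                 nproc_per_node=4, nnodes=2,
                 work_dir="/userhome/xtuner-workdir3"):
    """Return (env, argv) for one node. Hypothetical helper, not an xtuner API."""
    # Resolve the config to an absolute path so the command no longer
    # depends on the directory it is launched from.
    config_path = Path(config).expanduser().resolve()
    env = {
        "NPROC_PER_NODE": str(nproc_per_node),
        "NNODES": str(nnodes),
        "PORT": str(port),
        "ADDR": addr,
        "NODE_RANK": str(node_rank),
    }
    argv = [
        "xtuner", "train", str(config_path),
        "--work-dir", work_dir,
        "--deepspeed", "deepspeed_zero3_offload",
    ]
    return env, argv


def format_command(env, argv):
    """Render the launch as a single shell line for copy and paste."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return prefix + " " + " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    config = "llama2_7b_chat_lora_lawyer_e3_copy.py"
    if not Path(config).is_file():
        # A missing file here usually means the working directory is wrong.
        print(f"warning: {config} not found in {Path.cwd()}")
    for rank in (0, 1):  # assumed: rank 0 is the master, rank 1 the second node
        print(format_command(*build_launch(config, rank)))
```

Running the script from the code directory prints one command per node. If the warning appears, the config file is not in the current directory, which matches the path problem suspected above.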