Multi-node multi-GPU training fails #314

Open
opened 2024-10-28 17:50:46 +08:00 by 11322648042cs · 1 comment

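Training fails in a multi-node, multi-GPU environment. I replaced the master IP address (as shown in the screenshots), but it still fails!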


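Make sure you have run cd code first; the commands here should be run from the code directory.

This looks like a path problem. In the command
NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=0 xtuner train llama2_7b_chat_lora_lawyer_e3_copy.py --work-dir /userhome/xtuner-workdir3 --deepspeed deepspeed_zero3_offload
try replacing llama2_7b_chat_lora_lawyer_e3_copy.py with the full path to llama2_7b_chat_lora_lawyer_e3_copy.py and see whether it succeeds. The same applies to the second command.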
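
The path check suggested above can be sketched as a small shell helper. This is only an illustration, not part of the course material: the function names resolve_config and print_launch_cmd are hypothetical, the flag is written as --deepspeed on the assumption that this is the spelling xtuner expects, and NODE_RANK=1 for the second machine is likewise an assumption. The helper only prints the command so it can be inspected before running it.

```shell
# Resolve a config file name to an absolute path, failing early if it is missing.
# (Hypothetical helper; not part of xtuner.)
resolve_config() {
  if [ ! -f "$1" ]; then
    echo "config not found: $1 (current directory: $PWD)" >&2
    return 1
  fi
  # Absolute directory of the file plus its base name.
  echo "$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
}

# Print (do not run) the launch command for one node.
# $1 = node rank, $2 = config file (relative or absolute).
print_launch_cmd() {
  local node_rank="$1" abs_cfg
  abs_cfg="$(resolve_config "$2")" || return 1
  echo "NPROC_PER_NODE=4 NNODES=2 PORT=22222 ADDR=10.244.209.56 NODE_RANK=${node_rank} xtuner train ${abs_cfg} --work-dir /userhome/xtuner-workdir3 --deepspeed deepspeed_zero3_offload"
}
```

For example, after cd code, running print_launch_cmd 0 llama2_7b_chat_lora_lawyer_e3_copy.py on the master and print_launch_cmd 1 llama2_7b_chat_lora_lawyer_e3_copy.py on the other machine either prints a command containing the full config path or reports that the file is not in the current directory, which separates a path problem from a networking problem.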

Reference: HswOAuth/llm_course#314