Fine-tuning an open-source model with xtuner (Teacher Lin Xi): running with DeepSpeed acceleration reports a communication error, how do I fix it? #756

Open
opened 2025-06-30 17:37:20 +08:00 by 11900303041cs · 0 comments

运行 NPROC_PER_NODE=4 xtuner train /code/llama2_7b_chat_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3
An error is raised. The output is too long, so I've only excerpted the part I think is important:
![image](/attachments/e993d0d4-6c62-426c-87ee-48e6059df009)
```
RuntimeError: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:

ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1 communicators on CUDA device 0

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
```
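Since rank 3 only reports that it could not hear back from rank 0, the actual cause should be in the rank 0 log (running with `NCCL_DEBUG=INFO` makes NCCL print more detail). To check whether NCCL communication between the 4 GPUs works at all, independent of xtuner and DeepSpeed, a minimal `torch.distributed` all_reduce test can be run first. The sketch below is only an assumption-level example: the file name `nccl_check.py` is illustrative, and it would be launched with `torchrun --nproc_per_node=4 nccl_check.py`.

```python
# Minimal NCCL sanity check (a sketch, not part of the xtuner config).
# Launch with: torchrun --nproc_per_node=4 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; after all_reduce every rank should
    # hold the sum 0 + 1 + 2 + 3 = 6 if NCCL communication works.
    t = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this test also hangs or fails, the problem is likely in the NCCL/driver environment (for example GPU visibility, peer-to-peer settings, or shared-memory limits inside a container) rather than in llama2_7b_chat_qlora_alpaca_e3_copy.py or the deepspeed_zero3 setting.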
