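Fine-tuning an open-source model with xtuner (instructor Lin Xi): running with DeepSpeed acceleration reports a communication error. How do I fix it? #756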
Reference: HswOAuth/llm_course#756
Command I ran: NPROC_PER_NODE=4 xtuner train /code/llama2_7b_chat_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3

It fails with the error below. The output was very long, so I only kept the parts I thought were important:
RuntimeError: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
Original exception:
ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
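
Since the message above says to check the rank 0 logs, and the full output is long, a small script can help find which rank the other ranks were waiting on. This is only a rough sketch: suspect_ranks is a hypothetical helper name, not part of xtuner, DeepSpeed, or PyTorch, and the pattern assumes the monitoredBarrier wording is exactly as in the excerpt above.

```python
import re
from collections import Counter

# Matches the monitoredBarrier message as it appears in the excerpt above.
# Assumption: the wording is identical across ranks in the full log.
BARRIER_RE = re.compile(
    r"Rank (\d+) successfully reached monitoredBarrier, "
    r"but received errors while waiting for send/recv from rank (\d+)"
)


def suspect_ranks(log_text):
    """Count how often each rank is named as the peer that did not respond."""
    counts = Counter()
    for match in BARRIER_RE.finditer(log_text):
        # group(1) is the rank that reported; group(2) is the rank it waited on.
        counts[int(match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "RuntimeError: Rank 3 successfully reached monitoredBarrier, "
        "but received errors while waiting for send/recv from rank 0. "
        "Please check rank 0 logs for faulty rank."
    )
    for rank, hits in suspect_ranks(sample).most_common():
        print(f"rank {rank} named as unresponsive peer in {hits} message(s)")
```

The rank that is named most often is the one whose own traceback is worth reading first. The ChildFailedError at the end is only the launcher reporting that a child process exited.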