Multi-node multi-GPU fine-tuning -- day_27 -- 20250307 -- Large Language Model Training -- Prompt Engineering -- Multi-node multi-GPU fine-tuning and fastgpt model deployment -- Teacher Lin Xi #608

Open
opened 2025-03-08 17:43:26 +08:00 by 11301095554cs · 1 comment

Error reported on the worker node:
root@rb5e1ae3fa8d4877a434b16f230c36d9-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.59.113 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir ./xtuner-workdir2 --deepspeed deepspeed_zero3_offload
[2025-03-08 17:35:06,034] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING]
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] *****************************************
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] *****************************************
[2025-03-08 17:35:15,256] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,267] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,326] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,346] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,563] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.565097 13768 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=325088960
I0308 17:35:17.566758 13768 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=324977344
I0308 17:35:17.567394 13768 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=326237248
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,689] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.691030 13766 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=301595808
I0308 17:35:17.691620 13766 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=301482224
I0308 17:35:17.691999 13766 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302740544
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,734] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.736542 13767 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=299028336
I0308 17:35:17.738215 13767 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=298914864
I0308 17:35:17.739015 13767 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=300176272
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,818] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.819934 13769 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309198256
I0308 17:35:17.821812 13769 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309085600
I0308 17:35:17.822396 13769 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310347104
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
runner = RUNNERS.build(cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
runner = runner_cls.from_cfg(args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
runner = cls(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 363, in init
self.strategy = self.build_strategy(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 665, in build_strategy
strategy_obj = STRATEGIES.build(strategy)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 17, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 286, in init
super().init(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 78, in init
self._setup_env(**self._env_kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 230, in _setup_env
broadcast(timestamp)
File "/opt/conda/lib/python3.10/site-packages/mmengine/dist/dist.py", line 312, in broadcast
torch_dist.broadcast(data_on_device, src, group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1328, remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connect to 10.244.59.113<36627> failed : Connection refused
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
runner = RUNNERS.build(cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
runner = runner_cls.from_cfg(args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
runner = cls(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 363, in init
self.strategy = self.build_strategy(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 665, in build_strategy
strategy_obj = STRATEGIES.build(strategy)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 17, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 286, in init
super().init(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 78, in init
self._setup_env(**self._env_kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 230, in _setup_env
broadcast(timestamp)
File "/opt/conda/lib/python3.10/site-packages/mmengine/dist/dist.py", line 312, in broadcast
torch_dist.broadcast(data_on_device, src, group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1328, remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connect to 10.244.59.113<36627> failed : Connection refused
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
runner = RUNNERS.build(cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
runner = runner_cls.from_cfg(args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
runner = cls(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 363, in init
self.strategy = self.build_strategy(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 665, in build_strategy
strategy_obj = STRATEGIES.build(strategy)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 17, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 286, in init
super().init(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 78, in init
self._setup_env(**self._env_kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 230, in _setup_env
broadcast(timestamp)
File "/opt/conda/lib/python3.10/site-packages/mmengine/dist/dist.py", line 312, in broadcast
torch_dist.broadcast(data_on_device, src, group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1328, remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connect to 10.244.59.113<36627> failed : Connection refused
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
runner = RUNNERS.build(cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
runner = runner_cls.from_cfg(args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
runner = cls(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 363, in init
self.strategy = self.build_strategy(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 665, in build_strategy
strategy_obj = STRATEGIES.build(strategy)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 17, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 286, in init
super().init(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 78, in init
self._setup_env(**self._env_kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 230, in _setup_env
broadcast(timestamp)
File "/opt/conda/lib/python3.10/site-packages/mmengine/dist/dist.py", line 312, in broadcast
torch_dist.broadcast(data_on_device, src, group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1328, remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connect to 10.244.59.113<36627> failed : Connection refused
I0308 17:35:43.405612 13768 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I0308 17:35:43.433612 13767 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I0308 17:35:43.453066 13766 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I0308 17:35:43.500288 13769 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2025-03-08 17:35:50,842] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 13766) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 13767)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 13768)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 13769)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 13766)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This is a communication problem. In multi-node, multi-GPU training, the more machines involved, the higher the chance of a communication failure at startup; restarting the job a few times is usually enough.
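
Before relaunching, it can also help to confirm from the worker node that the rank-0 node is reachable. The sketch below is only an illustrative check, not part of the original thread: it probes the rendezvous address taken from the launch command above (ADDR=10.244.59.113, PORT=12345). Note that the port in the NCCL error (36627) is chosen dynamically by NCCL and will differ between runs, so only the fixed rendezvous port is tested here.

# Minimal connectivity check, run on the worker node (NODE_RANK=1) before relaunching.
# Assumes the master rendezvous address is ADDR=10.244.59.113 and PORT=12345,
# as in the xtuner launch command above; adjust the values for your own cluster.
import socket

MASTER_ADDR = "10.244.59.113"
MASTER_PORT = 12345

try:
    # Try to open a TCP connection to the master rendezvous port.
    with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5):
        print(f"{MASTER_ADDR}:{MASTER_PORT} is reachable -- relaunch training on both nodes.")
except OSError as exc:
    print(f"Cannot reach {MASTER_ADDR}:{MASTER_PORT}: {exc}")
    print("Make sure the NODE_RANK=0 process is started first and the port is open between pods.")

If the check fails, start the NODE_RANK=0 job on the master node first (same command, with NODE_RANK=0) and verify that the pods can reach each other before starting the worker.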
