Multi-node multi-GPU fine-tuning -- day_27 -- 20250307 -- Large language model training module -- Prompt engineering -- Multi-node multi-GPU fine-tuning and fastgpt model deployment -- Teacher Lin Xi #608
The worker node (NODE_RANK=1) reports an error:
root@rb5e1ae3fa8d4877a434b16f230c36d9-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.59.113 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir ./xtuner-workdir2 --deepspeed deepspeed_zero3_offload
[2025-03-08 17:35:06,034] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING]
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] *****************************************
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-03-08 17:35:10,750] torch.distributed.run: [WARNING] *****************************************
[2025-03-08 17:35:15,256] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,267] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,326] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-08 17:35:15,346] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,563] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.565097 13768 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=325088960
I0308 17:35:17.566758 13768 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=324977344
I0308 17:35:17.567394 13768 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=326237248
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,689] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.691030 13766 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=301595808
I0308 17:35:17.691620 13766 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=301482224
I0308 17:35:17.691999 13766 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302740544
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,734] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.736542 13767 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=299028336
I0308 17:35:17.738215 13767 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=298914864
I0308 17:35:17.739015 13767 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=300176272
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2025-03-08 17:35:17,818] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0308 17:35:17.819934 13769 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309198256
I0308 17:35:17.821812 13769 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309085600
I0308 17:35:17.822396 13769 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310347104
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
runner = RUNNERS.build(cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
runner = runner_cls.from_cfg(args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
runner = cls(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 363, in init
self.strategy = self.build_strategy(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 665, in build_strategy
strategy_obj = STRATEGIES.build(strategy)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 17, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 286, in init
super().init(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 78, in init
self._setup_env(**self._env_kwargs)
File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 230, in _setup_env
broadcast(timestamp)
File "/opt/conda/lib/python3.10/site-packages/mmengine/dist/dist.py", line 312, in broadcast
torch_dist.broadcast(data_on_device, src, group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1328, remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connect to 10.244.59.113<36627> failed : Connection refused
[The identical traceback, ending in the same "ncclRemoteError ... Net : Connect to 10.244.59.113<36627> failed : Connection refused" error, is printed by the other three local ranks as well.]
I0308 17:35:43.405612 13768 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I0308 17:35:43.433612 13767 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I0308 17:35:43.453066 13766 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I0308 17:35:43.500288 13769 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2025-03-08 17:35:50,842] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 13766) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 13767)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 13768)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 13769)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2025-03-08_17:35:50
host : rb5e1ae3fa8d4877a434b16f230c36d9-task1-0.rb5e1ae3fa8d4877a434b16f230c36d9.4192a7b7d7864c08ac303a9b31873fb9.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 13766)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This is a communication problem: the log shows "Net : Connect to 10.244.59.113<36627> failed : Connection refused", i.e. the processes on this node could not reach the master node at ADDR=10.244.59.113. With multi-node multi-GPU training, the more machines are involved, the more likely such communication failures become; restarting the job a few times usually gets it through.
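If restarting alone does not help, a quick sanity check is to make sure the node with NODE_RANK=0 is launched first and that its address and port are actually reachable from this worker node, since "Connection refused" usually means nothing is listening on that port yet. A minimal sketch to run on the worker node, reusing the ADDR/PORT from the command above and assuming ping and nc are available in the container (NCCL_DEBUG is a generic NCCL environment variable for extra logging, not something taken from this issue):

MASTER_ADDR=10.244.59.113   # ADDR of the rank-0 node, copied from the launch command
MASTER_PORT=12345           # PORT of the rank-0 node, copied from the launch command

ping -c 3 "$MASTER_ADDR"              # basic network reachability between the two pods
nc -zv "$MASTER_ADDR" "$MASTER_PORT"  # the port only opens after rank 0 has been started

# Relaunch with NCCL debug output to see where the connection fails
# (assumption: adds logging only, does not change training behaviour).
NCCL_DEBUG=INFO \
NPROC_PER_NODE=4 NNODES=2 PORT=$MASTER_PORT ADDR=$MASTER_ADDR NODE_RANK=1 \
xtuner train llama2_7b_chat_qlora_sql_e3_copy.py \
    --work-dir ./xtuner-workdir2 --deepspeed deepspeed_zero3_offload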