Help request #215
Reference: HswOAuth/llm_course#215
11-24.8.29 Prompt Engineering Deployment - 基德老师
For this lesson, running the notebook to launch the fine-tuning job fails. Judging from the task0 log, the request appears to be rejected by task1. Looking at the task1 log, the main error is:
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
It failed on two consecutive runs.
The command was entered correctly.
The full log is as follows:
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.47.249 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-17 19:12:30,056] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING]
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:13:23,347] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,452] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,466] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,580] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,746] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.747555 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316361024
I1017 19:13:25.748116 435 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316249952
I1017 19:13:25.748484 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317510800
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,926] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.927821 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309610208
I1017 19:13:25.928458 436 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309497088
I1017 19:13:25.928822 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310757040
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,989] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.990494 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331114656
I1017 19:13:25.991246 438 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331001456
I1017 19:13:25.991477 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332262704
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:26,022] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:26.023967 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331345600
I1017 19:13:26.024638 437 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331230928
I1017 19:13:26.024919 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332494336
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
Traceback (most recent call last):
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
loop = LOOPS.build(return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
main()return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
return self.build_func(cfg, *args, **kwargs, registry=self)dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
loop = LOOPS.build(
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return func(*args, **kwargs)return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dist.broadcast_object_list(objects, src=0)
dataloader = runner.build_dataloader( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
return _unpickler(io.BytesIO(buf)).load() memory_mapped_stream = pa.memory_map(filename)
table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
I1017 19:13:46.448297 437 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1017 19:13:46.518741 438 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
I1017 19:13:46.599797 435 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1017 19:13:46.670159 436 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
[2024-10-17 19:13:53,949] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 435) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 436)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 437)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 438)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 435)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1480
inet 10.244.96.101 netmask 255.255.255.255 broadcast 10.244.96.101
ether 8a:49:21:d5:d4:00 txqueuelen 0 (Ethernet)
RX packets 3061 bytes 587751 (587.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2433 bytes 6017827 (6.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 1454 bytes 441940 (441.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1454 bytes 441940 (441.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code#
From the screenshots provided, it looks like the training data sql_datasets cannot be found.
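The task1 traceback also hints at how that happens: in xtuner's process_hf_dataset, rank 0 preprocesses the dataset and dist.broadcast_object_list sends the result to the other ranks; what gets broadcast is a pickled datasets.Dataset backed by memory-mapped Arrow cache files, and unpickling it on the second node tries to reopen those files at rank 0's local path (/root/.cache/huggingface/datasets/...), which does not exist there. Below is a minimal sketch of that failure mode, standalone and not part of the course code (the /tmp path is made up for illustration):

```python
# Minimal sketch: a datasets.Dataset backed by memory-mapped Arrow files pickles
# only the *paths* of its cache files, so unpickling on a machine where those
# paths do not exist fails just like the task1 traceback.
import pickle

from datasets import Dataset

ds = Dataset.from_dict({"question": ["q"], "answer": ["SELECT 1"]})
ds.save_to_disk("/tmp/sql_datasets_demo")                    # writes .arrow files
ds_mmap = Dataset.load_from_disk("/tmp/sql_datasets_demo")   # now memory-mapped

blob = pickle.dumps(ds_mmap)  # roughly what dist.broadcast_object_list transmits

# On the node that has /tmp/sql_datasets_demo this succeeds; on a node without
# that path (task1 vs. rank 0's /root/.cache/...), pickle.loads raises
# FileNotFoundError: Failed to open local file '...arrow'.
print(pickle.loads(blob))
```

So the cache that rank 0 writes needs to live at a path that also exists on the other node, which is what pointing HF_HOME at /code below is meant to achieve.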
After entering the two notebooks, first make sure both are in the code directory; if not, run cd code first.
Next, set the environment variables.
Configure the HuggingFace mirror and the network proxy on both servers:
export HF_HOME=/code/huggingface-cache/
export HF_ENDPOINT=https://hf-mirror.com
export http_proxy=http://10.10.9.50:3000
export https_proxy=http://10.10.9.50:3000
export no_proxy=localhost,127.0.0.1
Set the corresponding API key:
export ZHIPUAI_API_KEY=*****
Then, on each of the two servers, enter:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
Then relaunch the training.
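One optional check before relaunching: confirm on both nodes that the exports took effect and that the datasets cache now points under /code (this assumes /code exists on both task0 and task1, as the shell prompts above suggest) rather than the node-local /root/.cache. The snippet below is just a sanity check, not part of the course material:

```python
# Sanity check: run on both nodes in the same shell where the exports were made.
# It prints where the datasets library will place its cache-*.arrow files.
# HF_HOME must be exported before Python imports `datasets`, which is why the
# variables have to be set before launching xtuner.
import os

import datasets

print("HF_HOME            =", os.environ.get("HF_HOME"))
print("datasets cache dir  =", datasets.config.HF_DATASETS_CACHE)
```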
1. What fails for me is the fine-tuning, not the training, and fine-tuning shouldn't require setting environment variables either.
2. Both are already under the code directory, as the log screenshots show.
Host (task0) log:
root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=0 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:43:19,765] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING]
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:44:27,005] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,067] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,105] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,223] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,322] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.323154 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302548912
I1021 18:44:29.323541 159 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302435168
I1021 18:44:29.324101 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303696912
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,486] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-21 18:44:29,486] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.486964 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331055216
I1021 18:44:29.487335 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=330940544
I1021 18:44:29.487955 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332205008
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,523] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.524077 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302394592
I1021 18:44:29.524586 160 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302282032
I1021 18:44:29.525072 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303540880
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,578] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.579651 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323494592
I1021 18:44:29.580243 161 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323381392
I1021 18:44:29.580680 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=324643504
I1021 18:44:31.178613 158 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
10/21 18:44:31 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1486578312
GPU 0,1,2,3: Z100SM
CUDA_HOME: /opt/dtk
NVCC: Not Available
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201703
Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
HIP Runtime 5.7.24164
MIOpen 2.15.4
Magma 2.7.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,
TorchVision: 0.16.0
OpenCV: 4.9.0
MMEngine: 0.10.3
Runtime environment:
launcher: pytorch
randomness: {'seed': None, 'deterministic': False}
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
10/21 18:44:31 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 8
batch_size = 4
betas = (
0.9,
0.999,
)
custom_hooks = [
dict(
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.engine.hooks.DatasetInfoHook'),
]
data_path = '/dataset/datasets/sql_datasets'
dataloader_num_workers = 0
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=500,
max_keep_ckpts=2,
type='mmengine.hooks.CheckpointHook'),
logger=dict(
interval=10,
log_metric_by_epoch=False,
type='mmengine.hooks.LoggerHook'),
param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_dataset_length = 16000
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict(
llm=dict(
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
torch_dtype='torch.bfloat16',
trust_remote_code=True,
type='transformers.AutoModelForCausalLM.from_pretrained'),
lora=dict(
bias='none',
lora_alpha=16,
lora_dropout=0.1,
r=64,
task_type='CAUSAL_LM',
type='peft.LoraConfig'),
type='xtuner.model.SupervisedFinetune',
use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
optimizer=dict(
betas=(
0.9,
0.999,
),
lr=0.0002,
type='torch.optim.AdamW',
weight_decay=0),
type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
dict(
begin=0,
by_epoch=True,
convert_to_iter_based=True,
end=0.03,
start_factor=1e-05,
type='mmengine.optim.LinearLR'),
dict(
begin=0.03,
by_epoch=True,
convert_to_iter_based=True,
end=1,
eta_min=0.0,
type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/dataset/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
config=dict(
bf16=dict(enabled=True),
fp16=dict(enabled=False, initial_scale_power=16),
gradient_accumulation_steps='auto',
gradient_clipping='auto',
train_micro_batch_size_per_gpu='auto',
zero_allow_untested_optimizer=True,
zero_force_ds_cpu_optimizer=False,
zero_optimization=dict(
offload_optimizer=dict(device='cpu', pin_memory=True),
offload_param=dict(device='cpu', pin_memory=True),
overlap_comm=True,
stage=3,
stage3_gather_16bit_weights_on_model_save=True)),
exclude_frozen_parameters=True,
gradient_accumulation_steps=8,
gradient_clipping=1,
sequence_parallel_size=1,
train_micro_batch_size_per_gpu=4,
type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
batch_size=4,
collate_fn=dict(
type='xtuner.dataset.collate_fns.default_collate_fn',
use_varlen_attn=False),
dataset=dict(
dataset=dict(
path='/dataset/datasets/sql_datasets',
type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False),
num_workers=0,
sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
dataset=dict(
path='/dataset/datasets/sql_datasets', type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/code/xtuner-workdir'
10/21 18:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
10/21 18:44:31 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DatasetInfoHook
(VERY_LOW ) CheckpointHook
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
before_val:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_val_epoch:
(NORMAL ) IterTimerHook
before_val_iter:
(NORMAL ) IterTimerHook
after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_val:
(VERY_HIGH ) RuntimeInfoHook
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
before_test:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_test_epoch:
(NORMAL ) IterTimerHook
before_test_iter:
(NORMAL ) IterTimerHook
after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test:
(VERY_HIGH ) RuntimeInfoHook
after_run:
(BELOW_NORMAL) LoggerHook
10/21 18:44:31 - mmengine - INFO - xtuner_dataset_timeout = 0:30:00
HF google storage unreachable. Downloading and preparing it from source
Generating train split: 78577 examples [00:00, 159509.50 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 32907.08 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 35358.70 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 44504.11 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:02<00:00, 7307.34 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 31808.07 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 30408.74 examples/s]
10/21 18:44:49 - mmengine - WARNING - Dataset Dataset has no metainfo. dataset_meta in visualizer will be None.
I1021 18:44:54.326004 214 ProcessGroupNCCL.cpp:391] [Rank 1] found async exception when checking for NCCL errors: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
I1021 18:44:54.326390 214 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
E1021 18:44:54.326437 214 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1021 18:44:54.326457 214 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E1021 18:44:54.326539 214 ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
[2024-10-21 18:44:57,587] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 158 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 160 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 161 closing signal SIGTERM
[2024-10-21 18:44:57,753] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 159) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:44:57
host : odc6af99c5f9401ba2d75273ca4c97bf-task0-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 159)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 159
root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code#
task1 log:
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:46:04,204] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING]
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:14,015] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,036] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,054] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,069] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,456] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.457777 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303812352
I1021 18:46:16.458443 152 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303698768
I1021 18:46:16.458766 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=304961568
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,460] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.462054 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309843648
I1021 18:46:16.463582 151 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309729440
I1021 18:46:16.464460 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310991744
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,466] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.467456 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307881488
[2024-10-21 18:46:16,467] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.468452 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317008944
I1021 18:46:16.468518 154 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307768832
I1021 18:46:16.468761 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309028992
I1021 18:46:16.469139 153 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316894864
I1021 18:46:16.469415 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=318156256
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignoreobj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs) return func(*args, **kwargs)
return func(*args, **kwargs)
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size) object_list[i] = _tensor_to_object(obj_view, obj_size)
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
return _unpickler(io.BytesIO(buf)).load() File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)table = _memory_mapped_arrow_table_from_file(path)
table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
opened_stream = _memory_mapped_record_batch_reader_from_file(filename) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundErrorFileNotFoundError: : FileNotFoundErrorFileNotFoundError[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory: :
[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
I1021 18:46:36.582319 152 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I1021 18:46:36.606400 151 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1021 18:46:36.653246 153 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1021 18:46:36.713783 154 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2024-10-21 18:46:44,351] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 151) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 153)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 154)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code#
Please take a closer look. Judging from the task1 log, a file seems to be missing: FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
There is also this error.
After looking at this carefully, it is indeed caused by a missing environment variable (HF_HOME was not set). Without HF_HOME, datasets writes its .arrow cache files under the node-local /root/.cache/huggingface on task0, so when rank 0 broadcasts the processed dataset, the ranks on task1 try to memory-map that same path and fail, because the file does not exist on their node.
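A quick way to confirm this (a sketch only; the dataset hash directory is copied from the log above) is to run the same check in both the task0 and task1 containers:

# If the .arrow cache files exist only on task0, the cache directory is
# node-local rather than shared, which matches the FileNotFoundError above.
echo "HF_HOME=${HF_HOME:-<unset>}"
ls /root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/ | head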
/code is a shared directory on the platform, so HF_HOME should be set to a path under that shared directory, so that all nodes read and write the same dataset cache.
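A minimal sketch of the fix, assuming a hypothetical cache directory /code/huggingface_cache on the shared /code volume; set the variable in every shell on both nodes before launching:

export HF_HOME=/code/huggingface_cache   # hypothetical path on the shared volume
mkdir -p "$HF_HOME"
# then launch the xtuner train command as before on each node; every rank now
# resolves the datasets cache to the same shared location.

After changing HF_HOME, the dataset cache will simply be rebuilt under the new location on the next run.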