Help request #215

Open
opened 2024-10-17 19:25:47 +08:00 by 11252177484cs · 7 comments

Course: 11-24.8.29提示词工程部署-基德老师
In this course, running the notebook to launch the fine-tuning job fails. Judging from the task0 log, the request appears to have been rejected by task1. Looking at the task1 log, the main error is:
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory

I ran it twice in a row, and it failed both times.
The command was entered correctly.
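
For reference, a quick sanity check (not part of the course steps; the path is simply the one from the error above) is to see on the task1 node whether that cached arrow shard exists at all:

    ls -lh /root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/ \
        || echo "dataset cache directory is missing on this node"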

The full log follows:

root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.47.249 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-17 19:12:30,056] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING]
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:13:23,347] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,452] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,466] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,580] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,746] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.747555 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316361024
I1017 19:13:25.748116 435 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316249952
I1017 19:13:25.748484 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317510800
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,926] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.927821 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309610208
I1017 19:13:25.928458 436 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309497088
I1017 19:13:25.928822 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310757040
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,989] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.990494 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331114656
I1017 19:13:25.991246 438 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331001456
I1017 19:13:25.991477 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332262704
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:26,022] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:26.023967 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331345600
I1017 19:13:26.024638 437 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331230928
I1017 19:13:26.024919 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332494336
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
Traceback (most recent call last):
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
loop = LOOPS.build(return self.build_func(cfg, *args, **kwargs, registry=self)

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
main()return self.build_func(cfg, *args, **kwargs, registry=self)

  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
return self.build_func(cfg, *args, **kwargs, registry=self)dataset = DATASETS.build(dataset_cfg)

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
loop = LOOPS.build(
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return func(*args, **kwargs)return self.build_func(cfg, *args, **kwargs, registry=self)

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dist.broadcast_object_list(objects, src=0)
dataloader = runner.build_dataloader( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()return func(*args, **kwargs)

File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
return _unpickler(io.BytesIO(buf)).load() memory_mapped_stream = pa.memory_map(filename)

table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in __setstate__
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map

File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundErrorFileNotFoundError: : [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directoryFileNotFoundError[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory

FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory:
[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
I1017 19:13:46.448297 437 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1017 19:13:46.518741 438 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
I1017 19:13:46.599797 435 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1017 19:13:46.670159 436 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
[2024-10-17 19:13:53,949] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 435) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED

Failures:
[1]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 436)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 437)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 438)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 435)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1480
inet 10.244.96.101 netmask 255.255.255.255 broadcast 10.244.96.101
ether 8a:49:21:d5:d4:00 txqueuelen 0 (Ethernet)
RX packets 3061 bytes 587751 (587.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2433 bytes 6017827 (6.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 1454 bytes 441940 (441.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1454 bytes 441940 (441.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code#


From the screenshot provided, it looks like the training data sql_datasets cannot be found.

After opening the two notebooks, please first make sure both of them are in the code directory; if not, run cd code first.

image
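
Before launching, a quick way to verify both points on each notebook is shown below (the dataset path is the data_path value from the training config dumped later in this thread; adjust it if yours differs):

    pwd                                  # should print /code on both task0 and task1
    ls /dataset/datasets/sql_datasets    # the SQL dataset the training config points at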

Next, set the environment variables:

  1. Network configuration
    On both servers, configure the HuggingFace mirror and the network proxy:
    export HF_HOME=/code/huggingface-cache/
    export HF_ENDPOINT=https://hf-mirror.com
    export http_proxy=http://10.10.9.50:3000
    export https_proxy=http://10.10.9.50:3000
    export no_proxy=localhost,127.0.0.1

Set the corresponding API key:
export ZHIPUAI_API_KEY=*****

  2. InfiniBand (IB) NIC configuration
    On both servers, run:
    export NCCL_DEBUG=INFO
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA=mlx5
    export NCCL_SOCKET_IFNAME=eth0
    export GLOO_SOCKET_IFNAME=eth0

Then start the training again; a consolidated sketch of the full per-node sequence follows below.
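
For reference, here is a minimal end-to-end sketch of what each node would run, assembled from the steps above and the launch command visible in the logs of this thread. The proxy address and API key are simply the values quoted above, and <task0-ip> / <node-rank> are placeholders to fill in for your own session:

    # run on BOTH nodes (task0 and task1)
    cd /code

    # step 1: HuggingFace mirror, proxy and API key
    export HF_HOME=/code/huggingface-cache/
    export HF_ENDPOINT=https://hf-mirror.com
    export http_proxy=http://10.10.9.50:3000
    export https_proxy=http://10.10.9.50:3000
    export no_proxy=localhost,127.0.0.1
    export ZHIPUAI_API_KEY=*****

    # step 2: NCCL / IB settings
    export NCCL_DEBUG=INFO
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA=mlx5
    export NCCL_SOCKET_IFNAME=eth0
    export GLOO_SOCKET_IFNAME=eth0

    # launch: NODE_RANK=0 on task0, NODE_RANK=1 on task1, ADDR = task0's IP
    NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=<task0-ip> NODE_RANK=<node-rank> \
        xtuner train llama2_7b_chat_qlora_sql_e3_copy.py \
        --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload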

Author

1. What failed is the fine-tuning run, not training, and the fine-tuning should not require setting those environment variables either.
2. Both nodes were already in the code directory, as the log screenshots show.

Author

Log from the master node (task0):
root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=0 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:43:19,765] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING]
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:44:27,005] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,067] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,105] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,223] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,322] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.323154 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302548912
I1021 18:44:29.323541 159 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302435168
I1021 18:44:29.324101 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303696912
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,486] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-21 18:44:29,486] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.486964 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331055216
I1021 18:44:29.487335 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=330940544
I1021 18:44:29.487955 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332205008
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,523] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.524077 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302394592
I1021 18:44:29.524586 160 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302282032
I1021 18:44:29.525072 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303540880
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,578] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.579651 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323494592
I1021 18:44:29.580243 161 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323381392
I1021 18:44:29.580680 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=324643504
I1021 18:44:31.178613 158 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
10/21 18:44:31 - mmengine - INFO -

System environment:
sys.platform: linux
Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1486578312
GPU 0,1,2,3: Z100SM
CUDA_HOME: /opt/dtk
NVCC: Not Available
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201703

  • Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • HIP Runtime 5.7.24164

  • MIOpen 2.15.4

  • Magma 2.7.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,

    TorchVision: 0.16.0
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Runtime environment:
launcher: pytorch
randomness: {'seed': None, 'deterministic': False}
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 8

10/21 18:44:31 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 8
batch_size = 4
betas = (
0.9,
0.999,
)
custom_hooks = [
dict(
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.engine.hooks.DatasetInfoHook'),
]
data_path = '/dataset/datasets/sql_datasets'
dataloader_num_workers = 0
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=500,
max_keep_ckpts=2,
type='mmengine.hooks.CheckpointHook'),
logger=dict(
interval=10,
log_metric_by_epoch=False,
type='mmengine.hooks.LoggerHook'),
param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_dataset_length = 16000
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict(
llm=dict(
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
torch_dtype='torch.bfloat16',
trust_remote_code=True,
type='transformers.AutoModelForCausalLM.from_pretrained'),
lora=dict(
bias='none',
lora_alpha=16,
lora_dropout=0.1,
r=64,
task_type='CAUSAL_LM',
type='peft.LoraConfig'),
type='xtuner.model.SupervisedFinetune',
use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
optimizer=dict(
betas=(
0.9,
0.999,
),
lr=0.0002,
type='torch.optim.AdamW',
weight_decay=0),
type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
dict(
begin=0,
by_epoch=True,
convert_to_iter_based=True,
end=0.03,
start_factor=1e-05,
type='mmengine.optim.LinearLR'),
dict(
begin=0.03,
by_epoch=True,
convert_to_iter_based=True,
end=1,
eta_min=0.0,
type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/dataset/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
config=dict(
bf16=dict(enabled=True),
fp16=dict(enabled=False, initial_scale_power=16),
gradient_accumulation_steps='auto',
gradient_clipping='auto',
train_micro_batch_size_per_gpu='auto',
zero_allow_untested_optimizer=True,
zero_force_ds_cpu_optimizer=False,
zero_optimization=dict(
offload_optimizer=dict(device='cpu', pin_memory=True),
offload_param=dict(device='cpu', pin_memory=True),
overlap_comm=True,
stage=3,
stage3_gather_16bit_weights_on_model_save=True)),
exclude_frozen_parameters=True,
gradient_accumulation_steps=8,
gradient_clipping=1,
sequence_parallel_size=1,
train_micro_batch_size_per_gpu=4,
type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
batch_size=4,
collate_fn=dict(
type='xtuner.dataset.collate_fns.default_collate_fn',
use_varlen_attn=False),
dataset=dict(
dataset=dict(
path='/dataset/datasets/sql_datasets',
type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False),
num_workers=0,
sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
dataset=dict(
path='/dataset/datasets/sql_datasets', type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/code/xtuner-workdir'

10/21 18:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
10/21 18:44:31 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook

before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DatasetInfoHook
(VERY_LOW ) CheckpointHook

before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook

before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook

after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

before_val:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook

before_val_epoch:
(NORMAL ) IterTimerHook

before_val_iter:
(NORMAL ) IterTimerHook

after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

after_val:
(VERY_HIGH ) RuntimeInfoHook

after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook

before_test:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook

before_test_epoch:
(NORMAL ) IterTimerHook

before_test_iter:
(NORMAL ) IterTimerHook

after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_test:
(VERY_HIGH ) RuntimeInfoHook

after_run:
(BELOW_NORMAL) LoggerHook

10/21 18:44:31 - mmengine - INFO - xtuner_dataset_timeout = 0:30:00
HF google storage unreachable. Downloading and preparing it from source
Generating train split: 78577 examples [00:00, 159509.50 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 32907.08 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 35358.70 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 44504.11 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:02<00:00, 7307.34 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 31808.07 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 30408.74 examples/s]
10/21 18:44:49 - mmengine - WARNING - Dataset Dataset has no metainfo. dataset_meta in visualizer will be None.
I1021 18:44:54.326004 214 ProcessGroupNCCL.cpp:391] [Rank 1] found async exception when checking for NCCL errors: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
I1021 18:44:54.326390 214 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
E1021 18:44:54.326437 214 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1021 18:44:54.326457 214 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E1021 18:44:54.326539 214 ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
[2024-10-21 18:44:57,587] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 158 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 160 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 161 closing signal SIGTERM
[2024-10-21 18:44:57,753] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 159) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:44:57
host : odc6af99c5f9401ba2d75273ca4c97bf-task0-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 159)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 159

root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code#
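
From this task0 log alone there is no Python traceback: rank 1 (pid 159) is taken down by the NCCL watchdog with SIGABRT (exitcode -6) after the recv from the peer node is reset, so the task0 failure looks like a downstream symptom, and the real error has to be read from the task1 log below. If more detail on the communication side were needed, one option (a debugging sketch only, using standard NCCL/PyTorch variables rather than anything course-specific) would be to relaunch node 0 with verbose distributed logging:

# Debugging sketch, not a fix: enable verbose NCCL / torch.distributed logging,
# then run the same launch command as before on node 0.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=0 \
  xtuner train llama2_7b_chat_qlora_sql_e3_copy.py \
  --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload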

Author

task1 log:
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:46:04,204] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING]
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:14,015] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,036] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,054] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,069] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,456] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.457777 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303812352
I1021 18:46:16.458443 152 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303698768
I1021 18:46:16.458766 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=304961568
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,460] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.462054 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309843648
I1021 18:46:16.463582 151 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309729440
I1021 18:46:16.464460 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310991744
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,466] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.467456 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307881488
[2024-10-21 18:46:16,467] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.468452 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317008944
I1021 18:46:16.468518 154 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307768832
I1021 18:46:16.468761 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309028992
I1021 18:46:16.469139 153 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316894864
I1021 18:46:16.469415 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=318156256
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)return self.build_func(cfg, *args, **kwargs, registry=self)

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(

File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)dataset = DATASETS.build(dataset_cfg)

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)

File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignoreobj = obj_cls(**args) # type: ignore

File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs) return func(*args, **kwargs)
return func(*args, **kwargs)
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size) object_list[i] = _tensor_to_object(obj_view, obj_size)
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object

File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
return _unpickler(io.BytesIO(buf)).load() File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate

File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)table = _memory_mapped_arrow_table_from_file(path)

table = _memory_mapped_arrow_table_from_file(path)  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file

table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file

File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
opened_stream = _memory_mapped_record_batch_reader_from_file(filename) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file

File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundErrorFileNotFoundError: : FileNotFoundErrorFileNotFoundError[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory: :

[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory

I1021 18:46:36.582319 152 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I1021 18:46:36.606400 151 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1021 18:46:36.653246 153 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1021 18:46:36.713783 154 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2024-10-21 18:46:44,351] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 151) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED

Failures:
[1]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 153)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 154)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code#
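
The task1 traceback ends inside xtuner's process_hf_dataset (huggingface.py, line 314): rank 0 on task0 runs the map/filter steps, then the processed Dataset object is broadcast from rank 0 via dist.broadcast_object_list; when the ranks on task1 unpickle it, datasets tries to memory-map the Arrow cache files under /root/.cache/huggingface/datasets/sql_datasets/..., which apparently exist only on the task0 node, so all four workers here fail with FileNotFoundError. A workaround I am considering (unverified; the shared path below is my own assumption, not something from the course material) is to point the datasets cache of both nodes at a directory mounted on task0 and task1, so the broadcast object references files every rank can open:

# Workaround sketch, unverified: use a datasets cache on a mount shared by both nodes.
# /dataset/hf_datasets_cache is an assumed shared, writable path; adjust to the cluster.
export HF_DATASETS_CACHE=/dataset/hf_datasets_cache
mkdir -p "$HF_DATASETS_CACHE"
# then launch the usual command on each node, e.g. on task1:
NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 \
  xtuner train llama2_7b_chat_qlora_sql_e3_copy.py \
  --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload

If /dataset is read-only or not actually shared between the two pods, the same idea would need a different shared location.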

task1 log:

root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:46:04,204] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:46:14,015] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)   [and three more such lines, one per local rank]
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.   [once per local rank]
[2024-10-21 18:46:16,456] [INFO] [comm.py:637:init_distributed] cdb=None   [once per local rank]
I1021 18:46:16 ProcessGroupNCCL.cpp:686] [Rank 4] / [Rank 5] / [Rank 6] / [Rank 7] ProcessGroupNCCL initialization options: NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF
/bin/sh: 1: /opt/dtk/bin/nvcc: not found   [once per local rank]

[All four local ranks (global ranks 4-7) then raised the same exception; their interleaved tracebacks are shown here once:]

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
    self._train_loop = self.build_train_loop(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
    loop = LOOPS.build(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
    dataloader = runner.build_dataloader(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
    dist.broadcast_object_list(objects, src=0)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
    object_list[i] = _tensor_to_object(obj_view, obj_size)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in __setstate__
    table = _memory_mapped_arrow_table_from_file(path)
  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
    opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
    memory_mapped_stream = pa.memory_map(filename)
  File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory

I1021 18:46:36.582319 152 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I1021 18:46:36.606400 151 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1021 18:46:36.653246 153 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1021 18:46:36.713783 154 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2024-10-21 18:46:44,351] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 151) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[Failures [2] and [3] are identical except for rank : 6 (local_rank: 2), pid 153 and rank : 7 (local_rank: 3), pid 154.]
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code#
Author

Please take a closer look. From the task1 log, it seems a file is missing:
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory

Author

/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
There is also this error.


Having looked at this carefully: it is indeed caused by a missing environment variable (HF_HOME is not set).

On the platform, /code is the shared directory, so HF_HOME should be set to a path under that shared directory.
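For example, as a minimal sketch (the cache path /code/huggingface_cache is only an assumed example; any directory under the shared /code will do), HF_HOME can be prefixed to the training command on each node the same way the other variables are passed:

HF_HOME=/code/huggingface_cache NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload

(On task0 use NODE_RANK=0 and keep everything else the same.) With HF_HOME on shared storage, the processed Arrow cache files that rank 0 writes under $HF_HOME/datasets are also visible to the ranks on the other node, so the dataset object broadcast to them can actually be opened instead of failing with the FileNotFoundError above.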
