Help request #215
Reference: HswOAuth/llm_course#215
11-24.8.29 Prompt Engineering Deployment - 基德老师
For this lesson, running the notebook to launch the fine-tuning job fails. Judging from the task0 log, the request appears to be rejected by task1. Looking at the task1 log, the main error is:
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
It failed on two consecutive runs.
The command was entered correctly.
The full log is as follows:
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.47.249 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-17 19:12:30,056] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING]
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-17 19:12:34,807] torch.distributed.run: [WARNING] *****************************************
[2024-10-17 19:13:23,347] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,452] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,466] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-17 19:13:23,580] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,746] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.747555 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316361024
I1017 19:13:25.748116 435 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316249952
I1017 19:13:25.748484 435 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317510800
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,926] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.927821 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309610208
I1017 19:13:25.928458 436 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309497088
I1017 19:13:25.928822 436 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310757040
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:25,989] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:25.990494 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331114656
I1017 19:13:25.991246 438 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331001456
I1017 19:13:25.991477 438 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332262704
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-17 19:13:26,022] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 19:13:26.023967 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331345600
I1017 19:13:26.024638 437 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331230928
I1017 19:13:26.024919 437 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332494336
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
Traceback (most recent call last):
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
loop = LOOPS.build(return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
main()return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
return self.build_func(cfg, *args, **kwargs, registry=self)dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
loop = LOOPS.build(
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return func(*args, **kwargs)return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dist.broadcast_object_list(objects, src=0)
dataloader = runner.build_dataloader( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
return _unpickler(io.BytesIO(buf)).load() memory_mapped_stream = pa.memory_map(filename)
table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-8685b93f2724b679_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
I1017 19:13:46.448297 437 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1017 19:13:46.518741 438 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
I1017 19:13:46.599797 435 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1017 19:13:46.670159 436 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
[2024-10-17 19:13:53,949] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 435) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 436)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 437)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 438)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-17_19:13:53
host : ac7bde4c31ec4e699535bd897ffca227-task1-0.ac7bde4c31ec4e699535bd897ffca227.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 435)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1480
inet 10.244.96.101 netmask 255.255.255.255 broadcast 10.244.96.101
ether 8a:49:21:d5:d4:00 txqueuelen 0 (Ethernet)
RX packets 3061 bytes 587751 (587.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2433 bytes 6017827 (6.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 1454 bytes 441940 (441.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1454 bytes 441940 (441.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
root@ac7bde4c31ec4e699535bd897ffca227-task1-0:/code#
From the screenshots provided, it looks like the training data sql_datasets cannot be found.
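The task1 traceback also hints at how that happens: in xtuner's process_hf_dataset, rank 0 preprocesses the dataset and dist.broadcast_object_list sends the result to the other ranks; what gets broadcast is a pickled datasets.Dataset backed by memory-mapped Arrow cache files, and unpickling it on the second node tries to reopen those files at rank 0's local path (/root/.cache/huggingface/datasets/...), which does not exist there. Below is a minimal sketch of that failure mode, standalone and not part of the course code (the /tmp path is made up for illustration):

```python
# Minimal sketch: a datasets.Dataset backed by memory-mapped Arrow files pickles
# only the *paths* of its cache files, so unpickling on a machine where those
# paths do not exist fails just like the task1 traceback.
import pickle

from datasets import Dataset

ds = Dataset.from_dict({"question": ["q"], "answer": ["SELECT 1"]})
ds.save_to_disk("/tmp/sql_datasets_demo")                    # writes .arrow files
ds_mmap = Dataset.load_from_disk("/tmp/sql_datasets_demo")   # now memory-mapped

blob = pickle.dumps(ds_mmap)  # roughly what dist.broadcast_object_list transmits

# On the node that has /tmp/sql_datasets_demo this succeeds; on a node without
# that path (task1 vs. rank 0's /root/.cache/...), pickle.loads raises
# FileNotFoundError: Failed to open local file '...arrow'.
print(pickle.loads(blob))
```

So the cache that rank 0 writes needs to live at a path that also exists on the other node, which is what pointing HF_HOME at /code below is meant to achieve.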
After entering the two notebooks, first make sure both are in the code directory; if not, run cd code first.
Next, set the environment variables.
Configure the HuggingFace mirror and the network proxy on both servers:
export HF_HOME=/code/huggingface-cache/
export HF_ENDPOINT=https://hf-mirror.com
export http_proxy=http://10.10.9.50:3000
export https_proxy=http://10.10.9.50:3000
export no_proxy=localhost,127.0.0.1
Set the corresponding API key:
export ZHIPUAI_API_KEY=*****
Then, on each of the two servers, enter:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
Then relaunch the training.
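One optional check before relaunching: confirm on both nodes that the exports took effect and that the datasets cache now points under /code (this assumes /code exists on both task0 and task1, as the shell prompts above suggest) rather than the node-local /root/.cache. The snippet below is just a sanity check, not part of the course material:

```python
# Sanity check: run on both nodes in the same shell where the exports were made.
# It prints where the datasets library will place its cache-*.arrow files.
# HF_HOME must be exported before Python imports `datasets`, which is why the
# variables have to be set before launching xtuner.
import os

import datasets

print("HF_HOME            =", os.environ.get("HF_HOME"))
print("datasets cache dir  =", datasets.config.HF_DATASETS_CACHE)
```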
1. What fails for me is the fine-tuning, not the training, and fine-tuning shouldn't require setting environment variables either.
2. Both are already under the code directory, as the log screenshots show.
Host (task0) log:
root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=0 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:43:19,765] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING]
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:43:24,541] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:44:27,005] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,067] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,105] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:44:27,223] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,322] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.323154 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302548912
I1021 18:44:29.323541 159 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302435168
I1021 18:44:29.324101 159 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303696912
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,486] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-21 18:44:29,486] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.486964 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=331055216
I1021 18:44:29.487335 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=330940544
I1021 18:44:29.487955 158 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=332205008
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,523] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.524077 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302394592
I1021 18:44:29.524586 160 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=302282032
I1021 18:44:29.525072 160 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303540880
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:44:29,578] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:44:29.579651 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323494592
I1021 18:44:29.580243 161 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=323381392
I1021 18:44:29.580680 161 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=324643504
I1021 18:44:31.178613 158 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
10/21 18:44:31 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1486578312
GPU 0,1,2,3: Z100SM
CUDA_HOME: /opt/dtk
NVCC: Not Available
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201703
Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
HIP Runtime 5.7.24164
MIOpen 2.15.4
Magma 2.7.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,
TorchVision: 0.16.0
OpenCV: 4.9.0
MMEngine: 0.10.3
Runtime environment:
launcher: pytorch
randomness: {'seed': None, 'deterministic': False}
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
10/21 18:44:31 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 8
batch_size = 4
betas = (
0.9,
0.999,
)
custom_hooks = [
dict(
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.engine.hooks.DatasetInfoHook'),
]
data_path = '/dataset/datasets/sql_datasets'
dataloader_num_workers = 0
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=500,
max_keep_ckpts=2,
type='mmengine.hooks.CheckpointHook'),
logger=dict(
interval=10,
log_metric_by_epoch=False,
type='mmengine.hooks.LoggerHook'),
param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_dataset_length = 16000
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict(
llm=dict(
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
torch_dtype='torch.bfloat16',
trust_remote_code=True,
type='transformers.AutoModelForCausalLM.from_pretrained'),
lora=dict(
bias='none',
lora_alpha=16,
lora_dropout=0.1,
r=64,
task_type='CAUSAL_LM',
type='peft.LoraConfig'),
type='xtuner.model.SupervisedFinetune',
use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
optimizer=dict(
betas=(
0.9,
0.999,
),
lr=0.0002,
type='torch.optim.AdamW',
weight_decay=0),
type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
dict(
begin=0,
by_epoch=True,
convert_to_iter_based=True,
end=0.03,
start_factor=1e-05,
type='mmengine.optim.LinearLR'),
dict(
begin=0.03,
by_epoch=True,
convert_to_iter_based=True,
end=1,
eta_min=0.0,
type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/dataset/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
config=dict(
bf16=dict(enabled=True),
fp16=dict(enabled=False, initial_scale_power=16),
gradient_accumulation_steps='auto',
gradient_clipping='auto',
train_micro_batch_size_per_gpu='auto',
zero_allow_untested_optimizer=True,
zero_force_ds_cpu_optimizer=False,
zero_optimization=dict(
offload_optimizer=dict(device='cpu', pin_memory=True),
offload_param=dict(device='cpu', pin_memory=True),
overlap_comm=True,
stage=3,
stage3_gather_16bit_weights_on_model_save=True)),
exclude_frozen_parameters=True,
gradient_accumulation_steps=8,
gradient_clipping=1,
sequence_parallel_size=1,
train_micro_batch_size_per_gpu=4,
type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
batch_size=4,
collate_fn=dict(
type='xtuner.dataset.collate_fns.default_collate_fn',
use_varlen_attn=False),
dataset=dict(
dataset=dict(
path='/dataset/datasets/sql_datasets',
type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False),
num_workers=0,
sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
dataset=dict(
path='/dataset/datasets/sql_datasets', type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_dataset_length=16000,
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/code/xtuner-workdir'
10/21 18:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
10/21 18:44:31 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DatasetInfoHook
(VERY_LOW ) CheckpointHook
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
before_val:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_val_epoch:
(NORMAL ) IterTimerHook
before_val_iter:
(NORMAL ) IterTimerHook
after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_val:
(VERY_HIGH ) RuntimeInfoHook
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
before_test:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_test_epoch:
(NORMAL ) IterTimerHook
before_test_iter:
(NORMAL ) IterTimerHook
after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test:
(VERY_HIGH ) RuntimeInfoHook
after_run:
(BELOW_NORMAL) LoggerHook
10/21 18:44:31 - mmengine - INFO - xtuner_dataset_timeout = 0:30:00
HF google storage unreachable. Downloading and preparing it from source
Generating train split: 78577 examples [00:00, 159509.50 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 32907.08 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 35358.70 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 44504.11 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:02<00:00, 7307.34 examples/s]
Filter (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 31808.07 examples/s]
Map (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 16000/16000 [00:00<00:00, 30408.74 examples/s]
10/21 18:44:49 - mmengine - WARNING - Dataset Dataset has no metainfo. dataset_meta in visualizer will be None.
I1021 18:44:54.326004 214 ProcessGroupNCCL.cpp:391] [Rank 1] found async exception when checking for NCCL errors: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
I1021 18:44:54.326390 214 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
E1021 18:44:54.326437 214 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1021 18:44:54.326457 214 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E1021 18:44:54.326539 214 ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.13.4
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Call to recv from 10.244.125.225<45902> failed : Connection reset by peer
[2024-10-21 18:44:57,587] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 158 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 160 closing signal SIGTERM
[2024-10-21 18:44:57,588] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 161 closing signal SIGTERM
[2024-10-21 18:44:57,753] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 159) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:44:57
host : odc6af99c5f9401ba2d75273ca4c97bf-task0-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 159)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 159
root@odc6af99c5f9401ba2d75273ca4c97bf-task0-0:/code#
task1 log:
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code# NPROC_PER_NODE=4 NNODES=2 PORT=12345 ADDR=10.244.111.231 NODE_RANK=1 xtuner train llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /code/xtuner-workdir --deepspeed deepspeed_zero3_offload
[2024-10-21 18:46:04,204] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING]
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-21 18:46:09,237] torch.distributed.run: [WARNING] *****************************************
[2024-10-21 18:46:14,015] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,036] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,054] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:46:14,069] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,456] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.457777 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303812352
I1021 18:46:16.458443 152 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=303698768
I1021 18:46:16.458766 152 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=304961568
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,460] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.462054 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309843648
I1021 18:46:16.463582 151 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309729440
I1021 18:46:16.464460 151 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=310991744
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
[2024-10-21 18:46:16,466] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.467456 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307881488
[2024-10-21 18:46:16,467] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1021 18:46:16.468452 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=317008944
I1021 18:46:16.468518 154 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=307768832
I1021 18:46:16.468761 154 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=309028992
I1021 18:46:16.469139 153 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=316894864
I1021 18:46:16.469415 153 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=318156256
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 338, in main
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
runner.train()
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
self._train_loop = self.build_train_loop(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
loop = LOOPS.build(
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in init
dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(dataloader = runner.build_dataloader(
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
dataset = DATASETS.build(dataset_cfg)dataset = DATASETS.build(dataset_cfg)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
return self.build_func(cfg, *args, **kwargs, registry=self) File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
obj = obj_cls(**args) # type: ignoreobj = obj_cls(**args) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
File "/opt/conda/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 314, in process_hf_dataset
dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)dist.broadcast_object_list(objects, src=0)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs) return func(*args, **kwargs)
return func(*args, **kwargs)
return func(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2630, in broadcast_object_list
object_list[i] = _tensor_to_object(obj_view, obj_size) object_list[i] = _tensor_to_object(obj_view, obj_size)
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
object_list[i] = _tensor_to_object(obj_view, obj_size)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2318, in _tensor_to_object
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
return _unpickler(io.BytesIO(buf)).load()
return _unpickler(io.BytesIO(buf)).load() File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 1028, in setstate
table = _memory_mapped_arrow_table_from_file(path)table = _memory_mapped_arrow_table_from_file(path)
table = _memory_mapped_arrow_table_from_file(path) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
opened_stream = _memory_mapped_record_batch_reader_from_file(filename) File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 1066, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/io.pxi", line 1013, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundErrorFileNotFoundError: : FileNotFoundErrorFileNotFoundError[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory: :
[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory[Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
I1021 18:46:36.582319 152 ProcessGroupNCCL.cpp:874] [Rank 5] Destroyed 1communicators on CUDA device 1
I1021 18:46:36.606400 151 ProcessGroupNCCL.cpp:874] [Rank 4] Destroyed 1communicators on CUDA device 0
I1021 18:46:36.653246 153 ProcessGroupNCCL.cpp:874] [Rank 6] Destroyed 1communicators on CUDA device 2
I1021 18:46:36.713783 154 ProcessGroupNCCL.cpp:874] [Rank 7] Destroyed 1communicators on CUDA device 3
[2024-10-21 18:46:44,351] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 151) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 153)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 7 (local_rank: 3)
exitcode : 1 (pid: 154)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-21_18:46:44
host : odc6af99c5f9401ba2d75273ca4c97bf-task1-0.odc6af99c5f9401ba2d75273ca4c97bf.3245c3913f34458289c8b904f12aafd7.svc.cluster.local
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
root@odc6af99c5f9401ba2d75273ca4c97bf-task1-0:/code#
Please take a closer look. Judging from the task1 log, a file seems to be missing: FileNotFoundError: [Errno 2] Failed to open local file '/root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/cache-fcd5a1b9a1040bb8_00000_of_00032.arrow'. Detail: [errno 2] No such file or directory
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
There is also this error.
After looking at this carefully, it is indeed caused by a missing environment variable (HF_HOME was not set). Without HF_HOME, datasets writes its .arrow cache files under the node-local /root/.cache/huggingface on task0, so when rank 0 broadcasts the processed dataset, the ranks on task1 try to memory-map that same path and fail, because the file does not exist on their node.
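A quick way to confirm this (a sketch only; the dataset hash directory is copied from the log above) is to run the same check in both the task0 and task1 containers:

# If the .arrow cache files exist only on task0, the cache directory is
# node-local rather than shared, which matches the FileNotFoundError above.
echo "HF_HOME=${HF_HOME:-<unset>}"
ls /root/.cache/huggingface/datasets/sql_datasets/default/0.0.0/d8db58b2f3e1a837/ | head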
/code is a shared directory on the platform, so HF_HOME should be set to a path under that shared directory, so that all nodes read and write the same dataset cache.
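A minimal sketch of the fix, assuming a hypothetical cache directory /code/huggingface_cache on the shared /code volume; set the variable in every shell on both nodes before launching:

export HF_HOME=/code/huggingface_cache   # hypothetical path on the shared volume
mkdir -p "$HF_HOME"
# then launch the xtuner train command as before on each node; every rank now
# resolves the datasets cache to the same shared location.

After changing HF_HOME, the dataset cache will simply be rebuilt under the new location on the next run.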