4 2 12345 0 llama2_7b_chat_qlora_sql_e3_copy.py
[2024-10-29 17:28:30,395] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING]
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] *****************************************
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] *****************************************
[2024-10-29 17:28:46,254] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,439] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,658] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,740] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:48,765] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:48,872] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:49,097] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:49,127] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-29 17:28:49,127] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:52.772255 136 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321886656
I1029 17:28:52.773535 136 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321773456
I1029 17:28:52.774672 136 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321986672
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:52.879035 135 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315224512
I1029 17:28:52.880101 135 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315110768
I1029 17:28:52.887578 135 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315326624
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:53.103852 137 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299010080
I1029 17:28:53.105060 137 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=298898624
I1029 17:28:53.106081 137 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299110096
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:53.133014 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299187536
I1029 17:28:53.138247 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299074416
I1029 17:28:53.139688 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299285600
rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
RCCL version 2.13.4+hip5.7 HEAD:2890a73
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Setting affinity for GPU 3 to ff000000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Setting affinity for GPU 2 to ff0000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Setting affinity for GPU 0 to ff
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Setting affinity for GPU 1 to ff00
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/04 : 0 3 2 5 4 7 6 1
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/04 : 0 3 2 5 4 7 6 1
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/4/-1->0->-1 [2] 1/-1/-1->0->4 [3] 1/-1/-1->0->4 comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 7[63000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 7[63000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 00 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 02 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 00 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 02 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 00/0 : 3[63000] -> 4[4000] [send] via NET/IB/3 comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 02/0 : 3[63000] -> 4[4000] [send] via NET/IB/3 comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01/0 : 6[43000] -> 1[26000] [receive] via NET/IB/1 comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03/0 : 6[43000] -> 1[26000] [receive] via NET/IB/1 comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01/0 : 2[43000] -> 5[26000] [send] via NET/IB/2 comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03/0 : 2[43000] -> 5[26000] [send] via NET/IB/2 comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01 : 0[4000] -> 3[63000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03 : 0[4000] -> 3[63000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 01 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 03 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Connected all rings comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Connected all rings comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 00 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 02 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Connected all rings comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Connected all rings comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 4[4000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/0 : 4[4000] -> 0[4000] [receive] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 4[4000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/0 : 4[4000] -> 0[4000] [receive] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 0[4000] -> 4[4000] [send] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/0 : 0[4000] -> 4[4000] [send] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 0[4000] -> 4[4000] [send] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/0 : 0[4000] -> 4[4000] [send] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 00 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 02 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 00 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 02 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Connected all trees comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Connected all trees comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Connected all trees comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Using tuning table 0 with LL128 disabled
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Connected all trees comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO comm 0x7f1170001500 rank 3 nranks 8 cudaDev 3 busId 63000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO comm 0x7fa354001500 rank 2 nranks 8 cudaDev 2 busId 43000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO comm 0x7f9bdc001500 rank 1 nranks 8 cudaDev 1 busId 26000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO comm 0x7f5824001500 rank 0 nranks 8 cudaDev 0 busId 4000 localSize 136 used 29536 bytes - Init COMPLETE
I1029 17:28:54.746687 134 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: INFO
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
10/29 17:28:54 - mmengine - INFO -
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1174821729
    GPU 0,1,2,3: Z100SM
    CUDA_HOME: /opt/dtk
    NVCC: Not Available
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 2.1.0
    PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201703
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - HIP Runtime 5.7.24164
  - MIOpen 2.15.4
  - Magma 2.7.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,
    TorchVision: 0.16.0
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Runtime environment:
    launcher: pytorch
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8
------------------------------------------------------------
10/29 17:28:55 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 8
batch_size = 4
betas = (
    0.9,
    0.999,
)
custom_hooks = [
    dict(
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.DatasetInfoHook'),
]
data_path = '/dataset/datasets/sql_datasets'
dataloader_num_workers = 0
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False,
        interval=500,
        max_keep_ckpts=2,
        type='mmengine.hooks.CheckpointHook'),
    logger=dict(
        interval=10,
        log_metric_by_epoch=False,
        type='mmengine.hooks.LoggerHook'),
    param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
    sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
    timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
    'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
    'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_dataset_length = 16000
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict(
    llm=dict(
        pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
        torch_dtype='torch.bfloat16',
        trust_remote_code=True,
        type='transformers.AutoModelForCausalLM.from_pretrained'),
    lora=dict(
        bias='none',
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        task_type='CAUSAL_LM',
        type='peft.LoraConfig'),
    type='xtuner.model.SupervisedFinetune',
    use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        lr=0.0002,
        type='torch.optim.AdamW',
        weight_decay=0),
    type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
    dict(
        begin=0,
        by_epoch=True,
        convert_to_iter_based=True,
        end=0.03,
        start_factor=1e-05,
        type='mmengine.optim.LinearLR'),
    dict(
        begin=0.03,
        by_epoch=True,
        convert_to_iter_based=True,
        end=1,
        eta_min=0.0,
        type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/dataset/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
    config=dict(
        bf16=dict(enabled=True),
        fp16=dict(enabled=False, initial_scale_power=16),
        gradient_accumulation_steps='auto',
        gradient_clipping='auto',
        train_micro_batch_size_per_gpu='auto',
        zero_allow_untested_optimizer=True,
        zero_force_ds_cpu_optimizer=False,
        zero_optimization=dict(
            offload_optimizer=dict(device='cpu', pin_memory=True),
            offload_param=dict(device='cpu', pin_memory=True),
            overlap_comm=True,
            stage=3,
            stage3_gather_16bit_weights_on_model_save=True)),
    exclude_frozen_parameters=True,
    gradient_accumulation_steps=8,
    gradient_clipping=1,
    sequence_parallel_size=1,
    train_micro_batch_size_per_gpu=4,
    type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
    padding_side='right',
    pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
    trust_remote_code=True,
    type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
    batch_size=4,
    collate_fn=dict(
        type='xtuner.dataset.collate_fns.default_collate_fn',
        use_varlen_attn=False),
    dataset=dict(
        dataset=dict(
            path='/dataset/datasets/sql_datasets',
            type='datasets.load_dataset'),
        dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
        max_dataset_length=16000,
        max_length=2048,
        pack_to_max_length=False,
        remove_unused_columns=True,
        shuffle_before_pack=True,
        template_map_fn=dict(
            template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
            type='xtuner.dataset.map_fns.template_map_fn_factory'),
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.dataset.process_hf_dataset',
        use_varlen_attn=False),
    num_workers=0,
    sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
    dataset=dict(
        path='/dataset/datasets/sql_datasets',
        type='datasets.load_dataset'),
    dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
    max_dataset_length=16000,
    max_length=2048,
    pack_to_max_length=False,
    remove_unused_columns=True,
    shuffle_before_pack=True,
    template_map_fn=dict(
        template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
        type='xtuner.dataset.map_fns.template_map_fn_factory'),
    tokenizer=dict(
        padding_side='right',
        pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
        trust_remote_code=True,
        type='transformers.AutoTokenizer.from_pretrained'),
    type='xtuner.dataset.process_hf_dataset',
    use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/userhome/xtuner-workdir-job'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 399, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dataset/CodeLlama-7b-hf/'. Use `repo_type` argument if needed.
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
    runner = RUNNERS.build(cfg)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
    runner = runner_cls.from_cfg(args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 403, in __init__
    self.register_hooks(default_hooks, custom_hooks)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1430, in register_hooks
    self.register_custom_hooks(custom_hooks)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1410, in register_custom_hooks
    self.register_hook(hook)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1310, in register_hook
    hook_obj = HOOKS.build(hook)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/hooks/dataset_info_hook.py", line 24, in __init__
    self.tokenizer = BUILDER.build(tokenizer)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 817, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 649, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 463, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/dataset/CodeLlama-7b-hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

[The same HFValidationError/OSError traceback is printed, interleaved, by each of the other local ranks.]

10/29 17:28:55 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 463, in cached_file raise EnvironmentError( OSError: Incorrect path_or_model_id: '/dataset/CodeLlama-7b-hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub. rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO comm 0x7f9bdc001500 rank 1 nranks 8 cudaDev 1 busId 26000 - Abort COMPLETE I1029 17:28:55.471992 135 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1 rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO comm 0x7f1170001500 rank 3 nranks 8 cudaDev 3 busId 63000 - Abort COMPLETE I1029 17:28:55.479300 137 ProcessGroupNCCL.cpp:874] [Rank 3] Destroyed 1communicators on CUDA device 3 rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO comm 0x7fa354001500 rank 2 nranks 8 cudaDev 2 busId 43000 - Abort COMPLETE I1029 17:28:55.518657 136 ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2 rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO comm 0x7f5824001500 rank 0 nranks 8 cudaDev 0 busId 4000 - Abort COMPLETE I1029 17:28:55.585083 134 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0 [2024-10-29 17:29:01,794] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 134) of binary: /opt/conda/bin/python Traceback (most recent call last): File "/opt/conda/bin/torchrun", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 1 (local_rank: 1) exitcode : 1 (pid: 135) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 2 (local_rank: 2) exitcode : 1 (pid: 136) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 3 (local_rank: 3) exitcode : 1 (pid: 137) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-10-29_17:29:01 host : 
rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 0 (local_rank: 0) exitcode : 1 (pid: 134) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================
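Note on the failure: the traceback originates in the DatasetInfoHook building its tokenizer. transformers only treats pretrained_model_name_or_path as a local checkpoint when that directory actually exists on the node; otherwise it falls back to interpreting the string as a Hub repo id, the leading '/' then fails repo-id validation (the HFValidationError), and cached_file re-raises it as the OSError shown above. A minimal check you could run inside the same container before launching, using the path taken from the config (this helper script is not part of the original job):

# check_model_path.py (hypothetical helper, not from the original run)
import os

from transformers import AutoTokenizer

model_path = "/dataset/CodeLlama-7b-hf/"  # pretrained_model_name_or_path from the config above

# If this prints False, the volume is not mounted at this path on this node,
# and transformers will fall back to treating the string as a Hub repo id.
print("is local dir:", os.path.isdir(model_path))
if os.path.isdir(model_path):
    print("contents:", sorted(os.listdir(model_path)))

# With a valid local folder this loads without contacting the Hub; with a
# missing folder it reproduces the OSError from the log.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(type(tokenizer).__name__, "loaded, vocab size:", tokenizer.vocab_size)

If the directory turns out to be missing only on some nodes, checking the volume mount for each pod/node of the multi-node job would be the next step, since every rank builds the tokenizer from the same path.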