4 2 12345 0 llama2_7b_chat_qlora_sql_e3_copy.py
[2024-10-29 17:28:30,395] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING]
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] *****************************************
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-29 17:28:37,662] torch.distributed.run: [WARNING] *****************************************
[2024-10-29 17:28:46,254] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,439] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,658] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 17:28:46,740] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:48,765] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:48,872] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:49,097] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.10/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
[2024-10-29 17:28:49,127] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-29 17:28:49,127] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:52.772255 136 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321886656
I1029 17:28:52.773535 136 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321773456
I1029 17:28:52.774672 136 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=321986672
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:52.879035 135 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315224512
I1029 17:28:52.880101 135 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315110768
I1029 17:28:52.887578 135 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=315326624
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:53.103852 137 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299010080
I1029 17:28:53.105060 137 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=298898624
I1029 17:28:53.106081 137 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299110096
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 17:28:53.133014 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299187536
I1029 17:28:53.138247 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299074416
I1029 17:28:53.139688 134 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=299285600
rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
RCCL version 2.13.4+hip5.7 HEAD:2890a73
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO Bootstrap : Using eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.244.199.248<0>
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Using network IB
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO rocm_smi_lib: version 2.8.0.0
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Setting affinity for GPU 3 to ff000000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Setting affinity for GPU 2 to ff0000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Setting affinity for GPU 0 to ff
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Setting affinity for GPU 1 to ff00
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/04 : 0 3 2 5 4 7 6 1
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/04 : 0 3 2 5 4 7 6 1
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/4/-1->0->-1 [2] 1/-1/-1->0->4 [3] 1/-1/-1->0->4 comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 7[63000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 7[63000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 00 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 02 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 00 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 02 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 00/0 : 3[63000] -> 4[4000] [send] via NET/IB/3 comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 02/0 : 3[63000] -> 4[4000] [send] via NET/IB/3 comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01/0 : 6[43000] -> 1[26000] [receive] via NET/IB/1 comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03/0 : 6[43000] -> 1[26000] [receive] via NET/IB/1 comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01/0 : 2[43000] -> 5[26000] [send] via NET/IB/2 comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03/0 : 2[43000] -> 5[26000] [send] via NET/IB/2 comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01 : 0[4000] -> 3[63000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03 : 0[4000] -> 3[63000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 01 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 03 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Connected all rings comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Connected all rings comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03 : 2[43000] -> 3[63000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 00 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Channel 02 : 3[63000] -> 2[43000] via SHM/direct/direct comm 0x7f1170001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Connected all rings comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03 : 0[4000] -> 1[26000] via SHM/direct/direct comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Connected all rings comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 01 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 03 : 1[26000] -> 2[43000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 4[4000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/0 : 4[4000] -> 0[4000] [receive] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 4[4000] -> 0[4000] [receive] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/0 : 4[4000] -> 0[4000] [receive] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 00/0 : 0[4000] -> 4[4000] [send] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 01/0 : 0[4000] -> 4[4000] [send] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 02/0 : 0[4000] -> 4[4000] [send] via NET/IB/0 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Channel 03/0 : 0[4000] -> 4[4000] [send] via NET/IB/1 comm 0x7f5824001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 00 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 01 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 02 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Channel 03 : 2[43000] -> 1[26000] via SHM/direct/direct comm 0x7fa354001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 00 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Channel 02 : 1[26000] -> 0[4000] via SHM/direct/direct comm 0x7f9bdc001500 nRanks 08
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO Connected all trees comm 0x7f1170001500 nRanks 08 busId 63000
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO Connected all trees comm 0x7fa354001500 nRanks 08 busId 43000
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Connected all trees comm 0x7f5824001500 nRanks 08 busId 4000
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO Using tuning table 0 with LL128 disabled
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO Connected all trees comm 0x7f9bdc001500 nRanks 08 busId 26000
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/256
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
rb74aec03d2847fb9029435ea69ec942-task0-0:137:218 [3] NCCL INFO comm 0x7f1170001500 rank 3 nranks 8 cudaDev 3 busId 63000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:136:210 [2] NCCL INFO comm 0x7fa354001500 rank 2 nranks 8 cudaDev 2 busId 43000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:135:208 [1] NCCL INFO comm 0x7f9bdc001500 rank 1 nranks 8 cudaDev 1 busId 26000 localSize 136 used 29536 bytes - Init COMPLETE
rb74aec03d2847fb9029435ea69ec942-task0-0:134:203 [0] NCCL INFO comm 0x7f5824001500 rank 0 nranks 8 cudaDev 0 busId 4000 localSize 136 used 29536 bytes - Init COMPLETE
I1029 17:28:54.746687 134 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: INFO
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
/bin/sh: 1: /opt/dtk/bin/nvcc: not found
10/29 17:28:54 - mmengine - INFO -
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1174821729
    GPU 0,1,2,3: Z100SM
    CUDA_HOME: /opt/dtk
    NVCC: Not Available
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 2.1.0
    PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201703
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - HIP Runtime 5.7.24164
  - MIOpen 2.15.4
  - Magma 2.7.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,
    TorchVision: 0.16.0
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Runtime environment:
    launcher: pytorch
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8
------------------------------------------------------------
10/29 17:28:55 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 8
batch_size = 4
betas = (
    0.9,
    0.999,
)
custom_hooks = [
    dict(
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.hooks.DatasetInfoHook'),
]
data_path = '/dataset/datasets/sql_datasets'
dataloader_num_workers = 0
default_hooks = dict(
    checkpoint=dict(
        by_epoch=False,
        interval=500,
        max_keep_ckpts=2,
        type='mmengine.hooks.CheckpointHook'),
    logger=dict(
        interval=10,
        log_metric_by_epoch=False,
        type='mmengine.hooks.LoggerHook'),
    param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
    sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
    timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
    'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
    'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_dataset_length = 16000
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict(
    llm=dict(
        pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
        torch_dtype='torch.bfloat16',
        trust_remote_code=True,
        type='transformers.AutoModelForCausalLM.from_pretrained'),
    lora=dict(
        bias='none',
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        task_type='CAUSAL_LM',
        type='peft.LoraConfig'),
    type='xtuner.model.SupervisedFinetune',
    use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
    optimizer=dict(
        betas=(
            0.9,
            0.999,
        ),
        lr=0.0002,
        type='torch.optim.AdamW',
        weight_decay=0),
    type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
    dict(
        begin=0,
        by_epoch=True,
        convert_to_iter_based=True,
        end=0.03,
        start_factor=1e-05,
        type='mmengine.optim.LinearLR'),
    dict(
        begin=0.03,
        by_epoch=True,
        convert_to_iter_based=True,
        end=1,
        eta_min=0.0,
        type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/dataset/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
    config=dict(
        bf16=dict(enabled=True),
        fp16=dict(enabled=False, initial_scale_power=16),
        gradient_accumulation_steps='auto',
        gradient_clipping='auto',
        train_micro_batch_size_per_gpu='auto',
        zero_allow_untested_optimizer=True,
        zero_force_ds_cpu_optimizer=False,
        zero_optimization=dict(
            offload_optimizer=dict(device='cpu', pin_memory=True),
            offload_param=dict(device='cpu', pin_memory=True),
            overlap_comm=True,
            stage=3,
            stage3_gather_16bit_weights_on_model_save=True)),
    exclude_frozen_parameters=True,
    gradient_accumulation_steps=8,
    gradient_clipping=1,
    sequence_parallel_size=1,
    train_micro_batch_size_per_gpu=4,
    type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
    padding_side='right',
    pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
    trust_remote_code=True,
    type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
    batch_size=4,
    collate_fn=dict(
        type='xtuner.dataset.collate_fns.default_collate_fn',
        use_varlen_attn=False),
    dataset=dict(
        dataset=dict(
            path='/dataset/datasets/sql_datasets',
            type='datasets.load_dataset'),
        dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
        max_dataset_length=16000,
        max_length=2048,
        pack_to_max_length=False,
        remove_unused_columns=True,
        shuffle_before_pack=True,
        template_map_fn=dict(
            template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
            type='xtuner.dataset.map_fns.template_map_fn_factory'),
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.dataset.process_hf_dataset',
        use_varlen_attn=False),
    num_workers=0,
    sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
    dataset=dict(
        path='/dataset/datasets/sql_datasets',
        type='datasets.load_dataset'),
    dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
    max_dataset_length=16000,
    max_length=2048,
    pack_to_max_length=False,
    remove_unused_columns=True,
    shuffle_before_pack=True,
    template_map_fn=dict(
        template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
        type='xtuner.dataset.map_fns.template_map_fn_factory'),
    tokenizer=dict(
        padding_side='right',
        pretrained_model_name_or_path='/dataset/CodeLlama-7b-hf/',
        trust_remote_code=True,
        type='transformers.AutoTokenizer.from_pretrained'),
    type='xtuner.dataset.process_hf_dataset',
    use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/userhome/xtuner-workdir-job'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 399, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dataset/CodeLlama-7b-hf/'. Use `repo_type` argument if needed.
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 335, in main
    runner = RUNNERS.build(cfg)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
    runner = runner_cls.from_cfg(args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 403, in __init__
    self.register_hooks(default_hooks, custom_hooks)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1430, in register_hooks
    self.register_custom_hooks(custom_hooks)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1410, in register_custom_hooks
    self.register_hook(hook)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1310, in register_hook
    hook_obj = HOOKS.build(hook)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/xtuner/engine/hooks/dataset_info_hook.py", line 24, in __init__
    self.tokenizer = BUILDER.build(tokenizer)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 817, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 649, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 463, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/dataset/CodeLlama-7b-hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

[The same HFValidationError/OSError traceback is printed, interleaved, by each of the other local ranks.]

10/29 17:28:55 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 463, in cached_file raise EnvironmentError( OSError: Incorrect path_or_model_id: '/dataset/CodeLlama-7b-hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub. rb74aec03d2847fb9029435ea69ec942-task0-0:135:135 [1] NCCL INFO comm 0x7f9bdc001500 rank 1 nranks 8 cudaDev 1 busId 26000 - Abort COMPLETE I1029 17:28:55.471992 135 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1 rb74aec03d2847fb9029435ea69ec942-task0-0:137:137 [3] NCCL INFO comm 0x7f1170001500 rank 3 nranks 8 cudaDev 3 busId 63000 - Abort COMPLETE I1029 17:28:55.479300 137 ProcessGroupNCCL.cpp:874] [Rank 3] Destroyed 1communicators on CUDA device 3 rb74aec03d2847fb9029435ea69ec942-task0-0:136:136 [2] NCCL INFO comm 0x7fa354001500 rank 2 nranks 8 cudaDev 2 busId 43000 - Abort COMPLETE I1029 17:28:55.518657 136 ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2 rb74aec03d2847fb9029435ea69ec942-task0-0:134:134 [0] NCCL INFO comm 0x7f5824001500 rank 0 nranks 8 cudaDev 0 busId 4000 - Abort COMPLETE I1029 17:28:55.585083 134 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0 [2024-10-29 17:29:01,794] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 134) of binary: /opt/conda/bin/python Traceback (most recent call last): File "/opt/conda/bin/torchrun", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 1 (local_rank: 1) exitcode : 1 (pid: 135) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 2 (local_rank: 2) exitcode : 1 (pid: 136) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-10-29_17:29:01 host : rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 3 (local_rank: 3) exitcode : 1 (pid: 137) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-10-29_17:29:01 host : 
rb74aec03d2847fb9029435ea69ec942-task0-0.rb74aec03d2847fb9029435ea69ec942.f80a0386d006481fbfcf1498f0e7e590.svc.cluster.local rank : 0 (local_rank: 0) exitcode : 1 (pid: 134) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================
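Note on the failure: the traceback originates in the DatasetInfoHook building its tokenizer. transformers only treats pretrained_model_name_or_path as a local checkpoint when that directory actually exists on the node; otherwise it falls back to interpreting the string as a Hub repo id, the leading '/' then fails repo-id validation (the HFValidationError), and cached_file re-raises it as the OSError shown above. A minimal check you could run inside the same container before launching, using the path taken from the config (this helper script is not part of the original job):

# check_model_path.py (hypothetical helper, not from the original run)
import os

from transformers import AutoTokenizer

model_path = "/dataset/CodeLlama-7b-hf/"  # pretrained_model_name_or_path from the config above

# If this prints False, the volume is not mounted at this path on this node,
# and transformers will fall back to treating the string as a Hub repo id.
print("is local dir:", os.path.isdir(model_path))
if os.path.isdir(model_path):
    print("contents:", sorted(os.listdir(model_path)))

# With a valid local folder this loads without contacting the Hub; with a
# missing folder it reproduces the OSError from the log.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(type(tokenizer).__name__, "loaded, vocab size:", tokenizer.vocab_size)

If the directory turns out to be missing only on some nodes, checking the volume mount for each pod/node of the multi-node job would be the next step, since every rank builds the tokenizer from the same path.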