Fine-tuning an open-source large model with xtuner fails #65
Error message:
Config file: llama2_7b_chat_qlora_alpaca_e3_copy.py
[2025-01-21 21:33:24,721] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING]
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] *****************************************
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] *****************************************
[2025-01-21 21:33:41,738] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:41,740] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:41,896] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:42,038] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
    args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
    raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
    args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
    raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
    args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
    raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
    args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
    raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
[2025-01-21 21:33:51,783] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 135 closing signal SIGTERM
[2025-01-21 21:33:51,783] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136 closing signal SIGTERM
[2025-01-21 21:33:51,897] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 133) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Failures:
[1]:
time : 2025-01-21_21:33:51
host : s2532e6f3f3e4980a618a1d21773aa92-task0-0.s2532e6f3f3e4980a618a1d21773aa92.8d8ea7bd9b1841ef9f4dda08efef7ec7.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 134)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2025-01-21_21:33:51
host : s2532e6f3f3e4980a618a1d21773aa92-task0-0.s2532e6f3f3e4980a618a1d21773aa92.8d8ea7bd9b1841ef9f4dda08efef7ec7.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 133)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
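Judging from the traceback, xtuner's train.py first tries to resolve the config argument as a built-in config name (the cfgs_name_path lookup that raises the KeyError) and then falls back to treating it as a file path; the FileNotFoundError means /code/llama2_7b_chat_qlora_alpaca_e3_copy.py does not exist at that path on the node running the job. The usual fix is to copy the built-in config into place first (e.g. with `xtuner copy-cfg llama2_7b_chat_qlora_alpaca_e3 /code/`, assuming the standard xtuner CLI) or to pass the path where the copied config actually lives. Below is a minimal pre-flight check, my own sketch rather than part of xtuner, that fails fast with a clearer message before the distributed launch:

```python
# Pre-flight check (illustrative sketch, not part of xtuner): verify that the
# config path passed to `xtuner train` exists on this node before launching.
import sys
from pathlib import Path

CONFIG = Path("/code/llama2_7b_chat_qlora_alpaca_e3_copy.py")

if not CONFIG.is_file():
    # Exit with a non-zero status and a hint on how the file usually gets there.
    sys.exit(
        f"Config not found: {CONFIG}\n"
        "Copy a built-in config into place first, e.g.\n"
        "  xtuner copy-cfg llama2_7b_chat_qlora_alpaca_e3 /code/\n"
        "or pass the path where the copied config actually lives."
    )

print(f"Config found: {CONFIG.resolve()}")
```

Run the check on the same node (and inside the same container) that torchrun uses; in multi-node setups the config file must be present at the same path on every node.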
Hi, please post requests for help in the llm-course section.