Fine-tuning an open-source LLM with xtuner fails #65

Open
opened 2025-01-21 21:50:27 +08:00 by 11316404784cs · 1 comment

Error message:
4
6
12345
0
llama2_7b_chat_qlora_alpaca_e3_copy.py
[2025-01-21 21:33:24,721] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING]
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] *****************************************
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-21 21:33:30,940] torch.distributed.run: [WARNING] *****************************************
[2025-01-21 21:33:41,738] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:41,740] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:41,896] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 21:33:42,038] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 115, in main
args.config = cfgs_name_path[args.config]
KeyError: '/code/llama2_7b_chat_qlora_alpaca_e3_copy.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 342, in
main()
File "/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py", line 117, in main
raise FileNotFoundError(f'Cannot find {args.config}')
FileNotFoundError: Cannot find /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
[2025-01-21 21:33:51,783] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 135 closing signal SIGTERM
[2025-01-21 21:33:51,783] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136 closing signal SIGTERM
[2025-01-21 21:33:51,897] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 133) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-21_21:33:51
host : s2532e6f3f3e4980a618a1d21773aa92-task0-0.s2532e6f3f3e4980a618a1d21773aa92.8d8ea7bd9b1841ef9f4dda08efef7ec7.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 134)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-21_21:33:51
host : s2532e6f3f3e4980a618a1d21773aa92-task0-0.s2532e6f3f3e4980a618a1d21773aa92.8d8ea7bd9b1841ef9f4dda08efef7ec7.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 133)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
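Both ranks fail the same way: train.py first looks the argument up among xtuner's built-in config names (line 115) and, when that KeyError is caught, falls back to treating it as a file path (line 117); /code/llama2_7b_chat_qlora_alpaca_e3_copy.py resolves to neither, so every worker raises FileNotFoundError and torchrun tears the job down. Below is a minimal sketch of how to verify and repair this, assuming the config was meant to be produced by xtuner's copy-cfg tool (the _copy suffix in the filename suggests this; the /code destination comes from this log, not from xtuner):

# Confirm the path passed to the launcher actually exists on this node
ls -l /code/llama2_7b_chat_qlora_alpaca_e3_copy.py

# List the config names that ship with xtuner
xtuner list-cfg | grep llama2_7b

# Copy the built-in template into /code; copy-cfg appends the _copy suffix,
# yielding /code/llama2_7b_chat_qlora_alpaca_e3_copy.py
xtuner copy-cfg llama2_7b_chat_qlora_alpaca_e3 /code

# Relaunch; however the job is started (torchrun here), the config path
# must resolve on every rank
xtuner train /code/llama2_7b_chat_qlora_alpaca_e3_copy.py

In a multi-node setup the config file must exist at the same path on every node (a shared volume, or copied to each), since each rank opens it independently.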


Hi, please post help requests in the llm-course area.
