Help request: LLM course cohort 04, Function Call practice assignment problem #394
Reference: HswOAuth/llm_course#394
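The assignment is the [Advanced] one: fine-tune a large model on function-call data and test it, in order to improve the model's function-calling ability. See Figure 1 for details.
Problem:
The dataset has finished downloading. Running the following command from a notebook works and reaches the step shown in the log line below:
xtuner train /code/llama3_8b_instruct_qlora_alpaca_e3_copy.py --work-dir /userhome/llama3-8b-ft/function-calling --deepspeed deepspeed_zero3_offload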
11/23 14:51:20 - mmengine - INFO - Iter(train) [ 10/4329] lr: 1.4064e-05 eta: 1 day, 18:51:31 time: 35.7239 data_time: 0.0177 memory: 6909 loss: 0.8403
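However, running the same fine-tuning as a training-management job fails with an error, which is odd. The training job is shown in Figure 2.
The job uses 4 hosts for multi-node, multi-GPU training; see the attached files for details.
Figure 1: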

Figure 2:

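The error logs are attached, 4 files in total.
This happens because the /code directory on the platform is not a shared directory in training-management mode.
In the run.sh script, change the HF_HOME environment variable: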
export HF_HOME=/userhome/huggingface-cache/
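A slightly more defensive version of that change might look like the sketch below. It assumes /userhome is mounted on all four hosts, as the reply implies, and that run.sh is a bash script. The function name use_shared_hf_cache is made up for illustration and is not part of the platform or of Hugging Face tooling.

```shell
#!/usr/bin/env bash
# Sketch for run.sh: point the Hugging Face cache at shared storage
# before training starts.
# Assumption: /userhome is visible from every training node.

# use_shared_hf_cache is a hypothetical helper name used only in this sketch.
use_shared_hf_cache() {
    local cache_dir="$1"
    # Create the cache directory if it does not exist yet.
    mkdir -p "$cache_dir" || return 1
    # Refuse to continue if this node cannot write to it.
    [ -w "$cache_dir" ] || return 1
    export HF_HOME="$cache_dir"
}

# In run.sh, before the training command, one could then call:
#   use_shared_hf_cache /userhome/huggingface-cache/ || exit 1
```

Failing early when the directory is missing or read-only makes a storage problem show up as one clear exit instead of a download error partway through the job.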