In an RLHF fine-tuning task, after training the reward model I moved on to PPO training, and the error says the reward model I just trained is not supported for PPO training. How can I fix this? Is the problem in my reward-model training or in the PPO training? (All training uses Qwen2-7b-instruct as the base model; the actor and critic share that same base model.) Figure 1 is the error message, Figure 2 the reward-model training, Figure 3 the PPO training, and Figure 4 one of the ckpts of the trained reward model. #183
From the traceback, the failure is here.
This needs some debugging to find out why self.supports_rm_adapter is false (a sketch of where that flag comes from is below).
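For context: in TRL, `supports_rm_adapter` on `AutoModelForCausalLMWithValueHead` is only set to True when a `reward_adapter` is passed to `from_pretrained`, and `compute_reward_score` refuses to run otherwise. A quick way to confirm which case you are in (the base model is the one from the issue; the checkpoint path is a placeholder, not the course's actual path):

```python
from trl import AutoModelForCausalLMWithValueHead

# Loaded without a reward adapter: the flag stays False and
# compute_reward_score() will raise the error from the screenshot.
model = AutoModelForCausalLMWithValueHead.from_pretrained("Qwen/Qwen2-7B-Instruct")
print(model.supports_rm_adapter)   # expected: False

# Loaded with a previously trained (PEFT) reward adapter: the flag becomes True.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    reward_adapter="ckpts/reward_model",   # placeholder path to your trained adapter
)
print(model.supports_rm_adapter)   # expected: True
```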
From the screenshot it looks like you are running on the platform. Please share the code (the algorithm) with us and we will debug it to find the cause.
OK, thanks. I have shared it. The algorithm is called RLHF-GLM4, but I switched the base model to Qwen.
In rlhf.py, AutoModelForCausalLMWithValueHead loads the Qwen model directly, which is not right: the reward model you trained earlier is never used and needs to be wired in (see the sketch below).
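A minimal sketch of the missing piece, assuming the reward model was saved as a PEFT adapter and an older-style TRL PPOTrainer API (roughly 0.7.x) is in use; the paths, batch sizes, and score indexing are placeholders and may need adjusting for the course's actual code and TRL version:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model = "Qwen/Qwen2-7B-Instruct"
reward_ckpt = "ckpts/reward_model"   # placeholder: the adapter saved by reward training

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Instead of loading the bare Qwen model, attach the trained reward adapter
# so the PPO stage can actually use it.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    base_model,
    reward_adapter=reward_ckpt,
)

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=8, mini_batch_size=2),
    model=model,
    tokenizer=tokenizer,
)

# Inside the rollout loop: score each query+response with the reward adapter
# and feed the scores to PPO. (Exact shape/indexing of the score tensor can
# differ between TRL versions.)
def reward_for(query_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    input_ids = torch.cat([query_ids, response_ids]).unsqueeze(0)
    scores = ppo_trainer.model.compute_reward_score(input_ids)
    return scores[0, -1].squeeze()   # reward read from the final token position

# rewards = [reward_for(q, r) for q, r in zip(query_tensors, response_tensors)]
# ppo_trainer.step(query_tensors, response_tensors, rewards)
```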
I see now, thank you, teacher. I learned a lot from this.