[Help] 2024-10-20 Error during model fine-tuning with LLaMA-Factory; the console log shows one problem #273

guowenlong · 2024-10-23T12:08:32+08:00

guowenlong commented

2024-10-23 12:08:32 +08:00

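Found one error: Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory (see the attached screenshot).

In the web UI at http://127.0.0.1:7860/ the run is shown as finished, but the loss curve is not rendered. After loading this checkpoint, the model no longer answers questions normally.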

Attachments: image.png (104 KiB), error.log.txt (24 KiB), 预览命令.txt (965 B), image.png (106 KiB), image.png (182 KiB), image.png (166 KiB)

guowenlong commented

2024-10-23 12:19:04 +08:00

Addendum: I tried two AutoDL instances and both report the same error: Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory.
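One way to confirm whether the dynamic loader can see the library at all, independent of LLaMA-Factory, is to try opening it directly. The following is only a minimal sketch: the helper name `can_load` is hypothetical and is not part of LLaMA-Factory or PyTorch; it simply wraps the standard `ctypes.CDLL` call.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file is missing or cannot be opened as a library.
        return False
    return True


if __name__ == "__main__":
    name = "libnvrtc.so.12"
    status = "loadable" if can_load(name) else "NOT found on the loader path"
    print(f"{name}: {status}")
```

If this reports the library as not found inside the same environment that runs the training, the error above is most likely a search-path problem rather than a missing package. On Linux the loader normally reads `LD_LIBRARY_PATH` when the process starts, so the variable would need to be exported in the shell before launching Python, not set from inside the running process.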

guowenlong commented

2024-10-23 12:20:41 +08:00

Instance details:

Attachment: image.png (91 KiB)

12390900721cs commented

2024-10-23 15:21:19 +08:00

Many students have run into this problem. If possible, please upload a video of your steps so it is easier to troubleshoot.

21547230244cs commented

2024-10-23 15:22:21 +08:00

Did the problem occur while running the base model, or after fine-tuning had started?

guowenlong commented

2024-10-23 16:23:20 +08:00

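Q&A with the base model works fine; the problem appears during fine-tuning. I read issue #261 and my situation appears to be the same: I also see Could not load library libnvrtc.so.12. Error: reported during fine-tuning, so the model gives garbled answers after this checkpoint is loaded.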

21547230244cs commented

2024-10-23 18:49:26 +08:00

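Could you record your screen? That would make troubleshooting easier.

Alternatively, follow issue #261 (https://hsw-git.huishiwei.cn/HswOAuth/llm_course/issues/261), where the student has already posted a screen recording.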

HswOAuth commented

2024-10-23 20:33:45 +08:00

Please take a screenshot showing which image version you selected when starting the container.

guowenlong commented

2024-10-23 21:46:46 +08:00

I took screenshots on both instances.

My environment dependencies:

(llama_factory) root@autodl-container-24364abc14-e03cb26c:~/LLaMA-Factory# pip list
Package Version Editable project location

accelerate 0.34.2
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.6.2.post1
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
contourpy 1.3.0
cycler 0.12.1
datasets 2.21.0
dill 0.3.8
docstring_parser 0.16
einops 0.8.0
fastapi 0.115.2
ffmpy 0.4.0
filelock 3.16.1
fire 0.7.0
fonttools 4.54.1
frozenlist 1.4.1
fsspec 2024.6.1
gradio 5.3.0
gradio_client 1.4.2
h11 0.14.0
httpcore 1.0.6
httpx 0.27.2
huggingface-hub 0.26.1
idna 3.10
jieba 0.42.1
Jinja2 3.1.4
joblib 1.4.2
kiwisolver 1.4.7
llamafactory 0.9.1.dev0 /root/LLaMA-Factory
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.9.2
mdurl 0.1.2
modelscope 1.19.0
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.4.2
nltk 3.9.1
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
orjson 3.10.9
packaging 24.1
pandas 2.2.3
peft 0.12.0
pillow 10.4.0
pip 24.2
propcache 0.2.0
protobuf 5.28.2
psutil 6.1.0
pyarrow 17.0.0
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
pyparsing 3.2.0
python-dateutil 2.9.0.post0
python-multipart 0.0.12
pytz 2024.2
PyYAML 6.0.2
regex 2024.9.11
requests 2.32.3
rich 13.9.2
rouge-chinese 1.0.3
ruff 0.7.0
safetensors 0.4.5
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 75.1.0
shellingham 1.5.4
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
sse-starlette 2.1.3
starlette 0.40.0
sympy 1.13.1
termcolor 2.5.0
tiktoken 0.8.0
tokenizers 0.20.1
tomlkit 0.12.0
torch 2.5.0
tqdm 4.66.5
transformers 4.45.0
triton 3.1.0
trl 0.9.6
typer 0.12.5
typing_extensions 4.12.2
tyro 0.8.13
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.0
websockets 12.0
wheel 0.44.0
xxhash 3.5.0
yarl 1.16.0
(llama_factory) root@autodl-container-24364abc14-e03cb26c:~/LLaMA-Factory#

I will make a complete screen recording tomorrow.

Attachments: image.png (75 KiB), image.png (90 KiB), image.png (52 KiB)

guowenlong commented

2024-10-24 10:32:38 +08:00

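Hello, teacher. I have recorded my screen; after fine-tuning, the model still gives garbled answers. I followed #279 and resolved the Could not load library libnvrtc.so.12. Error problem. The command is: export LD_LIBRARY_PATH=/root/miniconda3/envs/llama_factory/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH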
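The path in that command is specific to this conda environment (`llama_factory`, Python 3.11). For other environments, the directory can be looked up instead of typed by hand. The sketch below is an illustration only: `find_library_dirs` is a hypothetical helper, and it assumes the pip-installed `nvidia-cuda-nvrtc-cu12` wheel places `libnvrtc.so.12` somewhere under the active environment's site-packages, as the path above suggests.

```python
import sysconfig
from pathlib import Path


def find_library_dirs(roots, filename="libnvrtc.so.12"):
    """Return the sorted directories under the given roots that contain filename."""
    found = set()
    for root in roots:
        root = Path(root)
        if root.is_dir():
            # Recursively search for the shared library and keep its folder.
            found.update(str(path.parent) for path in root.rglob(filename))
    return sorted(found)


if __name__ == "__main__":
    paths = sysconfig.get_paths()
    # site-packages locations of the interpreter that runs this script
    roots = {paths["purelib"], paths["platlib"]}
    for directory in find_library_dirs(roots):
        # Print a line that can be pasted into the shell before training.
        print(f'export LD_LIBRARY_PATH="{directory}:$LD_LIBRARY_PATH"')
```

Run it with the Python interpreter of the environment used for training, then paste the printed `export` line into the same shell before starting LLaMA-Factory. If nothing is printed, the library is not installed in that environment and the path fix alone would not apply.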

Attachments: part2.mp4 (34 MiB), part1.mp4 (80 MiB)

~~12390900721cs referenced this issue 2024-10-24 18:19:42 +08:00~~

[Help] train_loss of 1985 is too large when fine-tuning a model with LLaMA-Factory #243

11648734137cs commented

2024-10-24 22:09:47 +08:00

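> Hello, teacher. I have recorded my screen; after fine-tuning, the model still gives garbled answers. I followed #279 and resolved the Could not load library libnvrtc.so.12. Error problem. The command is: export LD_LIBRARY_PATH=/root/miniconda3/envs/llama_factory/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH

Try a different model; this is a problem with the GLM4 model. Qwen-7B-Chat works.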

GANGUAGUA commented

2024-10-25 11:55:24 +08:00

Please place LLaMA-Factory under the /root/ directory, i.e. /root/LLaMA-Factory, and then repeat the experiment.

Attachment: image.png (78 KiB)

guowenlong commented

2024-10-25 16:43:10 +08:00

> Please place LLaMA-Factory under the /root/ directory, i.e. /root/LLaMA-Factory, and then repeat the experiment.

All of my earlier experiments already used that path.

guowenlong commented

2024-10-25 16:49:42 +08:00

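I ran the experiments with Qwen-7B-Chat and they completed successfully:
1. Loading the base model and Q&A.
2. Loading the instruction-fine-tuned model and Q&A; the answers are as expected.
3. Loading the model fine-tuned after quantization and Q&A; the answers are as expected.
See the attached images below: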

Attachments: q1.png (120 KiB), q2.png (197 KiB), q3.png (138 KiB), q4.png (145 KiB), q5.png (156 KiB)

guowenlong referenced this issue

2024-10-25 16:57:28 +08:00