Chinese Llama 3 in Practice: An OpenAI-Compatible API #84

12019701659cs · 2024-09-13T19:00:33+08:00

Chinese Llama 3 in Practice: An OpenAI-Compatible API

# Set environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export HF_HOME=/code/huggingface-cache/
export HF_ENDPOINT=https://hf-mirror.com
export http_proxy=http://10.10.9.50:3000
export https_proxy=http://10.10.9.50:3000
export no_proxy=localhost,127.0.0.1
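
Because http_proxy and https_proxy are set, later requests to the local API server skip the proxy only if the host matches no_proxy. The sketch below illustrates that matching under a simplified exact-or-subdomain rule; this rule is an assumption, real HTTP clients differ in the details, and the function name `bypasses_proxy` is hypothetical.

```python
def bypasses_proxy(host, no_proxy):
    """Return True if `host` matches an entry in a comma-separated no_proxy value.

    Simplified rule (an assumption): an entry matches the host itself or any
    subdomain of it. Real clients may also handle ports, CIDR ranges, or "*".
    """
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lstrip(".").lower()
        if not entry:
            continue
        if host == entry or host.endswith("." + entry):
            return True
    return False


NO_PROXY = "localhost,127.0.0.1"  # the value exported above

for candidate in ("localhost", "127.0.0.1", "hf-mirror.com"):
    route = "direct" if bypasses_proxy(candidate, NO_PROXY) else "via proxy"
    print(f"{candidate}: {route}")
```

If the API server is later reached through another address (for example the machine's LAN IP), that address would also need to be listed in no_proxy.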

Download the source code

cd ~/ && wget https://file.huishiwei.top/Chinese-LLaMA-Alpaca-3-3.0.tar.gz
tar -xvf Chinese-LLaMA-Alpaca-3-3.0.tar.gz

Install Miniconda

cd ~/ && wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.5.2-0-Linux-x86_64.sh
bash Miniconda3-py38_23.5.2-0-Linux-x86_64.sh

Create the virtual environment

conda create -n chinese_llama_alpaca_3 python=3.8.17 pip -y

Download the model

conda activate chinese_llama_alpaca_3
pip install modelscope -i https://mirrors.aliyun.com/pypi/simple

Download directly from the command line

modelscope download --model ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3

Download inside a screen session

screen -R model_download
conda activate chinese_llama_alpaca_3
pip install modelscope -i https://mirrors.aliyun.com/pypi/simple
modelscope download --model ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3

If your connection to the server drops during the download, run screen -r model_download to reattach to the session.

Model storage location

After the modelscope download finishes, the model is stored at
~/.cache/modelscope/hub/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3. Normally, listing that directory shows the model files:

ls -alhrt ~/.cache/modelscope/hub/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3
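
The same check can be scripted. The sketch below is a minimal example, assuming a Hugging Face-style checkpoint layout; the file names in `EXPECTED_FILES` and both helper names are assumptions, since the post does not show the actual directory listing.

```python
from pathlib import Path

# Assumption: a Hugging Face-style checkpoint normally contains at least these
# files. The exact listing for this model is not shown in the post.
EXPECTED_FILES = ("config.json", "tokenizer.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]


def has_weights(model_dir):
    """Return True if the directory holds at least one weight file."""
    root = Path(model_dir).expanduser()
    if not root.is_dir():
        return False
    return any(root.glob("*.safetensors")) or any(root.glob("*.bin"))


if __name__ == "__main__":
    model_dir = "~/.cache/modelscope/hub/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3"
    print("missing files:", missing_model_files(model_dir))
    print("weights found:", has_weights(model_dir))
```

An empty "missing files" list together with "weights found: True" suggests the download completed; it does not verify file integrity.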

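Starting the open-source OpenAI-style API

The open-source implementation for Chinese Llama 3 is in /root/Chinese-LLaMA-Alpaca-3-3.0/scripts/oai_api_demo (the /root paths assume the archive was extracted as root, so ~ is /root). The GPU startup procedure is described below; the CPU version is not covered in this post.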

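Fixing a bug in the inference script

Before starting the GPU or CPU version of the OpenAI-style API in streaming mode (the typewriter-like reply effect of OpenAI), one bug has to be fixed. Open /root/Chinese-LLaMA-Alpaca-3-3.0/scripts/oai_api_demo/openai_api_server.py

and change generation_kwargs to the following: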

generation_kwargs = dict(
    streamer=streamer,
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=False,
    max_new_tokens=max_new_tokens,
    repetition_penalty=float(repetition_penalty),
    # Newly added: use the EOS token for padding.
    pad_token_id=tokenizer.eos_token_id,
    # Newly added: stop on the regular EOS token or on Llama 3's <|eot_id|>.
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
)

These parameters make generation stop on Llama 3's own stop token, <|eot_id|>. Without them, the streaming endpoint keeps repeating content and never stops.
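
To see why both ids are needed, here is a toy sketch of a stop-token check. It does not call the model: `generate_until_stop` and the token stream are made up for illustration, and it assumes that tokenizer.eos_token_id resolves to <|end_of_text|> for this checkpoint. The two ids are those of the Llama 3 vocabulary.

```python
# Llama 3 special-token ids.
END_OF_TEXT_ID = 128001  # <|end_of_text|>
EOT_ID = 128009          # <|eot_id|>, emitted at the end of each chat turn


def generate_until_stop(token_stream, stop_ids, max_new_tokens):
    """Collect tokens until a stop id appears or the token budget is used up."""
    output = []
    for token_id in token_stream:
        if len(output) >= max_new_tokens or token_id in stop_ids:
            break
        output.append(token_id)
    return output


# Toy "model output": an answer ended by <|eot_id|>, after which the model
# keeps producing the same turn again (the runaway repetition described above).
answer = [101, 102, 103]
toy_stream = (answer + [EOT_ID]) * 4

only_eos = generate_until_stop(toy_stream, {END_OF_TEXT_ID}, max_new_tokens=10)
both = generate_until_stop(toy_stream, {END_OF_TEXT_ID, EOT_ID}, max_new_tokens=10)

print("stop on <|end_of_text|> only:", only_eos)
print("stop on either token:", both)
```

With only <|end_of_text|> as a stop id, the loop runs until max_new_tokens is exhausted; with <|eot_id|> added, it ends after the first answer.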

GPU version

Back up the original requirements.txt

mv /root/Chinese-LLaMA-Alpaca-3-3.0/requirements.txt /root/Chinese-LLaMA-Alpaca-3-3.0/requirements.bk.txt

Install dependencies

Create a new requirements.txt with the following command:

cat <<EOF > /root/Chinese-LLaMA-Alpaca-3-3.0/requirements.txt
accelerate==0.30.0
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.43.1
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
datasets==2.20.0
deepspeed==0.13.1
dill==0.3.7
dnspython==2.6.1
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
fastapi==0.109.2
fastapi-cli==0.0.3
filelock==3.14.0
frozenlist==1.4.1
fsspec==2023.10.0
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.3
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
modelscope==1.17.1
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.1
ninja==1.11.1.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.1.105
orjson==3.10.3
packaging==24.0
pandas==2.0.3
peft==0.7.1
psutil==5.9.8
py-cpuinfo==9.0.0
pyarrow==16.0.0
pyarrow-hotfix==0.6
pydantic==1.10.11
pydantic_core==2.18.2
Pygments==2.18.0
pynvml==11.5.0
python-dateutil==2.9.0.post0
python-decouple==3.8
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
regex==2024.4.28
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
scikit-learn==1.3.2
scipy==1.10.1
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
sniffio==1.3.1
sse-starlette==2.1.0
starlette==0.36.3
sympy==1.12
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.1.2
tqdm==4.66.4
transformers==4.41.2
triton==2.1.0
typer==0.12.3
typing_extensions==4.11.0
tzdata==2024.1
ujson==5.9.0
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
watchfiles==0.21.0
websockets==12.0
xxhash==3.4.1
yarl==1.9.4
EOF

Then install the dependencies from the file just created:

pip install -r /root/Chinese-LLaMA-Alpaca-3-3.0/requirements.txt -i https://mirrors.aliyun.com/pypi/simple
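
After installation, it can help to confirm that the environment matches the pins. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8); the helper names `parse_pins` and `find_mismatches` are hypothetical.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


def check_requirements_file(path):
    """Read a requirements file and report pins the environment does not meet."""
    with open(path, encoding="utf-8") as handle:
        return find_mismatches(parse_pins(handle.read()))
```

Calling check_requirements_file("/root/Chinese-LLaMA-Alpaca-3-3.0/requirements.txt") inside the activated environment should return an empty dict when everything matches. Builds with a local version suffix (for example a torch wheel reporting 2.1.2+cu121) would be flagged by this strict string comparison.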

Start the service

python /root/Chinese-LLaMA-Alpaca-3-3.0/scripts/oai_api_demo/openai_api_server.py --gpus 0 --base_model ~/.cache/modelscope/hub/ChineseAlpacaGroup/llama-3-chinese-8b-instruct-v3

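Note the 0 after --gpus: it tells the service to run the model on the first GPU. If your machine has only one GPU, always set this parameter to 0, that is, --gpus 0.
Add --load_in_8bit or --load_in_4bit to load the model quantized.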
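
Once the server is up, a quick request from the same machine can confirm that it responds. The sketch below uses only the standard library. The port 19327 and the path /v1/chat/completions are assumptions, so check the startup log for the actual address; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: the server listens on port 19327 and exposes the OpenAI-style
# path /v1/chat/completions. Adjust both to match your deployment.
BASE_URL = "http://localhost:19327"


def build_chat_request(prompt, base_url=BASE_URL, stream=False):
    """Build a POST request carrying a single user message."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL, timeout=120):
    """Send one non-streaming request and return the decoded JSON body."""
    request = build_chat_request(prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, print(chat("Hello")) prints the JSON reply when the service is reachable. Because no_proxy includes localhost, this request should not go through the proxy configured earlier.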

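Testing the result

We test the API with ChatGPTNextWeb (NextChat). If you have not downloaded the client yet, get it here:
Windows: https://github.com/ChatGPTNextWeb/ChatGPT-NextWeb/releases/download/v2.14.2/NextChat_2.14.2_x64-setup.exe
After downloading, run the installer.

In NextChat, configure the model service deployed above. Here the service runs on the platform and NextChat is installed locally; the platform does not map an IP address, so the configuration is shown for illustration only.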
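
When no GUI client can reach the service, the streaming reply can also be checked by parsing the server-sent events directly. The sketch below assumes the stream follows the OpenAI chat-completion chunk format ("data: {...}" lines ending with "data: [DONE]"), which is not confirmed for this server; `iter_stream_text` is a hypothetical helper name and the sample lines are made up.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events, not captured from the real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

The same function accepts the byte lines produced by iterating over an HTTP response object, so it can be applied to a request sent with "stream": true.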

