[Help] Error when running llama2_7b_chat_qlora_sql_e3_copy.py on 2 machines with 8 GPUs #191

Open
opened 2024-10-14 18:29:28 +08:00 by 11578866110cs · 1 comment
  1. Confirm that both machines have run the following commands:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA==mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export HF_HOME=/code/huggingface-cache/
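Because `export` only affects the current shell session, the variables have to be set in the same shell that later runs `xtuner train`, on each machine. A minimal sketch for checking this; the function name `check_nccl_env` is hypothetical and not part of xtuner or NCCL:

```shell
# Report which of the variables listed above are unset or empty in this shell.
# Returns 0 if all are set, 1 otherwise.
check_nccl_env() {
    missing=0
    for name in NCCL_DEBUG NCCL_IB_DISABLE NCCL_IB_HCA \
                NCCL_SOCKET_IFNAME GLOO_SOCKET_IFNAME HF_HOME; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "MISSING $name"
            missing=1
        else
            echo "$name=$value"
        fi
    done
    return "$missing"
}
```

Running `check_nccl_env` on both machines right before launching shows whether any of the six variables was missed.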
  2. Confirm the IP address of the machine acting as master:
root@k3418059a0f845f39bf07ee93d808743-task1-0:/# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 10.244.15.24  netmask 255.255.255.255  broadcast 10.244.15.24
        ether 82:75:57:6c:dc:4a  txqueuelen 0  (Ethernet)
        RX packets 7315  bytes 1858059 (1.8 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5992  bytes 9094025 (9.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 21991  bytes 2127423 (2.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 21991  bytes 2127423 (2.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
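To avoid copying the address by hand, the `inet` value of one interface can be pulled out of this output. A rough sketch, assuming the `ifconfig` output format shown above (interface name followed by a colon, then an indented `inet <address>` line); the helper name `iface_ipv4` is hypothetical:

```shell
# Print the IPv4 address of one interface from ifconfig-style text on stdin.
iface_ipv4() {
    awk -v ifname="$1" '
        # A non-indented line starts a new interface block, e.g. "eth0: flags=..."
        /^[^ \t]/ { current = $1; sub(/:$/, "", current) }
        # Inside the requested block, the "inet" line carries the address.
        current == ifname && $1 == "inet" { print $2; exit }
    '
}
```

On the master machine, `ifconfig | iface_ipv4 eth0` should print the address to use for `ADDR`. With the output posted above, that is 10.244.15.24.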
  3. Confirm the launch commands on both machines:
    The master machine's IP, 10.244.15.24, goes in ADDR=10.244.15.24, and the master must use NODE_RANK=0.
# Run on machine 1
NNODES=2 NPROC_PER_NODE=4 PORT=12345 ADDR=10.244.15.24 NODE_RANK=0 xtuner train /code/llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /userhome/xtuner-workdir --deepspeed deepspeed_zero3
# Run on machine 2:
NNODES=2 NPROC_PER_NODE=4 PORT=12345 ADDR=10.244.15.24 NODE_RANK=1 xtuner train /code/llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /userhome/xtuner-workdir --deepspeed deepspeed_zero3
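The two launch lines must be identical except for `NODE_RANK`, and with `NNODES=2` and `NPROC_PER_NODE=4` they start 2 × 4 = 8 processes in total. One way to rule out copy-paste differences is to generate both lines from a single template. A small sketch; `print_launch_cmd` is a hypothetical helper, and it only uses the variables, paths and flags already shown in the commands above:

```shell
# Print the launch line for one node; only NODE_RANK differs between machines.
# $1: node rank (0 for the master, 1 for the second machine)
# $2: master IP (optional; defaults to the address from the ifconfig output above)
print_launch_cmd() {
    node_rank="$1"
    master_addr="${2:-10.244.15.24}"
    printf 'NNODES=2 NPROC_PER_NODE=4 PORT=12345 ADDR=%s NODE_RANK=%s xtuner train /code/llama2_7b_chat_qlora_sql_e3_copy.py --work-dir /userhome/xtuner-workdir --deepspeed deepspeed_zero3\n' \
        "$master_addr" "$node_rank"
}
```

`print_launch_cmd 0` gives the line for the master and `print_launch_cmd 1` the line for the second machine; compare them with what was actually typed on each machine.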

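If it still fails, please post all four of the following: 1. the IP query result from the master machine; 2. the commands run on both machines; 3. the complete logs from both machines; 4. the contents of the Python file being run.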

Reference: HswOAuth/llm_course#191