[Help Wanted] Function Call Practice (3): Fine-tuning Llama3-8B-Instruct (training stalls midway under memory pressure) #53

Open
opened 2024-12-10 01:31:03 +08:00 by 11175663820cs · 0 comments

The assignment is to fine-tune a model on the function-calling-small dataset. First, download the dataset:

```
huggingface-cli download Deepexi/function-calling-small --repo-type dataset --revision main --local-dir-use-symlinks False --local-dir /code/function-calling-small
```

After downloading, the dataset has three columns: systemPrompt, userPrompt, and assistantResponse. To convert it into a format xtuner can use, the course notes say to reshape it into an instruction/input/output dataset, but experimentation showed that training actually requires a messages column. Looking back at the example dataset, each messages entry is a list of three different roles with their associated content. The script below converts the dataset.


```python
from datasets import load_dataset

# Load the dataset downloaded above
dataset = load_dataset("/code/function-calling-small")

def transform_to_openai_format(example):
    """
    Transform 'systemPrompt', 'userPrompt', and 'assistantResponse'
    into the 'messages' format required by openai_map_fn.
    """
    messages = []

    # Add the system message
    if example["systemPrompt"]:
        messages.append({"role": "system", "content": example["systemPrompt"]})

    # Add the user query
    if example["userPrompt"]:
        messages.append({"role": "user", "content": example["userPrompt"]})

    # Add the assistant's response
    if example["assistantResponse"]:
        messages.append({"role": "assistant", "content": example["assistantResponse"]})

    return {"messages": messages}

# Apply the transformation and drop the original columns
xtuner_dataset = dataset.map(transform_to_openai_format, remove_columns=["systemPrompt", "userPrompt", "assistantResponse"])

# Save the transformed dataset
xtuner_dataset["train"].to_json("/code/function-calling-small-convert/xtuner_train.json")
```

Sample record after conversion:

```
{'messages': [{'content': '你是一个函数筛选助理,如果与问题相关的话,您可以使用下面的函数来获取更多数据以回答用户提出的问题:\n{"function": "CreateExportMigration", "description": "使用CreateExportMigration,新建DataWorks导出任务且仅创建导出任务。", "arguments": [{"name": "ProjectId", "type": "integer", "description": "DataWorks工作空间的ID。您可以登录DataWorks控制台,进入工作空间配置页面获取工作空间ID。"}, {"name": "Name", "type": "string", "description": "导出任务的名称。\n\n名称必须唯一,即当前DataWorks工作空间中不能存在名称重复的导出任务。"}, {"name": "ExportMode", "type": "string", "description": "任务的导出模式,取值如下:\n- FULL:全量导出目标任务。\n- INCREMENTAL:从指定的时间点开始,增量导出目标任务。选择该模式时,需要同时配置IncrementalSince参数。"}, {"name": "IncrementalSince", "type": "integer", "description": "增量导出目标任务的起始时间。\n\n当ExportMode参数配置为INCREMENTAL时,IncrementalSince参数才生效。"}, {"name": "ExportObjectStatus", "type": "string", "description": "导出任务的状态。系统会根据所选状态导出指定状态的任务。取值如下:\n- SAVED:保存状态,即导出已保存的任务。\n- SUBMITTED:提交状态,即导出已提交的任务。\n- DEPLOYED:发布状态,即导出已发布的任务。"}, {"name": "Description", "type": "string", "description": "导出任务的描述信息。"}]}\n\n{"function": "UpdateTicketNum", "description": "对用于免登嵌入报表的指定的ticket进行更新票据数量操作。", "arguments": [{"name": "Ticket", "type": "string", "description": "三方嵌入的票据值,即URL中的accessTicket值。"}, {"name": "TicketNum", "type": "integer", "description": "票据数。\n- 取值范围:1~99998,建议值为1。"}]}\n\n{"function": "CreateSavepoint", "description": "创建快照", "arguments": [{"name": "workspace", "type": "string", "description": "工作空间ID。"}, {"name": "namespace", "type": "string", "description": "项目空间名称。"}, {"name": "body", "type": "object", "description": "触发savepoint参数。"}]}\n\n"\n 请以如下格式回复::\n {\n "function": "function_name",\n "arguments": {\n "argument1": value1,\n "argument2": value2\n }\n }',
'role': 'system'},
{'content': ' "更新免登嵌入报表的票据数量为10的票据值为"abcd1234"。" ', 'role': 'user'},
{'content': '{\n "function": "UpdateTicketNum",\n "arguments": [\n {\n "Ticket": "abcd1234",\n "TicketNum": 10\n }\n ]\n}',
'role': 'assistant'}]}
```
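Before wiring the converted file into the training config, a quick sanity check that the messages column came out as expected (a minimal sketch using the paths from above):

```python
from datasets import load_dataset

# Load the converted JSON Lines file the same way the training config will
check = load_dataset(
    "json",
    data_files={"train": "/code/function-calling-small-convert/xtuner_train.json"},
)

print(check["train"].column_names)               # expect: ['messages']
print(check["train"][0]["messages"][0]["role"])  # expect: 'system'
```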

Modify the dataset loading in the training config:

```python
agent_flan = dict(
    type=process_hf_dataset,
    # dataset=dict(type=load_from_disk, dataset_path=agent_flan_path),
    dataset=dict(type=load_dataset, path="/code/function-calling-small-convert", data_files={
    "train": "xtuner_train.json"
    }),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=openai_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)
```

Test the training script in the model debug environment:

```
NPROC_PER_NODE=4 xtuner train /code/llama3_8b_instruct_qlora_function_calling_small.py --work-dir /userhome/llama3-8b-ft/agent-flan --deepspeed deepspeed_zero3_offload
```

It runs smoothly:

3-1.png

Following the class example, I opened the training management page and launched the job on 6 nodes, then waited for results; the estimated completion time was 1 day 2 hours:

3-2.png
3-3.png

Training errors out after 1010 iterations:

```
E1210 02:10:43.721180  1121 ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
I1210 02:10:43.721319  1121 ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2
E1210 02:10:43.721338  1121 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1210 02:10:43.721356  1121 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
ee1f5a08a17a4c27939b5c125b56deef-task0-0:135:1121 [0] NCCL INFO comm 0x7f4993b43b60 rank 2 nranks 24 cudaDev 2 busId 43000 - Abort COMPLETE
E1210 02:10:43.723716  1120 ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800606 milliseconds before timing out.

```

A search suggests this is likely a timeout. To relax the timeout limit and reduce communication latency, I added the following to run.sh:

```
export NCCL_SOCKET_IFNAME=^lo,docker0
export NCCL_IB_DISABLE=0
export NCCL_P2P_LEVEL=SYS
export NCCL_TIMEOUT=3600
```
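For reference, the 1800000 ms limit in the logs is PyTorch's process-group watchdog timeout rather than an NCCL setting, so NCCL_TIMEOUT by itself may not raise it. A minimal sketch of the documented knob, assuming initialization could be intercepted before xtuner sets up distributed training:

```python
from datetime import timedelta

import torch.distributed as dist

# The watchdog limit seen in the logs (Timeout(ms)=1800000) comes from this
# timeout argument; raising it gives slow collectives more headroom.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```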

Retraining then made it to iteration 300 before a similar error occurred:

```
12/10 05:47:27 - mmengine - INFO - Iter(train) [ 300/2628]  lr: 1.9632e-04  eta: 22:38:53  time: 35.4292  data_time: 0.0181  memory: 6900  loss: 0.0909  tflops: 0.7778  tokens_per_sec: 11.8040
E1210 06:18:45.497431   192 ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=113741, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800131 milliseconds before timing out.
I1210 06:18:45.497562   192 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
```

Inspecting the training log, the loss drops quickly and then holds steady, which might indicate overfitting; but since a full run has never completed, I can't tell. The plan is to save a checkpoint around iteration 150, before the loss first drops below 0.1:

```
2024/12/10 02:58:27 - mmengine - INFO - Iter(train) [  10/2628]  lr: 2.3378e-05  eta: 1 day, 2:38:46  time: 36.6411  data_time: 0.0172  memory: 6313  loss: 0.5707  tflops: 1.4485  tokens_per_sec: 21.8737
2024/12/10 03:04:15 - mmengine - INFO - Iter(train) [  20/2628]  lr: 4.9352e-05  eta: 1 day, 1:53:25  time: 34.8358  data_time: 0.0189  memory: 6773  loss: 0.7009  tflops: 1.4446  tokens_per_sec: 21.8108
2024/12/10 03:09:59 - mmengine - INFO - Iter(train) [  30/2628]  lr: 7.5326e-05  eta: 1 day, 1:28:08  time: 34.3990  data_time: 0.0173  memory: 6760  loss: 0.7877  tflops: 1.2960  tokens_per_sec: 19.5886
2024/12/10 03:15:52 - mmengine - INFO - Iter(train) [  40/2628]  lr: 1.0130e-04  eta: 1 day, 1:21:42  time: 35.2408  data_time: 0.0167  memory: 6406  loss: 0.5262  tflops: 0.7930  tokens_per_sec: 12.0395
2024/12/10 03:21:43 - mmengine - INFO - Iter(train) [  50/2628]  lr: 1.2727e-04  eta: 1 day, 1:15:03  time: 35.1888  data_time: 0.0177  memory: 6323  loss: 0.6106  tflops: 2.4542  tokens_per_sec: 36.7325
2024/12/10 03:27:28 - mmengine - INFO - Iter(train) [  60/2628]  lr: 1.5325e-04  eta: 1 day, 1:03:05  time: 34.4088  data_time: 0.0175  memory: 6887  loss: 0.4257  tflops: 1.2652  tokens_per_sec: 19.1390
2024/12/10 03:33:21 - mmengine - INFO - Iter(train) [  70/2628]  lr: 1.7922e-04  eta: 1 day, 0:58:26  time: 35.3171  data_time: 0.0167  memory: 6907  loss: 0.2403  tflops: 2.8644  tokens_per_sec: 42.7948
2024/12/10 03:39:08 - mmengine - INFO - Iter(train) [  80/2628]  lr: 2.0000e-04  eta: 1 day, 0:50:29  time: 34.7514  data_time: 0.0178  memory: 6888  loss: 0.1763  tflops: 1.3519  tokens_per_sec: 20.4296
2024/12/10 03:44:53 - mmengine - INFO - Iter(train) [  90/2628]  lr: 1.9999e-04  eta: 1 day, 0:41:45  time: 34.4841  data_time: 0.0172  memory: 6907  loss: 0.1485  tflops: 1.6709  tokens_per_sec: 25.1992
2024/12/10 03:50:53 - mmengine - INFO - Iter(train) [ 100/2628]  lr: 1.9997e-04  eta: 1 day, 0:40:03  time: 36.0128  data_time: 0.0173  memory: 6027  loss: 0.1951  tflops: 1.9293  tokens_per_sec: 28.9919
2024/12/10 03:56:40 - mmengine - INFO - Iter(train) [ 110/2628]  lr: 1.9993e-04  eta: 1 day, 0:32:40  time: 34.7252  data_time: 0.0173  memory: 6587  loss: 0.1932  tflops: 1.4727  tokens_per_sec: 22.2321
2024/12/10 04:02:32 - mmengine - INFO - Iter(train) [ 120/2628]  lr: 1.9987e-04  eta: 1 day, 0:27:03  time: 35.1616  data_time: 0.0177  memory: 6830  loss: 0.1311  tflops: 2.8311  tokens_per_sec: 42.1947
2024/12/10 04:08:28 - mmengine - INFO - Iter(train) [ 130/2628]  lr: 1.9980e-04  eta: 1 day, 0:22:40  time: 35.5532  data_time: 0.0171  memory: 6503  loss: 0.2549  tflops: 3.2039  tokens_per_sec: 47.8660
2024/12/10 04:14:25 - mmengine - INFO - Iter(train) [ 140/2628]  lr: 1.9972e-04  eta: 1 day, 0:18:38  time: 35.7463  data_time: 0.0171  memory: 6782  loss: 0.1906  tflops: 1.4399  tokens_per_sec: 21.7591
2024/12/10 04:20:11 - mmengine - INFO - Iter(train) [ 150/2628]  lr: 1.9962e-04  eta: 1 day, 0:11:17  time: 34.6388  data_time: 0.0166  memory: 6466  loss: 0.1710  tflops: 2.1321  tokens_per_sec: 32.0057
2024/12/10 04:25:45 - mmengine - INFO - Iter(train) [ 160/2628]  lr: 1.9950e-04  eta: 1 day, 0:00:49  time: 33.3457  data_time: 0.0181  memory: 6334  loss: 0.0438  tflops: 2.4341  tokens_per_sec: 36.4196
2024/12/10 04:31:32 - mmengine - INFO - Iter(train) [ 170/2628]  lr: 1.9937e-04  eta: 23:54:20  time: 34.7592  data_time: 0.0169  memory: 5750  loss: 0.0409  tflops: 1.6331  tokens_per_sec: 24.6103
2024/12/10 04:37:19 - mmengine - INFO - Iter(train) [ 180/2628]  lr: 1.9923e-04  eta: 23:47:34  time: 34.6040  data_time: 0.0168  memory: 6496  loss: 0.0831  tflops: 2.8580  tokens_per_sec: 42.7016
2024/12/10 04:43:18 - mmengine - INFO - Iter(train) [ 190/2628]  lr: 1.9907e-04  eta: 23:43:52  time: 35.9864  data_time: 0.0164  memory: 5969  loss: 0.1775  tflops: 2.1151  tokens_per_sec: 31.7504
2024/12/10 04:49:16 - mmengine - INFO - Iter(train) [ 200/2628]  lr: 1.9889e-04  eta: 23:39:33  time: 35.7930  data_time: 0.0168  memory: 6208  loss: 0.0859  tflops: 2.0598  tokens_per_sec: 30.9505
2024/12/10 04:55:06 - mmengine - INFO - Iter(train) [ 210/2628]  lr: 1.9870e-04  eta: 23:33:32  time: 34.9886  data_time: 0.0178  memory: 6819  loss: 0.0701  tflops: 1.3770  tokens_per_sec: 20.8340
2024/12/10 05:00:57 - mmengine - INFO - Iter(train) [ 220/2628]  lr: 1.9850e-04  eta: 23:27:46  time: 35.1225  data_time: 0.0173  memory: 6901  loss: 0.0907  tflops: 3.2577  tokens_per_sec: 48.4986
2024/12/10 05:06:44 - mmengine - INFO - Iter(train) [ 230/2628]  lr: 1.9827e-04  eta: 23:21:16  time: 34.6971  data_time: 0.0172  memory: 6713  loss: 0.1388  tflops: 1.5842  tokens_per_sec: 23.8815
2024/12/10 05:12:37 - mmengine - INFO - Iter(train) [ 240/2628]  lr: 1.9804e-04  eta: 23:15:45  time: 35.2600  data_time: 0.0166  memory: 6791  loss: 0.0723  tflops: 0.9565  tokens_per_sec: 14.5092
2024/12/10 05:18:12 - mmengine - INFO - Iter(train) [ 250/2628]  lr: 1.9779e-04  eta: 23:07:28  time: 33.5367  data_time: 0.0163  memory: 5862  loss: 0.0879  tflops: 1.7685  tokens_per_sec: 26.6434
2024/12/10 05:24:14 - mmengine - INFO - Iter(train) [ 260/2628]  lr: 1.9752e-04  eta: 23:03:27  time: 36.2075  data_time: 0.0165  memory: 6623  loss: 0.0272  tflops: 1.4504  tokens_per_sec: 21.8761
2024/12/10 05:30:01 - mmengine - INFO - Iter(train) [ 270/2628]  lr: 1.9724e-04  eta: 22:57:02  time: 34.6540  data_time: 0.0172  memory: 6500  loss: 0.0947  tflops: 1.7753  tokens_per_sec: 26.7083
2024/12/10 05:35:46 - mmengine - INFO - Iter(train) [ 280/2628]  lr: 1.9695e-04  eta: 22:50:26  time: 34.4970  data_time: 0.0165  memory: 6079  loss: 0.1222  tflops: 0.7416  tokens_per_sec: 11.2570
2024/12/10 05:41:33 - mmengine - INFO - Iter(train) [ 290/2628]  lr: 1.9664e-04  eta: 22:44:11  time: 34.7083  data_time: 0.0168  memory: 6196  loss: 0.1291  tflops: 1.7407  tokens_per_sec: 26.2408
2024/12/10 05:47:27 - mmengine - INFO - Iter(train) [ 300/2628]  lr: 1.9632e-04  eta: 22:38:53  time: 35.4292  data_time: 0.0181  memory: 6900  loss: 0.0909  tflops: 0.7778  tokens_per_sec: 11.8040

```

Change the checkpoint save interval in llama3_8b_instruct_qlora_agentflan_3e.py:

```python
# Save
save_steps = 150
```
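For context, in the stock xtuner config templates save_steps is consumed by the CheckpointHook, so this change should make a checkpoint land at iteration 150, before the loss first dips below 0.1. A sketch of the relevant hook, assuming the config follows the standard template (which also defines save_total_limit):

```python
# Sketch (assumption: stock xtuner config layout); save_steps is now 150.
default_hooks = dict(
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,            # save every 150 iterations
        max_keep_ckpts=save_total_limit),
)
```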

Then find the loop controller in the xtuner package: /opt/conda/lib/python3.10/site-packages/xtuner/engine/runner/loop.py

Add NCCL debugging, and add cache clearing inside the training loop:

```python
import os

# Enable NCCL debugging
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
```

```python
# Added to the training-loop class in loop.py
# (requires `import torch` at the top of the file).
def run_iter(self, data_batch):
    """Override run_iter to add memory clearing and debugging."""
    # Clear the GPU cache before starting the iteration
    torch.cuda.empty_cache()

    # Log memory usage every 10 iterations
    if self.runner.iter % 10 == 0:
        print(f"Iteration {self.runner.iter}:")
        print(torch.cuda.memory_summary())

    # Call the parent method to process the iteration
    super().run_iter(data_batch)
```
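A caveat with this patch: calling torch.cuda.empty_cache() on every iteration adds synchronization overhead. The iteration-1010 log line below shows time: 102.0982 and data_time: 68.2270, versus roughly 35 s and 0.02 s before the change, so the added clearing and logging may themselves be slowing each step considerably.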

Then try training again.

Training again reaches iteration 1010, and the error recurs:

```
12/11 08:24:30 - mmengine - INFO - Iter(train) [1010/2628]  lr: 1.4112e-04  eta: 18:40:49  time: 102.0982  data_time: 68.2270  memory: 6693  loss: 0.0225  tflops: 2.1544  tokens_per_sec: 32.3661
kb5b6855e24e4d428aedc6f286628b24-task0-0:134:192 [0] NCCL INFO comm 0x7f330c001500 rank 1 nranks 24 cudaDev 1 busId 26000 - Abort COMPLETE
E1211 08:55:45.695231   192 ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800382 milliseconds before timing out.
I1211 08:55:45.695369   192 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
E1211 08:55:45.695441   192 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1211 08:55:45.695462   192 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E1211 08:55:45.700608   192 ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800382 milliseconds before timing out.
E1211 08:55:45.807060   198 ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800348 milliseconds before timing out.
I1211 08:55:45.807195   198 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
E1211 08:55:45.807214   198 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
kb5b6855e24e4d428aedc6f286628b24-task0-0:133:198 [0] NCCL INFO comm 0x7fd774001500 rank 0 nranks 24 cudaDev 0 busId 4000 - Abort COMPLETE
E1211 08:55:45.807231   198 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E1211 08:55:45.807571   198 ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800348 milliseconds before timing out.
E1211 08:55:46.013504   195 ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800798 milliseconds before timing out.
I1211 08:55:46.013639   195 ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2
E1211 08:55:46.013659   195 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1211 08:55:46.013676   195 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
kb5b6855e24e4d428aedc6f286628b24-task0-0:135:195 [0] NCCL INFO comm 0x7fde3c001500 rank 2 nranks 24 cudaDev 2 busId 43000 - Abort COMPLETE
E1211 08:55:46.014129   195 ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800798 milliseconds before timing out.
E1211 08:55:47.450834   189 ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800780 milliseconds before timing out.
I1211 08:55:47.450953   189 ProcessGroupNCCL.cpp:874] [Rank 3] Destroyed 1communicators on CUDA device 3
E1211 08:55:47.450974   189 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E1211 08:55:47.450995   189 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
kb5b6855e24e4d428aedc6f286628b24-task0-0:136:189 [0] NCCL INFO comm 0x7fc1e8001500 rank 3 nranks 24 cudaDev 3 busId 63000 - Abort COMPLETE
E1211 08:55:47.451445   189 ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1079605, OpType=_ALLGATHER_BASE, NumelIn=2523138, NumelOut=60555312, Timeout(ms)=1800000) ran for 1800780 milliseconds before timing out.
[2024-12-11 08:55:49,817] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 133) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-11_08:55:49
  host      : kb5b6855e24e4d428aedc6f286628b24-task0-0.kb5b6855e24e4d428aedc6f286628b24.f589a03bd5a3490f959fa89692cd2c3b.svc.cluster.local
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 134)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 134
[2]:
  time      : 2024-12-11_08:55:49
  host      : kb5b6855e24e4d428aedc6f286628b24-task0-0.kb5b6855e24e4d428aedc6f286628b24.f589a03bd5a3490f959fa89692cd2c3b.svc.cluster.local
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 135)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 135
[3]:
  time      : 2024-12-11_08:55:49
  host      : kb5b6855e24e4d428aedc6f286628b24-task0-0.kb5b6855e24e4d428aedc6f286628b24.f589a03bd5a3490f959fa89692cd2c3b.svc.cluster.local
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 136)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 136
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-11_08:55:49
  host      : kb5b6855e24e4d428aedc6f286628b24-task0-0.kb5b6855e24e4d428aedc6f286628b24.f589a03bd5a3490f959fa89692cd2c3b.svc.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 133)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 133
============================================================
```
