[Homework] 2025-3-14 AGI2406 cohort: function call, multi-machine multi-GPU (Part 2), fine-tuning with xtuner #86

Open
opened 2025-03-18 11:33:42 +08:00 by 11583226719cs · 0 comments

Debug log from the training run (notes on the throughput and the recurring warnings follow after the log):

03/18 11:04:59 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
03/18 11:04:59 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
03/18 11:04:59 - mmengine - INFO - Checkpoints will be saved to /userhome/llama3-8b-ft/agent-flan.
I0318 11:05:37.564574 1455 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
03/18 11:10:43 - mmengine - INFO - Iter(train) [ 10/25764] lr: 2.3366e-06 eta: 10 days, 6:04:05 time: 34.3964 data_time: 0.0206 memory: 6909 loss: 0.6395 tflops: 2.9581 tokens_per_sec: 44.1279
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/tensor/python_tensor.cpp:85.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/tensor/python_tensor.cpp:85.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/tensor/python_tensor.cpp:85.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/tensor/python_tensor.cpp:85.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
[2025-03-18 11:14:12,695] [WARNING] [stage3.py:1998:step] 5 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/18 11:16:38 - mmengine - INFO - Iter(train) [ 20/25764] lr: 4.9306e-06 eta: 10 days, 10:07:35 time: 35.5581 data_time: 0.0234 memory: 6910 loss: 0.8758 tflops: 2.3413 tokens_per_sec: 35.0201
03/18 11:22:29 - mmengine - INFO - Iter(train) [ 30/25764] lr: 7.5246e-06 eta: 10 days, 10:09:45 time: 35.0333 data_time: 0.0205 memory: 6911 loss: 0.7478 tflops: 1.0568 tokens_per_sec: 16.0082
[2025-03-18 11:23:39,626] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/18 11:28:17 - mmengine - INFO - Iter(train) [ 40/25764] lr: 1.0119e-05 eta: 10 days, 9:49:49 time: 34.8643 data_time: 0.0241 memory: 6909 loss: 1.1097 tflops: 3.2163 tokens_per_sec: 47.8804
[2025-03-18 11:32:40,909] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
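
A few observations on the log. The per-iteration time hovers around 34-36 s at only 1-3 TFLOPS and 16-48 tokens/s, which is why the ETA for the 25764-iteration run lands at roughly 10 days. The quick check below just re-derives that ETA from the numbers printed at iteration 10; every value is taken from the log above, nothing is measured independently. The reported ~6.9 GB of memory per rank is consistent with the parameters being partitioned by ZeRO stage 3, which the stage3.py warnings confirm the run is using.

```python
# Re-derive the ETA printed by mmengine from the numbers logged at iteration 10.
seconds_per_iter = 34.3964        # "time: 34.3964" at Iter(train) [10/25764]
total_iters = 25764
remaining_iters = total_iters - 10

eta_days = remaining_iters * seconds_per_iter / 86400
print(f"estimated remaining time: {eta_days:.2f} days")
# -> about 10.25 days, matching "eta: 10 days, 6:04:05" in the log
```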
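
The repeated UserWarning comes from inside DeepSpeed's ZeRO stage-3 optimizer (stage3.py:1330), not from the training config, and it does not affect the run; it only flags a deprecated torch.cuda.*Tensor constructor. For reference, the sketch below shows the tensor construction that the warning message asks for. It is illustrative only (the actual change would have to happen inside deepspeed itself), and the value and the device fallback are made up so the snippet runs standalone.

```python
import torch

# Hypothetical stand-in for the gradient norm that stage3.py computes; the
# value only exists to make this snippet runnable.
total_norm = 1.2345
device = "cuda" if torch.cuda.is_available() else "cpu"

# Deprecated pattern that triggers the UserWarning in the log:
#   total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
# Construction recommended by the warning message:
total_norm_cuda = torch.tensor([float(total_norm)], dtype=torch.float32, device=device)
print(total_norm_cuda)
```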
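
The [WARNING] [stage3.py:1998:step] messages about allocator cache flushes mean the GPUs are under memory pressure, which is also what drags the throughput down. The warning offers two remedies: reduce memory consumption (for xtuner that would typically mean a smaller max_length or per-device batch size in the config), or flush the allocator cache on all ranks at the same time. The sketch below shows one way the second option could be wired into an mmengine/xtuner run as a custom hook; the class name and interval are hypothetical, and mmengine's built-in EmptyCacheHook may already cover this if your version ships it.

```python
# Sketch of a custom mmengine hook that periodically empties the allocator
# cache on every rank, following the suggestion in the DeepSpeed warning.
# Register it via custom_hooks in the xtuner config if you decide to try it.
from deepspeed.accelerator import get_accelerator
from mmengine.hooks import Hook


class PeriodicEmptyCacheHook(Hook):
    """Flush the CUDA allocator cache every `interval` training iterations."""

    def __init__(self, interval: int = 50):
        self.interval = interval

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        if self.every_n_train_iters(runner, self.interval):
            # Every rank reaches this point at the same iteration, so the
            # caches are flushed together, as the warning recommends.
            get_accelerator().empty_cache()
```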

Reference: HswOAuth/llm_share#86