[Homework] 2025-3-14 AGI2406 cohort, function call, multi-node multi-GPU (Part 2): fine-tuning with xtuner #86
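For context: judging by the checkpoint directory (/userhome/llama3-8b-ft/agent-flan) and the deepspeed/runtime/zero/stage3.py traces in the log below, this is an xtuner fine-tune of Llama-3-8B on the Agent-FLAN dataset with DeepSpeed ZeRO-3. The sketch below lists the kind of config knobs such a run is controlled by; the variable names follow xtuner's stock mmengine-style configs, but every value here is an illustrative assumption, not what was actually used for this log.

```python
# Sketch of the main knobs in an xtuner config for this kind of run.
# Variable names follow xtuner's stock mmengine-style configs; all the
# values below are illustrative assumptions, not the ones behind this log.

# Model and data
pretrained_model_name_or_path = '/userhome/models/Meta-Llama-3-8B-Instruct'  # assumed local path
data_path = 'internlm/Agent-FLAN'        # agent / function-call SFT data

# Sequence packing and batching
max_length = 4096                        # tokens per (packed) sample
pack_to_max_length = True                # pack short dialogues together
batch_size = 1                           # micro batch per GPU
accumulative_counts = 1                  # gradient accumulation steps

# Schedule
max_epochs = 1
lr = 2e-4                                # peak LR; the log below shows warmup still in progress
warmup_ratio = 0.03
max_norm = 1                             # grad clipping (the stage3.py norm code in the warnings)

# Checkpointing (matches the "Checkpoints will be saved to ..." line)
save_steps = 500
save_total_limit = 2
```

With a config like this, each node is typically started with something along the lines of `xtuner train <config>.py --deepspeed deepspeed_zero3`, plus the NNODES / NODE_RANK / ADDR / PORT / NPROC_PER_NODE environment variables for multi-machine training; check the xtuner documentation for your version for the exact launcher variables.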
Debug log from the training run:
03/18 11:04:59 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
03/18 11:04:59 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
03/18 11:04:59 - mmengine - INFO - Checkpoints will be saved to /userhome/llama3-8b-ft/agent-flan.
I0318 11:05:37.564574 1455 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
03/18 11:10:43 - mmengine - INFO - Iter(train) [ 10/25764] lr: 2.3366e-06 eta: 10 days, 6:04:05 time: 34.3964 data_time: 0.0206 memory: 6909 loss: 0.6395 tflops: 2.9581 tokens_per_sec: 44.1279
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@3/torch/csrc/tensor/python_tensor.cpp:85.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
(this UserWarning appears four times in a row, presumably once per local rank)
[2025-03-18 11:14:12,695] [WARNING] [stage3.py:1998:step] 5 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/18 11:16:38 - mmengine - INFO - Iter(train) [ 20/25764] lr: 4.9306e-06 eta: 10 days, 10:07:35 time: 35.5581 data_time: 0.0234 memory: 6910 loss: 0.8758 tflops: 2.3413 tokens_per_sec: 35.0201
03/18 11:22:29 - mmengine - INFO - Iter(train) [ 30/25764] lr: 7.5246e-06 eta: 10 days, 10:09:45 time: 35.0333 data_time: 0.0205 memory: 6911 loss: 0.7478 tflops: 1.0568 tokens_per_sec: 16.0082
[2025-03-18 11:23:39,626] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/18 11:28:17 - mmengine - INFO - Iter(train) [ 40/25764] lr: 1.0119e-05 eta: 10 days, 9:49:49 time: 34.8643 data_time: 0.0241 memory: 6909 loss: 1.1097 tflops: 3.2163 tokens_per_sec: 47.8804
[2025-03-18 11:32:40,909] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
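Two observations on the numbers above. First, throughput: at roughly 35 s per iteration, the 25764 scheduled iterations come to about 9.0e5 s, i.e. a bit over 10 days, which matches the reported eta, so the estimate is not just a startup artifact. Second, the recurring stage3.py warning about "4-5 pytorch allocator cache flushes since last step" means the run is under real memory pressure and is paying for it in speed. The warning's own suggestion is to call get_accelerator().empty_cache() on all ranks inside the training loop; since xtuner drives training through mmengine, one way to follow that advice without touching the loop is a small custom hook. A minimal sketch, assuming mmengine's Hook interface and the DeepSpeed accelerator API (the hook name and interval are made up for illustration):

```python
from deepspeed.accelerator import get_accelerator
from mmengine.hooks import Hook
from mmengine.registry import HOOKS


@HOOKS.register_module()
class FlushAllocatorCacheHook(Hook):
    """Empty the CUDA allocator cache every `interval` iterations on every rank.

    Follows the advice in the DeepSpeed stage3 warning above: flushing on all
    ranks at the same point avoids the unsynchronized, allocator-triggered
    flushes the warning complains about, at the cost of a little time per flush.
    """

    def __init__(self, interval: int = 10):
        self.interval = interval

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        if (batch_idx + 1) % self.interval == 0:
            get_accelerator().empty_cache()
```

It would be enabled from the config with something like `custom_hooks = [dict(type=FlushAllocatorCacheHook, interval=10)]`. mmengine also ships a built-in EmptyCacheHook (by default it empties the cache at epoch boundaries), but with 25k+ iterations per epoch a per-N-iterations flush is closer to what the warning asks for. The more fundamental fix is to reduce memory pressure itself: a shorter max_length, sample packing, or a smaller per-device batch with more gradient accumulation.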