【作业贴】29-25.3.14-function call多机多卡训练（下） #89

New Issue

11583226719cs · 2025-03-22T19:37:20+08:00

11583226719cs commented

2025-03-22 19:37:20 +08:00

训练结果如下：

03/22 04:16:09 - mmengine - INFO - Iter(train) [4250/4296] lr: 6.2743e-08 eta: 0:27:58 time: 35.6854 data_time: 0.0198 memory: 6911 loss: 0.2638 tflops: 3.2195 tokens_per_sec: 47.9285
[2025-03-22 04:19:37,131] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/22 04:22:03 - mmengine - INFO - Iter(train) [4260/4296] lr: 3.8886e-08 eta: 0:21:53 time: 35.4227 data_time: 0.0193 memory: 6911 loss: 0.2344 tflops: 2.9454 tokens_per_sec: 43.8467
03/22 04:27:48 - mmengine - INFO - Iter(train) [4270/4296] lr: 2.0707e-08 eta: 0:15:48 time: 34.4737 data_time: 0.0188 memory: 6911 loss: 0.3555 tflops: 2.7921 tokens_per_sec: 41.7145
[2025-03-22 04:29:00,366] [WARNING] [stage3.py:1998:step] 7 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/22 04:33:55 - mmengine - INFO - Iter(train) [4280/4296] lr: 8.2093e-09 eta: 0:09:43 time: 36.7315 data_time: 0.0190 memory: 6909 loss: 0.3427 tflops: 2.9707 tokens_per_sec: 44.2234
[2025-03-22 04:38:41,017] [WARNING] [stage3.py:1998:step] 8 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
03/22 04:39:55 - mmengine - INFO - Iter(train) [4290/4296] lr: 1.3919e-09 eta: 0:03:38 time: 36.0362 data_time: 0.0185 memory: 6909 loss: 0.2530 tflops: 2.7508 tokens_per_sec: 41.0541
03/22 04:43:26 - mmengine - INFO - after_train_iter in EvaluateChatHook.
03/22 04:49:58 - mmengine - INFO - Sample output:
<|start_header_id|>system<|end_header_id|>

You are a assistant who can utilize external tools.
[{'name': 'ArxivSearch.get_arxiv_article_information','description': 'Run Arxiv search and get the article meta information.','parameters': [{'name': 'query', 'type': 'STRING', 'description':'the content of search query'}],'required': ['query'],'return_data':[{'name': 'content', 'description': 'a list of 3 arxiv search papers', 'type': 'STRING'}],'parameter_description': 'If you call this tool, you must pass arguments in the JSON format{key: value}, where the key is the parameter name.'}]
To use a tool, please use the following format:

Thought:Think what you need to solve, do you need to use tools?
Action:the tool name, should be one of [['ArxivSearch']]
Action Input:the input to the action

The response after utilizing tools should using the following format:

Response:the results after call the tool.

If you already know the answer, or you do not need to use tools,
please using the following format to reply:

Thought:the thought process to get the final answer
Final Answer:final answer

Thought:Sure, I can help you with that. Could you please provide me with more details about the InternLM2 Technical Report you want to search for? Such as the author or any specific keywords related to the report.
Action:ArxivSearch.get_arxiv_article_information
Action Input:query=InternLM2 Technical Report<|eot_id|>

03/22 04:49:58 - mmengine - INFO - Saving checkpoint at 4296 iterations
03/22 04:51:04 - mmengine - INFO - after_train in EvaluateChatHook.
03/22 04:57:42 - mmengine - INFO - Sample output:
<|start_header_id|>system<|end_header_id|>

You are a assistant who can utilize external tools.
[{'name': 'ArxivSearch.get_arxiv_article_information','description': 'Run Arxiv search and get the article meta information.','parameters': [{'name': 'query', 'type': 'STRING', 'description':'the content of search query'}],'required': ['query'],'return_data':[{'name': 'content', 'description': 'a list of 3 arxiv search papers', 'type': 'STRING'}],'parameter_description': 'If you call this tool, you must pass arguments in the JSON format{key: value}, where the key is the parameter name.'}]
To use a tool, please use the following format:

Thought:Think what you need to solve, do you need to use tools?
Action:the tool name, should be one of [['ArxivSearch']]
Action Input:the input to the action

The response after utilizing tools should using the following format:

Response:the results after call the tool.

If you already know the answer, or you do not need to use tools,
please using the following format to reply:

Thought:the thought process to get the final answer
Final Answer:final answer

Thought:Sure, I can help you with that. Could you please provide me with more details about the InternLM2 Technical Report you want to search for? Such as the author or any specific keywords related to the report.
Action:ArxivSearch.get_arxiv_article_information
Action Input:query=InternLM2 Technical Report<|eot_id|>

训练结果如下： <img width="1382" alt="image" src="/attachments/cc0636e6-8b6f-4b1f-906d-f04ad4d16537"> <img width="1390" alt="image" src="/attachments/b1626d78-394b-4421-991f-ddbe37c83466"> 03/22 04:16:09 - mmengine - INFO - Iter(train) [4250/4296] lr: 6.2743e-08 eta: 0:27:58 time: 35.6854 data_time: 0.0198 memory: 6911 loss: 0.2638 tflops: 3.2195 tokens_per_sec: 47.9285 [2025-03-22 04:19:37,131] [WARNING] [stage3.py:1998:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 03/22 04:22:03 - mmengine - INFO - Iter(train) [4260/4296] lr: 3.8886e-08 eta: 0:21:53 time: 35.4227 data_time: 0.0193 memory: 6911 loss: 0.2344 tflops: 2.9454 tokens_per_sec: 43.8467 03/22 04:27:48 - mmengine - INFO - Iter(train) [4270/4296] lr: 2.0707e-08 eta: 0:15:48 time: 34.4737 data_time: 0.0188 memory: 6911 loss: 0.3555 tflops: 2.7921 tokens_per_sec: 41.7145 [2025-03-22 04:29:00,366] [WARNING] [stage3.py:1998:step] 7 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 03/22 04:33:55 - mmengine - INFO - Iter(train) [4280/4296] lr: 8.2093e-09 eta: 0:09:43 time: 36.7315 data_time: 0.0190 memory: 6909 loss: 0.3427 tflops: 2.9707 tokens_per_sec: 44.2234 [2025-03-22 04:38:41,017] [WARNING] [stage3.py:1998:step] 8 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 03/22 04:39:55 - mmengine - INFO - Iter(train) [4290/4296] lr: 1.3919e-09 eta: 0:03:38 time: 36.0362 data_time: 0.0185 memory: 6909 loss: 0.2530 tflops: 2.7508 tokens_per_sec: 41.0541 03/22 04:43:26 - mmengine - INFO - after_train_iter in EvaluateChatHook. 03/22 04:49:58 - mmengine - INFO - Sample output: <|start_header_id|>system<|end_header_id|> You are a assistant who can utilize external tools. [{'name': 'ArxivSearch.get_arxiv_article_information','description': 'Run Arxiv search and get the article meta information.','parameters': [{'name': 'query', 'type': 'STRING', 'description':'the content of search query'}],'required': ['query'],'return_data':[{'name': 'content', 'description': 'a list of 3 arxiv search papers', 'type': 'STRING'}],'parameter_description': 'If you call this tool, you must pass arguments in the JSON format{key: value}, where the key is the parameter name.'}] To use a tool, please use the following format: ``` Thought:Think what you need to solve, do you need to use tools? Action:the tool name, should be one of [['ArxivSearch']] Action Input:the input to the action ``` The response after utilizing tools should using the following format: ``` Response:the results after call the tool. ``` If you already know the answer, or you do not need to use tools, please using the following format to reply: ``` Thought:the thought process to get the final answer Final Answer:final answer ``` Begin!<|eot_id|><|start_header_id|>user<|end_header_id|> Please help me search the InternLM2 Technical Report.<|eot_id|><|start_header_id|>assistant<|end_header_id|> Thought:Sure, I can help you with that. Could you please provide me with more details about the InternLM2 Technical Report you want to search for? Such as the author or any specific keywords related to the report. Action:ArxivSearch.get_arxiv_article_information Action Input:query=InternLM2 Technical Report<|eot_id|> 03/22 04:49:58 - mmengine - INFO - Saving checkpoint at 4296 iterations 03/22 04:51:04 - mmengine - INFO - after_train in EvaluateChatHook. 03/22 04:57:42 - mmengine - INFO - Sample output: <|start_header_id|>system<|end_header_id|> You are a assistant who can utilize external tools. [{'name': 'ArxivSearch.get_arxiv_article_information','description': 'Run Arxiv search and get the article meta information.','parameters': [{'name': 'query', 'type': 'STRING', 'description':'the content of search query'}],'required': ['query'],'return_data':[{'name': 'content', 'description': 'a list of 3 arxiv search papers', 'type': 'STRING'}],'parameter_description': 'If you call this tool, you must pass arguments in the JSON format{key: value}, where the key is the parameter name.'}] To use a tool, please use the following format: ``` Thought:Think what you need to solve, do you need to use tools? Action:the tool name, should be one of [['ArxivSearch']] Action Input:the input to the action ``` The response after utilizing tools should using the following format: ``` Response:the results after call the tool. ``` If you already know the answer, or you do not need to use tools, please using the following format to reply: ``` Thought:the thought process to get the final answer Final Answer:final answer ``` Begin!<|eot_id|><|start_header_id|>user<|end_header_id|> Please help me search the InternLM2 Technical Report.<|eot_id|><|start_header_id|>assistant<|end_header_id|> Thought:Sure, I can help you with that. Could you please provide me with more details about the InternLM2 Technical Report you want to search for? Such as the author or any specific keywords related to the report. Action:ArxivSearch.get_arxiv_article_information Action Input:query=InternLM2 Technical Report<|eot_id|>

image.png

157 KiB

image.png

152 KiB

Sign in to join this conversation.

No Label

No Milestone

No project

No Assignees

1 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: HswOAuth/llm_share#89