[Help Request] LLM Training Series: Prompt Engineering - Multi-Node Multi-GPU Fine-Tuning and fastgpt Model Deployment -- Training Management Error #736
I changed the model and dataset paths in the config; I downloaded both from huggingface myself, since they are not available under /dataset.
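For reference, the downloads can be reproduced with something like the sketch below (assuming huggingface_hub is installed; the repo ids codellama/CodeLlama-7b-hf and b-mc2/sql-create-context are guesses inferred from the local directory names in the config further down, so adjust them if your sources differ):

# Sketch: fetch the base model and dataset into the local cache paths
# that the xtuner config below points at. Repo ids are assumptions
# inferred from the directory names, not confirmed sources.
from huggingface_hub import snapshot_download

# Base model weights -> /code/huggingface-cache/hub/CodeLlama-7b-hf/
snapshot_download(
    repo_id="codellama/CodeLlama-7b-hf",  # assumed upstream repo
    local_dir="/code/huggingface-cache/hub/CodeLlama-7b-hf",
)

# Fine-tuning data -> /code/huggingface-cache/datasets/sql-create-context
snapshot_download(
    repo_id="b-mc2/sql-create-context",  # assumed upstream repo
    repo_type="dataset",
    local_dir="/code/huggingface-cache/datasets/sql-create-context",
)

The training log is as follows: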
2025/05/25 12:26:46 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 50672711
GPU 0,1,2,3: Z100SM
CUDA_HOME: /opt/dtk
NVCC: Not Available
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201703
Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
HIP Runtime 5.7.24164
MIOpen 2.15.4
Magma 2.7.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=1, USE_GLOG=1, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=1, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=ON,
TorchVision: 0.16.0
OpenCV: 4.9.0
MMEngine: 0.10.3
Runtime environment:
launcher: pytorch
randomness: {'seed': None, 'deterministic': False}
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
2025/05/25 12:26:47 - mmengine - INFO - Config:
SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.sql'
accumulative_counts = 16
batch_size = 1
betas = (
0.9,
0.999,
)
custom_hooks = [
dict(
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.engine.hooks.DatasetInfoHook'),
dict(
evaluation_inputs=[
'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
],
every_n_iters=500,
prompt_template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
system='xtuner.utils.SYSTEM_TEMPLATE.sql',
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.engine.hooks.EvaluateChatHook'),
]
data_path = '/code/huggingface-cache/datasets/sql-create-context'
dataloader_num_workers = 0
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=500,
max_keep_ckpts=2,
type='mmengine.hooks.CheckpointHook'),
logger=dict(
interval=10,
log_metric_by_epoch=False,
type='mmengine.hooks.LoggerHook'),
param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [
'CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\nFind the name, latitude, and city of stations with latitude above 50.',
'CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles INTEGER)\n找到mean_visibility_miles最大的zip_code。',
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False)
lr = 0.0002
max_epochs = 3
max_length = 2048
max_norm = 1
model = dict(
llm=dict(
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
quantization_config=dict(
bnb_4bit_compute_dtype='torch.float16',
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
llm_int8_has_fp16_weight=False,
llm_int8_threshold=6.0,
load_in_4bit=True,
load_in_8bit=False,
type='transformers.BitsAndBytesConfig'),
torch_dtype='torch.float16',
trust_remote_code=True,
type='transformers.AutoModelForCausalLM.from_pretrained'),
lora=dict(
bias='none',
lora_alpha=16,
lora_dropout=0.1,
r=64,
task_type='CAUSAL_LM',
type='peft.LoraConfig'),
type='xtuner.model.SupervisedFinetune',
use_varlen_attn=False)
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict(
optimizer=dict(
betas=(
0.9,
0.999,
),
lr=0.0002,
type='torch.optim.AdamW',
weight_decay=0),
type='DeepSpeedOptimWrapper')
pack_to_max_length = False
param_scheduler = [
dict(
begin=0,
by_epoch=True,
convert_to_iter_based=True,
end=0.09,
start_factor=1e-05,
type='mmengine.optim.LinearLR'),
dict(
begin=0.09,
by_epoch=True,
convert_to_iter_based=True,
end=3,
eta_min=0.0,
type='mmengine.optim.CosineAnnealingLR'),
]
pretrained_model_name_or_path = '/code/huggingface-cache/hub/CodeLlama-7b-hf/'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.llama2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
sampler = 'mmengine.dataset.DefaultSampler'
save_steps = 500
save_total_limit = 2
sequence_parallel_size = 1
strategy = dict(
config=dict(
bf16=dict(enabled=True),
fp16=dict(enabled=False, initial_scale_power=16),
gradient_accumulation_steps='auto',
gradient_clipping='auto',
train_micro_batch_size_per_gpu='auto',
zero_allow_untested_optimizer=True,
zero_force_ds_cpu_optimizer=False,
zero_optimization=dict(
offload_optimizer=dict(device='cpu', pin_memory=True),
offload_param=dict(device='cpu', pin_memory=True),
overlap_comm=True,
stage=3,
stage3_gather_16bit_weights_on_model_save=True)),
exclude_frozen_parameters=True,
gradient_accumulation_steps=16,
gradient_clipping=1,
sequence_parallel_size=1,
train_micro_batch_size_per_gpu=1,
type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict(
padding_side='right',
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(max_epochs=3, type='xtuner.engine.runner.TrainLoop')
train_dataloader = dict(
batch_size=1,
collate_fn=dict(
type='xtuner.dataset.collate_fns.default_collate_fn',
use_varlen_attn=False),
dataset=dict(
dataset=dict(
path='/code/huggingface-cache/datasets/sql-create-context',
type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False),
num_workers=0,
sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict(
dataset=dict(
path='/code/huggingface-cache/datasets/sql-create-context',
type='datasets.load_dataset'),
dataset_map_fn='xtuner.dataset.map_fns.sql_map_fn',
max_length=2048,
pack_to_max_length=False,
remove_unused_columns=True,
shuffle_before_pack=True,
template_map_fn=dict(
template='xtuner.utils.PROMPT_TEMPLATE.llama2_chat',
type='xtuner.dataset.map_fns.template_map_fn_factory'),
tokenizer=dict(
padding_side='right',
pretrained_model_name_or_path=
'/code/huggingface-cache/hub/CodeLlama-7b-hf/',
trust_remote_code=True,
type='transformers.AutoTokenizer.from_pretrained'),
type='xtuner.dataset.process_hf_dataset',
use_varlen_attn=False)
use_varlen_attn = False
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = '/code/xtuner-workdir'
2025/05/25 12:26:47 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
2025/05/25 12:26:47 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DatasetInfoHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
before_val:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_val_epoch:
(NORMAL ) IterTimerHook
before_val_iter:
(NORMAL ) IterTimerHook
after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_val:
(VERY_HIGH ) RuntimeInfoHook
(LOW ) EvaluateChatHook
after_train:
(VERY_HIGH ) RuntimeInfoHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook
before_test:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook
before_test_epoch:
(NORMAL ) IterTimerHook
before_test_iter:
(NORMAL ) IterTimerHook
after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test:
(VERY_HIGH ) RuntimeInfoHook
after_run:
(BELOW_NORMAL) LoggerHook
2025/05/25 12:26:47 - mmengine - INFO - xtuner_dataset_timeout = 0:30:00
2025/05/25 12:27:03 - mmengine - WARNING - Dataset Dataset has no metainfo. dataset_meta in visualizer will be None.
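That warning is where the pasted log ends. To rule out path or format problems before launching the distributed run, the dataset and tokenizer paths from the config can be smoke-tested locally with a short sketch (assuming datasets and transformers are installed; it mirrors the datasets.load_dataset and transformers.AutoTokenizer.from_pretrained calls in the config above):

# Sketch: verify the local dataset and tokenizer paths from the config
# load cleanly on a single process before starting distributed training.
from datasets import load_dataset
from transformers import AutoTokenizer

# Same call the config makes via datasets.load_dataset
ds = load_dataset("/code/huggingface-cache/datasets/sql-create-context")
print(ds)  # expect a DatasetDict with question/context/answer columns

# Same call the config makes via transformers.AutoTokenizer.from_pretrained
tok = AutoTokenizer.from_pretrained(
    "/code/huggingface-cache/hub/CodeLlama-7b-hf/",
    trust_remote_code=True,
    padding_side="right",
)
print(tok("SELECT 1"))  # tokenizer round-trip sanity check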