See axolotl config

axolotl version: 0.4.1

base_model: Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: l3.1-8b-dans-instruct
wandb_watch:
wandb_run_id:
wandb_log_model: 

# where to save the finished model to
output_dir: ./l3.1-8b-dans-instruct

# dataset settings (local or huggingface repo)
datasets:
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
    type: sharegpt
    conversation: chatml
  - path: AquaV/Energetic-Materials-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Chemical-Biological-Safety-Applications-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/US-Army-Survival-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Resistance-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Interrogation-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Mathmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Benchmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Codemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Taskmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-ASCIIMaxx-Wordart
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Prosemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Toolmaxx
    type: sharegpt
    conversation: chatml

chat_template: chatml

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

dataset_prepared_path: ./l3.1-8b-dans-instruct-data
val_set_size: 0.03

lora_model_dir: 

sequence_len: 8192

# Use efficient multi-packing with block-diagonal attention and per-sequence position_ids. Recommended: true
sample_packing: true
eval_sample_packing: true

# You can set these packing optimizations after running training at least once.
# The trainer will report recommended values for these settings.

pad_to_sequence_len: true

#rope_scaling:
  #type:  # linear | dynamic
  #factor:  # float (2 for 2x)

adapter: # blank for full finetune
lora_r: 64
lora_alpha: 64
lora_dropout: 0.2
lora_target_linear: True
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_fan_in_fan_out:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0000015
cosine_min_lr_ratio: 

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 15
eval_steps: 25
# save_steps: 100
saves_per_epoch: 3
debug: false
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:


special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|im_end|>

l3.1-8b-dans-instruct

This model is a fine-tuned version of Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML on the datasets listed in the axolotl config above. It achieves the following results on the evaluation set:

Loss: 0.6699

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1.5e-06
train_batch_size: 1
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 32
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 15
num_epochs: 3
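
The hyperparameters above can be related to one another with a little arithmetic. The sketch below, in Python, shows how the total train batch size follows from the micro batch size and gradient accumulation, and how a cosine schedule with linear warmup would shape the learning rate. Several things here are assumptions rather than facts from the run: the single-device count is inferred from 1 × 32 = 32, the total step count of 731 is an estimate extrapolated from the last logged row (step 725 at epoch 2.9751), and the schedule formula is the common warmup-plus-cosine form decaying to zero, which may not match the trainer's implementation exactly.

```python
import math

def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_devices

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay to zero (assumed form)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Values taken from the config and hyperparameter list
batch = effective_batch_size(micro_batch_size=1, gradient_accumulation_steps=32)
print(batch)  # 32, matching total_train_batch_size

# With sequence_len 8192 and packing, this is an upper bound on tokens per step
print(batch * 8192)  # 262144

# ASSUMPTION: 731 total steps, estimated from the training results table
TOTAL_STEPS_ASSUMED = 731
for step in (0, 15, 250, 500, 725):
    print(step, cosine_lr(step, 1.5e-6, 15, TOTAL_STEPS_ASSUMED))
```

Under these assumptions the learning rate peaks at 1.5e-06 after the 15 warmup steps and is close to zero by the final logged step.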

Training results

Training Loss	Epoch	Step	Validation Loss
0.9964	0.0041	1	1.0348
0.8433	0.1025	25	0.8220
0.7916	0.2049	50	0.7465
0.7381	0.3074	75	0.7152
0.6802	0.4098	100	0.7005
0.7764	0.5123	125	0.6917
0.6518	0.6148	150	0.6871
0.6864	0.7172	175	0.6831
0.7217	0.8197	200	0.6803
0.7072	0.9221	225	0.6781
0.6953	1.0287	250	0.6764
0.8013	1.1313	275	0.6752
0.6296	1.2338	300	0.6738
0.7553	1.3364	325	0.6729
0.6749	1.4390	350	0.6722
0.6619	1.5415	375	0.6715
0.6527	1.6441	400	0.6712
0.7654	1.7467	425	0.6707
0.7256	1.8492	450	0.6705
0.6921	1.9518	475	0.6701
0.6982	2.0523	500	0.6701
0.6997	2.1548	525	0.6701
0.6563	2.2574	550	0.6700
0.6564	2.3599	575	0.6699
0.6248	2.4624	600	0.6699
0.6893	2.5650	625	0.6699
0.6633	2.6675	650	0.6698
0.7045	2.7701	675	0.6698
0.7784	2.8726	700	0.6698
0.7798	2.9751	725	0.6699
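
One way to read the table above is to compare validation loss at a few checkpoints. The following Python sketch copies four rows from the table and computes the relative improvement between them, along with a perplexity figure. It assumes the reported loss is a mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in the card; the helper names are hypothetical.

```python
import math

# (step, epoch, validation_loss) copied from selected rows of the table
checkpoints = [
    (1, 0.0041, 1.0348),
    (250, 1.0287, 0.6764),
    (500, 2.0523, 0.6701),
    (725, 2.9751, 0.6699),
]

def relative_drop(start_loss, end_loss):
    """Fractional reduction in loss between two checkpoints."""
    return (start_loss - end_loss) / start_loss

def perplexity(loss):
    """exp(loss), ASSUMING loss is mean per-token cross-entropy in nats."""
    return math.exp(loss)

for (s0, _, l0), (s1, _, l1) in zip(checkpoints, checkpoints[1:]):
    print(f"steps {s0}->{s1}: loss {l0:.4f}->{l1:.4f}, "
          f"drop {relative_drop(l0, l1):.2%}, perplexity {perplexity(l1):.3f}")

# Rough steps-per-epoch estimate from the last logged row
last_step, last_epoch, _ = checkpoints[-1]
print(f"approx. steps per epoch: {last_step / last_epoch:.1f}")
```

Read this way, most of the improvement in validation loss happens during the first epoch, and the second and third epochs change it only in the fourth decimal place.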

Framework versions

Transformers 4.45.0.dev0
Pytorch 2.4.0+cu121
Datasets 2.21.0
Tokenizers 0.19.1

Dans-DiscountModels/Dans-Instruct-Mix-8b-ChatML-V0.0.2
