This is incredible

#3
by Overbite1741 - opened

It's truly unbelievable how training on such a small and diverse dataset could give such a good model. I think this deserves a deeper look at why this dataset mixture surpassed hundreds of other finetunes.

I am working on reproducing this model and then running some ablation experiments. Could you share the axolotl config or more details about the training? Also, did you start from base Mistral or some other finetune?

Thanks for your interest! I'm happy to share how I produced it - I'd love to get to the bottom of what made it work so well.

Here's the axolotl config I used:

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: chargoddard/PIPPA-Judged
    name: adequately_rated
    type: pippa
  - path: chargoddard/rpguild
    name: pruned
    type: rp_forum
    shards: 20
  - path: pankajmathur/orca_mini_v1_dataset
    type: orca_mini
    shards: 10
  - path: chargoddard/summarize_from_feedback_alpaca
    type: alpaca
    shards: 20
  - path: json
    data_files: /workspace/limaerp-8192.jsonl
    type: rp_forum
prompt_format: rpinstruct
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./mistral-rp-out
save_safetensors: true

adapter: lora
lora_model_dir:

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

total_num_tokens: 30637024
sample_packing_eff_est: 0.98

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: mistral-rp
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
eval_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_steps: 10
eval_steps: 0.05
save_steps: 0.05
weight_decay: 0.0
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
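
One note on outputs: with adapter: lora the run saves a LoRA adapter to ./mistral-rp-out rather than full weights, so it has to be merged into the base model before evaluation or upload. Roughly like this (standard peft/transformers calls, not necessarily the exact script I used; output paths are placeholders):

# Merge the trained LoRA adapter into the base model (sketch; paths are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./mistral-rp-out")  # adapter from the run above
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights

merged.save_pretrained("./mistral-rp-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1").save_pretrained("./mistral-rp-merged")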

It does look like I goofed a bit on the dataset split table - the proportion of summarize_from_feedback used is even lower than listed.
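
If you want to recompute the proportions yourself: as far as I recall, axolotl's shards field just splits a dataset into N pieces and keeps one of them, so shards: 20 works out to roughly 1/20 of the rows. A rough sketch with the datasets library (row counts only, not tokens; I'm assuming the name field maps to a dataset config and that everything has a plain train split, and the local limaerp-8192.jsonl is left out):

# Approximate effective row counts per dataset, assuming `shards: N` keeps a
# single 1/N shard (equivalent to datasets.Dataset.shard with index 0).
from datasets import load_dataset

mixture = [
    ("chargoddard/PIPPA-Judged", "adequately_rated", 1),
    ("chargoddard/rpguild", "pruned", 20),
    ("pankajmathur/orca_mini_v1_dataset", None, 10),
    ("chargoddard/summarize_from_feedback_alpaca", None, 20),
]

for path, name, shards in mixture:
    ds = load_dataset(path, name=name, split="train")
    kept = ds.shard(num_shards=shards, index=0) if shards > 1 else ds
    print(f"{path}: {len(kept)} of {len(ds)} rows (~1/{shards})")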

The limaerp-8192.jsonl mentioned is just a lightly preprocessed version of lemonilia's LimaRP dataset. I would upload it to huggingface but it's, uh, way too spicy for my tastes. You can download it here: https://files.catbox.moe/jj9srp.jsonl
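
If you just want to peek at it before wiring it into the config, it's a plain JSON-lines file; something like this will pull it down and show the record keys (no assumptions about the schema):

# Download the jsonl and print the keys of the first few records.
import json
import urllib.request

url = "https://files.catbox.moe/jj9srp.jsonl"
path = "limaerp-8192.jsonl"
urllib.request.urlretrieve(url, path)

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(i, sorted(json.loads(line).keys()))
        if i >= 2:
            break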

I used my fork of axolotl with custom prompt handling. Specifically, this commit was used to train the model. The way train_on_inputs, labels, and EOS tokens are handled is different, so it won't reproduce exactly on mainline axolotl. I could probably throw a pre-tokenized version of the dataset up if that's useful, though.
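
To give a concrete idea of the kind of difference I mean: with train_on_inputs: false the loss should only cover the response tokens, which is typically done by setting the prompt positions in labels to -100, and my fork also differs in where EOS tokens get appended. A generic sketch of that masking (this is not the fork's actual tokenization code, just the general idea):

# Generic illustration of train_on_inputs: false - mask prompt tokens with -100
# so the loss only covers the response, and append EOS after the response.
# NOT the fork's actual code, just the idea.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def build_example(prompt: str, response: str):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = [tokenizer.bos_token_id] + prompt_ids + response_ids + [tokenizer.eos_token_id]
    # -100 is ignored by the cross-entropy loss, so only the response and the
    # final EOS contribute to training.
    labels = [-100] * (1 + len(prompt_ids)) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}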

Thanks! I'll post the results of my ablation experiments here once I run them. Also, I am using the feature/rp branch of https://github.com/cg123/rathe/. Is that correct?

I am not able to reproduce the results. I checked out the commit you mentioned, installed rathe from the given branch, and used the same axolotl config. The evaluation code is the same for both your model and the reproduced one.
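
For reference, this is a sketch of how these three tasks are typically run with lm-evaluation-harness using the Open LLM Leaderboard few-shot settings (the model path is a placeholder, and this may not match the exact harness version we used):

# Rough sketch of the evaluation; task names follow lm-eval 0.4
# (older versions use e.g. truthfulqa_mc instead of truthfulqa_mc2).
from lm_eval.evaluator import simple_evaluate

task_settings = {"arc_challenge": 25, "truthfulqa_mc2": 0, "winogrande": 5}

for task, num_fewshot in task_settings.items():
    results = simple_evaluate(
        model="hf",
        model_args="pretrained=./mistral-rp-merged,dtype=bfloat16",
        tasks=[task],
        num_fewshot=num_fewshot,
        batch_size=8,
    )
    print(task, results["results"][task])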

Your model benchmarks:

ARC: 66.72
TruthfulQA: 59.86
Winogrande: 79.16

Reproduced model benchmarks:

ARC: 61.00
TruthfulQA: 43.00
Winogrande: 78.53

