metadata
base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_projector_experiment
    results: []

Summary

Distilled with the Distily library from the teacher model gpt2 on the wikimedia/wikipedia dataset.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
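As a rough sanity check on the size figure above, the sketch below estimates the memory taken by the parameters alone from the parameter count and dtype. The helper `param_bytes` and the bytes-per-element table are illustrative and not part of Distily. Buffers and any other tensors are deliberately ignored, so the result only approximates the reported 0.24 GB.

```python
# Parameter-only memory estimate (assumption: buffers and other tensors are not counted).
BYTES_PER_ELEMENT = {
    "torch.bfloat16": 2,
    "torch.float16": 2,
    "torch.float32": 4,
}


def param_bytes(num_params: int, dtype: str) -> int:
    """Return the number of bytes needed to store num_params values of dtype."""
    return num_params * BYTES_PER_ELEMENT[dtype]


n_bytes = param_bytes(124_439_808, "torch.bfloat16")
print(f"{n_bytes / 1e9:.3f} GB (decimal), {n_bytes / 2**30:.3f} GiB (binary)")
```

Both unit conventions land close to the reported figure, which is consistent with two-byte weights.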

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 30.7740 | 30.0939 | 83.073 | 10.401 | 4060086272.0 | 71468255805440.0 |
| 2500 | 0.0404 | 1192.0 | 11840.0 | 9.8250 | 30.1508 | 82.916 | 10.381 | 772.0 | 15040.0 |
| 5000 | 0.0808 | 412.0 | 2240.0 | 8.3978 | 30.1808 | 82.834 | 10.371 | 290.0 | 438.0 |
| 7500 | 0.1212 | 245.0 | 908.0 | 7.6620 | 30.1603 | 82.891 | 10.378 | 219.0 | 198.0 |
| 10000 | 0.1616 | 182.0 | 672.0 | 7.2415 | 30.2587 | 82.621 | 10.344 | 165.0 | 204.0 |
| 12500 | 0.2020 | 132.0 | 504.0 | 6.6895 | 30.1682 | 82.869 | 10.375 | 115.0 | 155.0 |
| 15000 | 0.2424 | 113.0 | 436.0 | 6.4127 | 30.186 | 82.82 | 10.369 | 89.5 | 137.0 |
| 17500 | 0.2828 | 92.5 | 340.0 | 6.1945 | 30.108 | 83.035 | 10.396 | 71.0 | 132.0 |
| 20000 | 0.3232 | 74.0 | 278.0 | 5.9293 | 30.1455 | 82.931 | 10.383 | 63.25 | 134.0 |
| 22500 | 0.3636 | 66.0 | 215.0 | 5.6606 | 30.0869 | 83.093 | 10.403 | 50.5 | 81.5 |
| 25000 | 0.4040 | 63.25 | 189.0 | 5.5592 | 30.1385 | 82.95 | 10.385 | 44.0 | 72.5 |
| 27500 | 0.4444 | 59.0 | 202.0 | 5.4963 | 30.1334 | 82.964 | 10.387 | 40.5 | 79.0 |
| 30000 | 0.4848 | 59.75 | 198.0 | 5.4789 | 30.1924 | 82.802 | 10.367 | 42.25 | 63.75 |
| 32500 | 0.5253 | 58.75 | 177.0 | 5.4552 | 30.1133 | 83.02 | 10.394 | 40.25 | 56.5 |
| 35000 | 0.5657 | 57.5 | 167.0 | 5.3773 | 30.1179 | 83.007 | 10.393 | 36.0 | 51.0 |
| 37500 | 0.6061 | 57.5 | 161.0 | 5.3443 | 30.1249 | 82.988 | 10.39 | 37.75 | 53.25 |
| 40000 | 0.6465 | 54.5 | 159.0 | 5.3258 | 30.1211 | 82.998 | 10.391 | 34.25 | 59.0 |
| 42500 | 0.6869 | 55.25 | 150.0 | 5.2937 | 30.1886 | 82.813 | 10.368 | 35.75 | 50.75 |
| 45000 | 0.7273 | 50.5 | 132.0 | 5.1564 | 30.1176 | 83.008 | 10.393 | 30.125 | 42.75 |
| 47500 | 0.7677 | 50.75 | 123.0 | 5.1254 | 30.0774 | 83.119 | 10.406 | 29.375 | 37.5 |
| 50000 | 0.8081 | 50.0 | 123.5 | 5.1100 | 30.1068 | 83.038 | 10.396 | 28.75 | 39.0 |
| 52500 | 0.8485 | 49.0 | 120.0 | 5.0958 | 30.1022 | 83.05 | 10.398 | 29.125 | 35.0 |
| 55000 | 0.8889 | 48.75 | 117.5 | 5.0753 | 30.968 | 80.728 | 10.107 | 28.125 | 35.75 |
| 57500 | 0.9293 | 48.25 | 117.0 | 5.0696 | 30.0872 | 83.092 | 10.403 | 28.0 | 33.25 |
| 60000 | 0.9697 | 48.25 | 117.0 | 5.0655 | 30.1265 | 82.983 | 10.39 | 28.0 | 33.0 |
| 61875 | 1.0 | 48.25 | 117.0 | 5.0651 | 30.1098 | 83.03 | 10.395 | 28.0 | 33.25 |
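To read the final row against the teacher row, one option is to express each student perplexity as a multiple of the teacher's. The sketch below does that with the values copied from the table; `ppl_ratio` is an illustrative helper, not a Distily function.

```python
# Perplexities copied from the "teacher eval" row and the final row (step 61875).
teacher = {"enwikippl": 43.25, "frwikippl": 61.25, "tinystoriesppl": 11.6875, "zhwikippl": 19.125}
student_final = {"enwikippl": 48.25, "frwikippl": 117.0, "tinystoriesppl": 28.0, "zhwikippl": 33.25}


def ppl_ratio(student: dict, teacher: dict) -> dict:
    """Student perplexity divided by teacher perplexity, per evaluation set."""
    return {name: student[name] / teacher[name] for name in teacher}


ratios = ppl_ratio(student_final, teacher)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

The smallest gap is on enwikippl, which matches the English Wikipedia subset used for training.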

Resource Usage Comparison

  • VRAM Use: 7.7843 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
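The sample count and the final step in the metrics table can be cross-checked with a little arithmetic. The sketch below uses the `dataset_sample_size`, `dataset_test_size`, `train_batch_size`, and `gradient_accumulation_steps` values listed under Hyperparameters. It assumes a single device and that the held-out test fraction is removed from the sampled rows, neither of which the card states explicitly.

```python
import math

# Values taken from this card.
dataset_sample_size = 250_000
dataset_test_size = 0.01
train_batch_size = 4
gradient_accumulation_steps = 1
train_tokens = 145_744_973

# Assumption: the test split is carved out of the sampled rows.
eval_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - eval_samples

# Assumption: one device, so the effective batch is batch size times accumulation steps.
effective_batch = train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)

tokens_per_sample = train_tokens / train_samples

print(train_samples, steps_per_epoch, round(tokens_per_sample, 2))
```

Under these assumptions the numbers agree with the card: 247,500 training samples and 61,875 optimizer steps for one epoch.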

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=cos, layer_mapper=layer-2))
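The objective above combines a KL term on logits (weight 1) with a cosine term on attention features (weight 10.0). The following is a minimal pure-Python sketch of that weighted sum for a single token and a single attention vector. It is not Distily's implementation: the KL direction (teacher relative to student), the use of `1 - cosine similarity` for `cos`, the lack of any temperature or batch reduction, and the omission of `layer_mapper=layer-2` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one vocabulary distribution (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors (assumed meaning of loss_fn=cos)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def distillation_loss(t_logits, s_logits, t_attn, s_attn,
                      logits_weight=1.0, attn_weight=10.0):
    """Weighted sum of the two components, using the weights from the card."""
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + attn_weight * cosine_distance(t_attn, s_attn))


# Toy inputs: matching logits, orthogonal attention vectors.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 0.0], [0.0, 1.0]))
```

With the 10x weight, a mismatch in attention features dominates the total unless the logit distributions are far apart.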

Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=cos, layer_mapper=layer-2))
  • train_embeddings: True
  • lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7f010c102dd0>
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
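With `lr_scheduler_type: linear` and a warmup ratio of 0.5, the learning rate would rise for the first half of training and decay for the second half. The sketch below reproduces the usual linear warmup and linear decay shape in plain Python. The function name `linear_schedule_lr` is hypothetical, the total of 61,875 steps is taken from the metrics table, and rounding the warmup length up is an assumption. The `LambdaLR` object recorded above was not inspected.

```python
import math


def linear_schedule_lr(step, base_lr=1e-4, total_steps=61_875, warmup_ratio=0.5):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    # Assumption: the warmup length is the ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        scale = step / max(1, warmup_steps)
    else:
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * scale


for step in (0, 15_469, 30_938, 46_406, 61_875):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak of 0.0001 is reached only around step 30,938, so the model trains below the nominal learning rate for most of the run.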

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0