metadata

base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_projector_experiment
    results: []

Summary

Distilled with the Distily library, using gpt2 as the teacher model and wikimedia/wikipedia as the training dataset.

Model Architecture:

Architecture: GPT2LMHeadModel
Total Parameters: 124,439,808
Data Type (dtype): torch.bfloat16
Model Size: 0.24 GB
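
The parameter count above can be reproduced by hand. The following is a minimal sketch, assuming the standard GPT-2 small configuration (vocabulary 50257, 1024 positions, hidden size 768, 12 layers) and an output head tied to the token embedding; the function name `gpt2_param_count` is illustrative and not part of Distily or Transformers.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count GPT-2 parameters, assuming lm_head shares weights with the token embedding."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    layer_norm = 2 * n_embd                                  # weight + bias
    # Attention: fused QKV projection plus output projection, each with a bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, each with a bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return embeddings + n_layer * block + layer_norm         # plus final layer norm


total_params = gpt2_param_count()
weight_bytes = total_params * 2  # bfloat16 stores each parameter in 2 bytes

print(f"parameters:   {total_params:,}")
print(f"weight bytes: {weight_bytes:,}")
```

Under these assumptions the count comes to 124,439,808 parameters, or 248,879,616 bytes of weights in bfloat16. The reported model size may use a slightly different accounting (for example, including non-parameter buffers), so treat the byte figure as an estimate for the weights alone.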

Evaluation Metrics Comparison

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.25	61.25					11.6875	19.125
0	0	2473901162496.0	170424302305280.0	30.7740	30.0939	83.073	10.401	4060086272.0	71468255805440.0
2500	0.0404	1192.0	11840.0	9.8250	30.1508	82.916	10.381	772.0	15040.0
5000	0.0808	412.0	2240.0	8.3978	30.1808	82.834	10.371	290.0	438.0
7500	0.1212	245.0	908.0	7.6620	30.1603	82.891	10.378	219.0	198.0
10000	0.1616	182.0	672.0	7.2415	30.2587	82.621	10.344	165.0	204.0
12500	0.2020	132.0	504.0	6.6895	30.1682	82.869	10.375	115.0	155.0
15000	0.2424	113.0	436.0	6.4127	30.186	82.82	10.369	89.5	137.0
17500	0.2828	92.5	340.0	6.1945	30.108	83.035	10.396	71.0	132.0
20000	0.3232	74.0	278.0	5.9293	30.1455	82.931	10.383	63.25	134.0
22500	0.3636	66.0	215.0	5.6606	30.0869	83.093	10.403	50.5	81.5
25000	0.4040	63.25	189.0	5.5592	30.1385	82.95	10.385	44.0	72.5
27500	0.4444	59.0	202.0	5.4963	30.1334	82.964	10.387	40.5	79.0
30000	0.4848	59.75	198.0	5.4789	30.1924	82.802	10.367	42.25	63.75
32500	0.5253	58.75	177.0	5.4552	30.1133	83.02	10.394	40.25	56.5
35000	0.5657	57.5	167.0	5.3773	30.1179	83.007	10.393	36.0	51.0
37500	0.6061	57.5	161.0	5.3443	30.1249	82.988	10.39	37.75	53.25
40000	0.6465	54.5	159.0	5.3258	30.1211	82.998	10.391	34.25	59.0
42500	0.6869	55.25	150.0	5.2937	30.1886	82.813	10.368	35.75	50.75
45000	0.7273	50.5	132.0	5.1564	30.1176	83.008	10.393	30.125	42.75
47500	0.7677	50.75	123.0	5.1254	30.0774	83.119	10.406	29.375	37.5
50000	0.8081	50.0	123.5	5.1100	30.1068	83.038	10.396	28.75	39.0
52500	0.8485	49.0	120.0	5.0958	30.1022	83.05	10.398	29.125	35.0
55000	0.8889	48.75	117.5	5.0753	30.968	80.728	10.107	28.125	35.75
57500	0.9293	48.25	117.0	5.0696	30.0872	83.092	10.403	28.0	33.25
60000	0.9697	48.25	117.0	5.0655	30.1265	82.983	10.39	28.0	33.0
61875	1.0	48.25	117.0	5.0651	30.1098	83.03	10.395	28.0	33.25
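
To read the table as a comparison, it can help to express the final student perplexities as multiples of the teacher's. This is a small sketch using only the teacher row and the last row (step 61875) from the table above; the variable names are illustrative.

```python
# Perplexities copied from the "teacher eval" row and the final step of the table.
teacher = {"enwikippl": 43.25, "frwikippl": 61.25, "tinystoriesppl": 11.6875, "zhwikippl": 19.125}
student_final = {"enwikippl": 48.25, "frwikippl": 117.0, "tinystoriesppl": 28.0, "zhwikippl": 33.25}

# A ratio of 1.0 would mean the student matches the teacher on that metric.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>15}: {ratio:.2f}x teacher perplexity")
```

On these numbers the student is closest to the teacher on English Wikipedia (about 1.12 times the teacher's perplexity) and furthest on TinyStories (about 2.40 times).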

Resource Usage Comparison

VRAM Use: 7.7843 GB

Distillation (Teacher -> Student) Architecture Difference:

Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
Total Parameters: 124,439,808 -> 124,439,808
Data Type (dtype): 124439808 -> torch.bfloat16
Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

Num Samples: 247,500
Subset: 20231101.en
Split: train
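
The sample count and the step count in the evaluation table follow from the hyperparameters listed further down (dataset_sample_size, dataset_test_size, train_batch_size, gradient_accumulation_steps). A minimal sketch of that arithmetic, assuming a single training device and that the test split is carved out of the sampled rows:

```python
dataset_sample_size = 250_000
dataset_test_size = 0.01
train_batch_size = 4
gradient_accumulation_steps = 1
total_tokens = 145_744_973

# Rows held out for evaluation, and rows left for training.
eval_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - eval_samples

# Optimizer steps in one epoch (assumes one device, no dropped remainder).
steps_per_epoch = train_samples // (train_batch_size * gradient_accumulation_steps)

# Average number of tokens per training sample.
avg_tokens_per_sample = total_tokens / train_samples

print(train_samples, steps_per_epoch, round(avg_tokens_per_sample, 1))
```

This gives 247,500 training samples and 61,875 steps for one epoch, matching the figures reported in this card, with an average of roughly 589 tokens per sample.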

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=cos, layer_mapper=layer-2))
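
The objective above combines a KL-divergence term on the logits (weight 1) with a cosine term on the attention maps (weight 10.0). The following PyTorch sketch shows one plausible reading of that combination. It is not Distily's implementation: the function names are hypothetical, the reduction choices are assumptions, and the `layer-2` layer mapper is not reproduced here (the caller is assumed to pass an already-selected pair of attention tensors).

```python
import torch
import torch.nn.functional as F


def kl_logits_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary, averaged over all token positions."""
    student_logp = F.log_softmax(student_logits, dim=-1).flatten(0, -2)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1).flatten(0, -2)
    return F.kl_div(student_logp, teacher_logp, reduction="batchmean", log_target=True)


def cos_attn_loss(student_attn, teacher_attn):
    """Mean cosine distance (1 - cosine similarity) between attention rows."""
    return (1.0 - F.cosine_similarity(student_attn, teacher_attn, dim=-1)).mean()


def distillation_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                      logits_weight=1.0, attn_weight=10.0):
    """Weighted sum of the two components, using the weights from the objective above."""
    return (logits_weight * kl_logits_loss(student_logits, teacher_logits)
            + attn_weight * cos_attn_loss(student_attn, teacher_attn))


# Toy shapes: (batch, seq, vocab) for logits, (batch, heads, seq, seq) for attention.
torch.manual_seed(0)
teacher_logits = torch.randn(2, 5, 11)
student_logits = torch.randn(2, 5, 11)
teacher_attn = torch.softmax(torch.randn(2, 3, 5, 5), dim=-1)
student_attn = torch.softmax(torch.randn(2, 3, 5, 5), dim=-1)

loss = distillation_loss(student_logits, teacher_logits, student_attn, teacher_attn)
print(loss.item())
```

With identical student and teacher tensors both terms vanish, so the loss approaches zero as the student's logits and attention patterns converge on the teacher's.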

Hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=cos, layer_mapper=layer-2))
train_embeddings: True
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: None
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: True
student_model_compile: False
dropout: None
teacher_model_name_or_path: gpt2
teacher_load_in_8bit: False
teacher_load_in_4bit: False
teacher_model_compile: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 250000
dataset_test_size: 0.01
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.5
warmup_steps: 0
gradient_checkpointing: True
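
With lr_scheduler_type set to linear and a warmup ratio of 0.5, the learning rate would rise for the first half of training and decay for the second half. The sketch below assumes the usual warmup-then-linear-decay shape, with warmup steps derived as the ceiling of total steps times the warmup ratio; the function name is hypothetical, and whether Distily derives the warmup length in exactly this way is not confirmed by this card.

```python
import math


def linear_warmup_decay_lr(step, total_steps=61_875, warmup_ratio=0.5, peak_lr=1e-4):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding rule
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


for step in (0, 15_469, 30_938, 46_406, 61_875):
    print(step, linear_warmup_decay_lr(step))
```

Under these assumptions the learning rate peaks at 0.0001 around step 30,938 and returns to zero at step 61,875.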

Framework Versions

Distily 0.2.0
Transformers 4.44.0
Pytorch 2.3.0
Datasets 2.21.0