
distily_bench_gpt2_activation_loss

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
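
As a minimal usage sketch (assuming the model is published as lapp0/distily_bench_gpt2_activation_loss and keeps the standard GPT-2 causal-LM head), it can be loaded with the transformers API:

```python
# Minimal usage sketch, not part of the original card: load the distilled student
# and generate a short continuation. The repo id is an assumption taken from the
# model's published name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_activation_loss"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```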

It achieves the following results on the evaluation set:

  • eval_enwikippl: 215.7906
  • eval_frwikippl: 1306.3361
  • eval_zhwikippl: 583.5945
  • eval_loss: 1.2753
  • eval_runtime: 34.5544
  • eval_samples_per_second: 57.88
  • eval_steps_per_second: 7.235
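
The enwikippl, frwikippl, and zhwikippl values are presumably corpus perplexities on English, French, and Chinese Wikipedia text, i.e. the exponential of the mean token-level cross-entropy. A hedged sketch of computing such a perplexity (an illustration, not Distily's evaluation code; model and tokenizer as loaded above):

```python
# Sketch: corpus perplexity = exp(total negative log-likelihood / total tokens).
import math
import torch

def perplexity(model, tokenizer, texts):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])
            n_tokens = enc["input_ids"].numel() - 1  # loss is over shifted targets
            total_nll += out.loss.item() * n_tokens
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```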

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), hs_weight=0.2, hs_loss_fn=(fn:soft_cross_entropy_loss()), attn_weight=0, attn_loss_fn=(fn:soft_mse_loss())) (see the illustrative sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
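
The distillation_objective combines a KL-divergence term on the logits (weight 1) with a soft cross-entropy term on the hidden states (weight 0.2); the attention term is disabled (weight 0). The sketch below illustrates what such a combined loss could look like; it is an assumption-laden illustration, not Distily's implementation, and in particular the way hidden states are turned into distributions here is a guess:

```python
# Illustrative sketch of a multi-objective distillation loss; not Distily's code.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=0.2):
    # Logits term: KL(teacher || student) over the vocabulary, averaged per token.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(
        s_logp.flatten(0, 1), t_prob.flatten(0, 1), reduction="batchmean"
    )

    # Hidden-state term: soft cross-entropy between student and teacher activations.
    # Treating each hidden vector as a distribution over features is an assumption
    # of this sketch, not necessarily what Distily does.
    hs_loss = torch.tensor(0.0, device=s_logp.device)
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        hs_loss = hs_loss + (
            -(F.softmax(t_h, dim=-1) * F.log_softmax(s_h, dim=-1)).sum(-1).mean()
        )
    hs_loss = hs_loss / len(student_out.hidden_states)

    # The attention term has weight 0 in this run, so it is omitted here.
    return logits_weight * logits_loss + hs_weight * hs_loss
```

Here student_out and teacher_out are forward outputs with output_hidden_states=True; the result would be minimized with the Adam settings listed above while the teacher stays frozen.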

Resource Usage

Peak GPU Memory: 8.0904 GB
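
The peak figure is presumably the maximum memory allocated on the training GPU; with PyTorch it can be read out roughly as follows (a sketch, not necessarily how Distily records it):

```python
# Sketch: report peak allocated GPU memory after a training run (PyTorch, CUDA).
import torch

torch.cuda.reset_peak_memory_stats()
# ... training loop runs here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```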

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55317.8945 | 54673.0039 | 5.9451 | 34.3869 | 58.162 | 7.27 | 59699.7266 |
| 1000 | 0.0404 | 730.1941 | 4645.5654 | 1.9777 | 34.4565 | 58.044 | 7.256 | 11835.6895 |
| 2000 | 0.0808 | 512.3346 | 3066.7913 | 1.7886 | 34.514 | 57.947 | 7.243 | 2242.3167 |
| 3000 | 0.1212 | 429.5358 | 2881.4824 | 1.6761 | 34.5666 | 57.859 | 7.232 | 1026.3900 |
| 4000 | 0.1616 | 373.7652 | 2655.5688 | 1.5906 | 34.614 | 57.78 | 7.223 | 763.8854 |
| 5000 | 0.2020 | 325.4127 | 2080.4141 | 1.5114 | 34.571 | 57.852 | 7.231 | 852.0554 |
| 6000 | 0.2424 | 283.6897 | 1739.0519 | 1.4385 | 34.537 | 57.909 | 7.239 | 678.3768 |
| 7000 | 0.2828 | 258.3082 | 1566.5100 | 1.3779 | 34.3495 | 58.225 | 7.278 | 772.5022 |
| 8000 | 0.3232 | 234.4317 | 1380.9691 | 1.3265 | 34.584 | 57.83 | 7.229 | 688.6912 |
| 9000 | 0.3636 | 215.7906 | 1306.3361 | 1.2753 | 34.5544 | 57.88 | 7.235 | 583.5945 |
| 10000 | 0.4040 | 198.0003 | 1188.5668 | 1.2276 | 34.6027 | 57.799 | 7.225 | 564.3576 |
| 11000 | 0.4444 | 181.9449 | 1205.6156 | 1.1866 | 34.6424 | 57.733 | 7.217 | 871.1540 |
| 12000 | 0.4848 | 168.2595 | 991.4434 | 1.1379 | 34.5542 | 57.88 | 7.235 | 533.5765 |
| 13000 | 0.5253 | 158.1617 | 921.2160 | 1.1068 | 34.2816 | 58.34 | 7.293 | 551.3958 |
| 14000 | 0.5657 | 150.2365 | 856.6271 | 1.0792 | 34.3516 | 58.221 | 7.278 | 696.3668 |
| 15000 | 0.6061 | 144.5713 | 878.8344 | 1.0595 | 34.5245 | 57.93 | 7.241 | 519.5146 |
| 16000 | 0.6465 | 139.2169 | 769.8428 | 1.0385 | 34.5444 | 57.896 | 7.237 | 485.0522 |
| 17000 | 0.6869 | 137.1886 | 707.7368 | 1.0232 | 34.5043 | 57.964 | 7.245 | 714.2610 |
| 18000 | 0.7273 | 133.8944 | 712.3927 | 1.0136 | 34.5983 | 57.806 | 7.226 | 653.7423 |
| 19000 | 0.7677 | 130.7503 | 663.3331 | 1.0010 | 34.4715 | 58.019 | 7.252 | 561.5013 |
| 20000 | 0.8081 | 129.0055 | 645.0233 | 0.9909 | 34.3899 | 58.157 | 7.27 | 516.6090 |
| 21000 | 0.8485 | 127.4918 | 689.9504 | 0.9901 | 34.443 | 58.067 | 7.258 | 434.3951 |
| 22000 | 0.8889 | 123.1615 | 682.8845 | 0.9756 | 34.5625 | 57.866 | 7.233 | 450.5237 |
| 23000 | 0.9293 | 123.7943 | 689.0751 | 0.9766 | 35.1251 | 56.939 | 7.117 | 593.8933 |
| 24000 | 0.9697 | 121.1035 | 699.7977 | 0.9661 | 34.5693 | 57.855 | 7.232 | 1115.5807 |
| 24750 | 1.0 | 123.4295 | 640.9431 | 0.9586 | 34.5071 | 57.959 | 7.245 | 447.3463 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0