distily_bench_obj_cross

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (the values match the step 9000 row of the Eval-Phase Metrics table):

eval_enwikippl: 148.8680
eval_frwikippl: 21987.7637
eval_zhwikippl: 181662.0469
eval_tinystoriesppl: 12.2941
eval_loss: 25.4402
eval_runtime: 66.3462
eval_samples_per_second: 75.362
eval_steps_per_second: 9.42
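
For readers unfamiliar with the `*ppl` metrics above: perplexity is conventionally the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below illustrates that definition only. The helper name `perplexity` and the input values are hypothetical, and this is not Distily's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input (not from this model): a model that assigns
# probability 1/4 to every one of 10 tokens.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))  # a value very close to 4
```

Note that eval_loss is presumably the value of the distillation objective rather than a cross-entropy, so exp(eval_loss) is not expected to match any of the perplexities listed.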

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=kl, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.004
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
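
The distillation_objective above combines three components: KL divergence on logits (weight 1), KL divergence on hidden states (weight 10.0), and MSE on attentions (weight 10.0). A minimal sketch of such a weighted sum follows, using plain Python lists in place of tensors. The function names, the direction of the KL term, the softmax normalization, and the toy inputs are all assumptions made for illustration. How Distily actually normalizes and reduces each component is not stated here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) between softmax distributions (assumed direction)."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Mean squared error between two equal-length lists."""
    n = len(teacher_values)
    return sum((t - s) ** 2 for t, s in zip(teacher_values, student_values)) / n


# Weights as listed in distillation_objective.
WEIGHTS = {"logits": 1.0, "hs": 10.0, "attn": 10.0}


def distillation_loss(teacher, student):
    """Weighted sum of the three per-component losses."""
    parts = {
        "logits": kl_divergence(teacher["logits"], student["logits"]),
        "hs": kl_divergence(teacher["hs"], student["hs"]),
        "attn": mse(teacher["attn"], student["attn"]),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Toy vectors (hypothetical, not taken from the model).
teacher_out = {"logits": [2.0, 1.0, 0.1], "hs": [0.5, -0.3, 0.8], "attn": [0.7, 0.2, 0.1]}
student_out = {"logits": [1.5, 1.2, 0.3], "hs": [0.4, -0.1, 0.6], "attn": [0.6, 0.3, 0.1]}
print(distillation_loss(teacher_out, student_out))
```

With layer_mapper=None and projector=None, the sketch compares teacher and student features directly, which assumes both have the same shape.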

Resource Usage

Peak GPU Memory: 8.2666 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	8167.3613	48488.5742	38.4688	65.8686	75.909	9.489	3345.3254	73944.1484
3000	0.0485	145.5666	22094.8828	25.4406	65.8501	75.93	9.491	11.9420	179685.7344
6000	0.0970	146.8635	22376.7676	25.4394	66.4135	75.286	9.411	11.9667	183024.1719
9000	0.1455	148.8680	21987.7637	25.4402	66.3462	75.362	9.42	12.2941	181662.0469
12000	0.1939	151.0636	22504.7676	25.4400	66.2246	75.501	9.438	12.5052	181759.0938
15000	0.2424	146.5339	22604.8535	25.4392	66.1192	75.621	9.453	11.8540	189888.4375
18000	0.2909	147.3192	22481.0215	25.4400	66.2457	75.477	9.435	12.0058	183905.2969
21000	0.3394	150.9525	22555.5625	25.4390	66.2661	75.453	9.432	12.4310	188575.7344
24000	0.3879	149.8920	22155.6523	25.4404	66.2363	75.487	9.436	12.4593	177493.9531
27000	0.4364	147.1653	22531.7402	25.4398	66.3823	75.321	9.415	11.9514	183905.2969
30000	0.4848	150.4855	22580.9805	25.4400	66.2281	75.497	9.437	12.4172	183513.1875
33000	0.5333	145.7359	22307.5195	25.4400	66.4448	75.25	9.406	11.9159	180165.8438
36000	0.5818	148.7297	22495.2617	25.4396	66.2715	75.447	9.431	12.1426	186574.0156
39000	0.6303	147.5820	22807.9492	25.4406	66.6342	75.037	9.38	11.9944	187372.1406
42000	0.6788	150.2292	22193.125	25.4402	66.5873	75.089	9.386	12.5202	182050.1875
45000	0.7273	146.7725	22207.2051	25.4400	66.1476	75.589	9.449	11.9890	181468.2812
48000	0.7758	146.3014	22194.6914	25.4398	66.4166	75.282	9.41	11.9746	177588.7812
51000	0.8242	148.6375	22533.3301	25.4402	66.2612	75.459	9.432	12.1471	186275.5156
54000	0.8727	147.6220	22394.1035	25.4404	66.4085	75.292	9.411	12.1140	185581.1406
57000	0.9212	148.8161	22679.8047	25.4400	66.3328	75.377	9.422	12.1230	187872.7812
60000	0.9697	146.8180	22345.2695	25.4392	66.5261	75.158	9.395	12.0317	181371.5625
61875	1.0	149.0526	22099.5410	25.4400	66.498	75.19	9.399	12.3048	181371.5625
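
One way to read the table is to divide each student perplexity by the teacher's and to check how the epoch column follows from the step count. The sketch below does both with values copied from the "teacher eval" row and the final (step 61875) row. The helper names `ppl_ratios` and `epoch_at` are hypothetical.

```python
# Values copied from the "teacher eval" row and the step 61875 row.
teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}
student_final = {
    "enwikippl": 149.0526,
    "frwikippl": 22099.5410,
    "tinystoriesppl": 12.3048,
    "zhwikippl": 181371.5625,
}

TOTAL_STEPS = 61875  # last step in the table, reported at epoch 1.0


def ppl_ratios(student, teacher):
    """Student perplexity divided by teacher perplexity, per metric."""
    return {name: student[name] / teacher[name] for name in teacher}


def epoch_at(step):
    """Fraction of the single training epoch completed at a given step."""
    return step / TOTAL_STEPS


ratios = ppl_ratios(student_final, teacher)
for name in sorted(ratios):
    relation = "lower than" if ratios[name] < 1 else "higher than"
    print(f"{name}: {ratios[name]:.2f}x, {relation} the teacher")

print(round(epoch_at(3000), 4))
```

On these numbers the student's perplexity is below the teacher's on enwikippl and frwikippl and above it on tinystoriesppl and zhwikippl.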

Framework versions

Distily 0.2.0
Transformers 4.44.0
Pytorch 2.3.0
Datasets 2.20.0
