
distily_bench_gpt2_activation_loss

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
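
As a minimal usage sketch (assuming the model is published as lapp0/distily_bench_gpt2_activation_loss and keeps the standard GPT-2 causal-LM head), it can be loaded with the transformers API:

```python
# Minimal usage sketch, not part of the original card: load the distilled student
# and generate a short continuation. The repo id is an assumption taken from the
# model's published name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_activation_loss"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```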

It achieves the following results on the evaluation set:

  • eval_enwikippl: 215.7906
  • eval_frwikippl: 1306.3361
  • eval_zhwikippl: 583.5945
  • eval_loss: 1.2753
  • eval_runtime: 34.5544
  • eval_samples_per_second: 57.88
  • eval_steps_per_second: 7.235
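
The enwikippl, frwikippl, and zhwikippl values are presumably corpus perplexities on English, French, and Chinese Wikipedia text, i.e. the exponential of the mean token-level cross-entropy. A hedged sketch of computing such a perplexity (an illustration, not Distily's evaluation code; model and tokenizer as loaded above):

```python
# Sketch: corpus perplexity = exp(total negative log-likelihood / total tokens).
import math
import torch

def perplexity(model, tokenizer, texts):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])
            n_tokens = enc["input_ids"].numel() - 1  # loss is over shifted targets
            total_nll += out.loss.item() * n_tokens
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```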

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), hs_weight=0.2, hs_loss_fn=(fn:soft_cross_entropy_loss()), attn_weight=0, attn_loss_fn=(fn:soft_mse_loss())) (see the illustrative sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
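
The distillation_objective combines a KL-divergence term on the logits (weight 1) with a soft cross-entropy term on the hidden states (weight 0.2); the attention term is disabled (weight 0). The sketch below illustrates what such a combined loss could look like; it is an assumption-laden illustration, not Distily's implementation, and in particular the way hidden states are turned into distributions here is a guess:

```python
# Illustrative sketch of a multi-objective distillation loss; not Distily's code.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=0.2):
    # Logits term: KL(teacher || student) over the vocabulary, averaged per token.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(
        s_logp.flatten(0, 1), t_prob.flatten(0, 1), reduction="batchmean"
    )

    # Hidden-state term: soft cross-entropy between student and teacher activations.
    # Treating each hidden vector as a distribution over features is an assumption
    # of this sketch, not necessarily what Distily does.
    hs_loss = torch.tensor(0.0, device=s_logp.device)
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        hs_loss = hs_loss + (
            -(F.softmax(t_h, dim=-1) * F.log_softmax(s_h, dim=-1)).sum(-1).mean()
        )
    hs_loss = hs_loss / len(student_out.hidden_states)

    # The attention term has weight 0 in this run, so it is omitted here.
    return logits_weight * logits_loss + hs_weight * hs_loss
```

Here student_out and teacher_out are forward outputs with output_hidden_states=True; the result would be minimized with the Adam settings listed above while the teacher stays frozen.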

Resource Usage

Peak GPU Memory: 8.0904 GB
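
The peak figure is presumably the maximum memory allocated on the training GPU; with PyTorch it can be read out roughly as follows (a sketch, not necessarily how Distily records it):

```python
# Sketch: report peak allocated GPU memory after a training run (PyTorch, CUDA).
import torch

torch.cuda.reset_peak_memory_stats()
# ... training loop runs here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```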

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55317.8945 | 54673.0039 | 5.9451 | 34.3869 | 58.162 | 7.27 | 59699.7266 |
| 1000 | 0.0404 | 730.1941 | 4645.5654 | 1.9777 | 34.4565 | 58.044 | 7.256 | 11835.6895 |
| 2000 | 0.0808 | 512.3346 | 3066.7913 | 1.7886 | 34.514 | 57.947 | 7.243 | 2242.3167 |
| 3000 | 0.1212 | 429.5358 | 2881.4824 | 1.6761 | 34.5666 | 57.859 | 7.232 | 1026.3900 |
| 4000 | 0.1616 | 373.7652 | 2655.5688 | 1.5906 | 34.614 | 57.78 | 7.223 | 763.8854 |
| 5000 | 0.2020 | 325.4127 | 2080.4141 | 1.5114 | 34.571 | 57.852 | 7.231 | 852.0554 |
| 6000 | 0.2424 | 283.6897 | 1739.0519 | 1.4385 | 34.537 | 57.909 | 7.239 | 678.3768 |
| 7000 | 0.2828 | 258.3082 | 1566.5100 | 1.3779 | 34.3495 | 58.225 | 7.278 | 772.5022 |
| 8000 | 0.3232 | 234.4317 | 1380.9691 | 1.3265 | 34.584 | 57.83 | 7.229 | 688.6912 |
| 9000 | 0.3636 | 215.7906 | 1306.3361 | 1.2753 | 34.5544 | 57.88 | 7.235 | 583.5945 |
| 10000 | 0.4040 | 198.0003 | 1188.5668 | 1.2276 | 34.6027 | 57.799 | 7.225 | 564.3576 |
| 11000 | 0.4444 | 181.9449 | 1205.6156 | 1.1866 | 34.6424 | 57.733 | 7.217 | 871.1540 |
| 12000 | 0.4848 | 168.2595 | 991.4434 | 1.1379 | 34.5542 | 57.88 | 7.235 | 533.5765 |
| 13000 | 0.5253 | 158.1617 | 921.2160 | 1.1068 | 34.2816 | 58.34 | 7.293 | 551.3958 |
| 14000 | 0.5657 | 150.2365 | 856.6271 | 1.0792 | 34.3516 | 58.221 | 7.278 | 696.3668 |
| 15000 | 0.6061 | 144.5713 | 878.8344 | 1.0595 | 34.5245 | 57.93 | 7.241 | 519.5146 |
| 16000 | 0.6465 | 139.2169 | 769.8428 | 1.0385 | 34.5444 | 57.896 | 7.237 | 485.0522 |
| 17000 | 0.6869 | 137.1886 | 707.7368 | 1.0232 | 34.5043 | 57.964 | 7.245 | 714.2610 |
| 18000 | 0.7273 | 133.8944 | 712.3927 | 1.0136 | 34.5983 | 57.806 | 7.226 | 653.7423 |
| 19000 | 0.7677 | 130.7503 | 663.3331 | 1.0010 | 34.4715 | 58.019 | 7.252 | 561.5013 |
| 20000 | 0.8081 | 129.0055 | 645.0233 | 0.9909 | 34.3899 | 58.157 | 7.27 | 516.6090 |
| 21000 | 0.8485 | 127.4918 | 689.9504 | 0.9901 | 34.443 | 58.067 | 7.258 | 434.3951 |
| 22000 | 0.8889 | 123.1615 | 682.8845 | 0.9756 | 34.5625 | 57.866 | 7.233 | 450.5237 |
| 23000 | 0.9293 | 123.7943 | 689.0751 | 0.9766 | 35.1251 | 56.939 | 7.117 | 593.8933 |
| 24000 | 0.9697 | 121.1035 | 699.7977 | 0.9661 | 34.5693 | 57.855 | 7.232 | 1115.5807 |
| 24750 | 1.0 | 123.4295 | 640.9431 | 0.9586 | 34.5071 | 57.959 | 7.245 | 447.3463 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0