roberta-base distilled into tinyroberta
Overview
Language model: roberta-base
Language: English
Training data: The Pile
Infrastructure: 4x Tesla V100
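As a minimal usage sketch, assuming the checkpoint is published as deepset/tinyroberta-6l-768d (the name given in the Distillation section below) and loaded with the Hugging Face transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepset/tinyroberta-6l-768d"  # assumed model id, see Distillation below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("TinyRoBERTa is a 6-layer distilled RoBERTa.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```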
Hyperparameters
batch_size = 96
n_epochs = 4
max_seq_len = 384
learning_rate = 1e-4
lr_schedule = LinearWarmup
warmup_proportion = 0.2
teacher = "deepset/roberta-base"
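Roughly, these hyperparameters correspond to the PyTorch / transformers setup below. This is an illustrative sketch only: the actual run used Haystack, and the step count is a placeholder, not a value from this card.

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

student = AutoModel.from_pretrained("deepset/tinyroberta-6l-768d")  # assumed model id

batch_size = 96
n_epochs = 4
learning_rate = 1e-4
warmup_proportion = 0.2

# Placeholder: the real step count depends on how much of The Pile was used.
steps_per_epoch = 10_000
num_training_steps = steps_per_epoch * n_epochs
num_warmup_steps = int(warmup_proportion * num_training_steps)

optimizer = torch.optim.AdamW(student.parameters(), lr=learning_rate)
# LinearWarmup: linear ramp-up over the first 20% of steps, then linear decay.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
```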
Distillation
This model was distilled using the TinyBERT approach described in this paper and implemented in Haystack. We performed intermediate layer distillation with roberta-base as the teacher, which resulted in deepset/tinyroberta-6l-768d (sketched below). This model has not been distilled for any specific task. If you want to use distillation to improve its performance on a downstream task, you can take advantage of Haystack's distillation functionality. You can also check out deepset/tinyroberta-squad2 for a model that has already been distilled on an extractive QA downstream task.
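For illustration, the core of that intermediate-layer objective can be sketched with plain transformers code. This is a simplified sketch, not the Haystack implementation: it only matches hidden states (TinyBERT additionally distills embeddings and attention matrices), and the uniform 2-to-1 layer mapping is an assumption.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher = AutoModel.from_pretrained("deepset/roberta-base")
student = AutoModel.from_pretrained("deepset/tinyroberta-6l-768d")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base")

batch = tokenizer(
    ["Intermediate layer distillation matches student hidden states to the teacher's."],
    return_tensors="pt", truncation=True, max_length=384,
)

with torch.no_grad():
    teacher_out = teacher(**batch, output_hidden_states=True)
student_out = student(**batch, output_hidden_states=True)

# hidden_states[0] is the embedding output, followed by one tensor per layer.
# Assumed mapping: student layer i -> teacher layer 2i (6 vs. 12 layers).
# Both models use a hidden size of 768, so no projection matrix is needed.
layer_map = {i: 2 * i for i in range(1, 7)}

loss = torch.stack([
    F.mse_loss(student_out.hidden_states[s], teacher_out.hidden_states[t])
    for s, t in layer_map.items()
]).mean()
# During training, loss.backward() would update the student only.
```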
About us
deepset is the company behind the production-ready open-source AI framework Haystack.
Some of our other work:
- Distilled roberta-base-squad2 (aka "tinyroberta-squad2")
- German BERT, GermanQuAD and GermanDPR, German embedding model
- deepset Cloud, deepset Studio
Get in touch and join the Haystack community
For more info on Haystack, visit our GitHub repo and Documentation.
We also have a Discord community open to everyone!
Twitter | LinkedIn | Discord | GitHub Discussions | Website | YouTube
By the way: we're hiring!