distilHerBERT
distilHerBERT-base is a BERT-based language model trained on the Polish subset of the cc100 dataset using Masked Language Modelling (MLM) and a distillation procedure from the HerBERT model, with dynamic whole-word masking. We provide one of the models (S4) described in the report from the final project for the (Deep) Natural Language Processing course, carried out at MIMUW in 2021/2022: Distillation_of_HerBERT.
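The exact distillation recipe is described in the linked report; as a rough illustration only, a generic knowledge-distillation objective for MLM (soft teacher targets plus hard masked-token labels) might look like the sketch below. The temperature and alpha weights are illustrative assumptions, not the values used for S4:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target loss: standard MLM cross-entropy; non-masked positions carry label -100.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1.0 - alpha) * hard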
The model was trained with fp16 and data parallelism (ZeRO Stage 2), using the DeepSpeed deep learning optimization library.
Model training and experiments were conducted with transformers version 4.20.1.
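For reference, a minimal DeepSpeed configuration enabling fp16 and ZeRO Stage 2 looks roughly like the following sketch; the micro-batch size and the way the config is passed (e.g. to the deepspeed launcher or to transformers.TrainingArguments) are assumptions, not the project's actual settings:

import json

# Minimal DeepSpeed config sketch: fp16 training with ZeRO Stage 2 data parallelism.
# The micro-batch size below is a placeholder, not the value used to train distilHerBERT.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The resulting file can then be passed to the deepspeed launcher or to
# transformers.TrainingArguments(deepspeed="ds_config.json").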
Tokenizer
The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
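As a rough sketch (not the project's exact training script), a CharBPETokenizer with a 50k-token vocabulary can be trained with the tokenizers library, and the published checkpoint can then be loaded through HerbertTokenizerFast. The corpus path below is a placeholder:

from tokenizers import CharBPETokenizer
from transformers import HerbertTokenizerFast

# Train a character-level BPE tokenizer with a 50k-token vocabulary.
# "corpus.txt" is a placeholder path, not the actual cc100 training data.
tokenizer = CharBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000)
tokenizer.save("tokenizer.json")

# For inference, load the fast tokenizer shipped with the published checkpoint.
fast_tokenizer = HerbertTokenizerFast.from_pretrained("BartekK/distilHerBERT-base-cased")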
Usage
Example code:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("BartekK/distilHerBERT-base-cased")
model = AutoModelForMaskedLM.from_pretrained("BartekK/distilHerBERT-base-cased")

# Encode a pair of Polish sentences and run the masked language model on them.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
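To inspect the MLM head more directly, the same model and tokenizer objects from the example above can also be wrapped in a fill-mask pipeline; the Polish prompt below is only an illustrative example, not taken from the training data:

from transformers import pipeline

# Predict the masked token with the MLM head; the prompt is an arbitrary example.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
predictions = fill_mask(f"Warszawa to {tokenizer.mask_token} Polski.")
print(predictions[0]["token_str"], predictions[0]["score"])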
Acknowledgements
We want to thank Spyridon Mouselinos for suggesting literature to help with the project, and Piotr Rybak for sharing information on training the HerBERT models.
Authors
The model was trained by:
Bartłomiej Krzepkowski,
Dominika Bankiewicz,
Rafał Michaluk,
Jacek Ciszewski.
If you have questions, please contact me: [email protected]
The code can be found here: distilHerBERT-base repo.