|
--- |
|
language: nl |
|
widget: |
|
- text: "In het jaar 2030 zullen we" |
|
- text: "Toen ik gisteren volledig in de ban was van" |
|
- text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul" |
|
- text: "In Israël was een strenge lockdown" |
|
tags: |
|
- gpt2-medium |
|
- gpt2 |
|
pipeline_tag: text-generation |
|
datasets: |
|
- yhavinga/mc4_nl_cleaned |
|
--- |
|
# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱 |
|
|
|
A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with a perplexity of 15.2 on cleaned Dutch mC4.
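
A quick way to try the model is the `text-generation` pipeline. A minimal sketch, assuming the model is published under a repository id such as `yhavinga/gpt2-medium-dutch` (the exact id is an assumption; substitute this repository's actual name):

```python
from transformers import pipeline

# The repository id below is an assumption; replace it with this model's actual id.
generator = pipeline("text-generation", model="yhavinga/gpt2-medium-dutch")

# One of the example prompts from this card ("In the year 2030 we will").
result = generator(
    "In het jaar 2030 zullen we",
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(result[0]["generated_text"])
```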
|
|
|
## Tokenizer |
|
|
|
* Tokenizer trained from scratch for Dutch on cleaned Dutch mC4 with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
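
For reference, the Flax examples train a byte-level BPE tokenizer roughly along these lines. This is a sketch, not the recorded training command; the `tiny` configuration name and the GPT-2 default vocabulary size are assumptions:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# A smaller configuration is assumed here to keep the sketch light;
# the actual tokenizer was trained on cleaned Dutch mC4.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "tiny", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Byte-level BPE as in the GPT-2 recipe; the vocab size is an assumption.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("tokenizer.json")
```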
|
|
|
## Dataset |
|
|
|
This model was trained on the `full` configuration (33B tokens) of

[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),

which is the original mC4, except that:
|
|
|
* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
|
* Sentences with fewer than 3 words are removed

* Sentences containing a word of more than 1000 characters are removed

* Documents with fewer than 5 sentences are removed
|
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", |
|
"use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed. |
|
|
|
## Training details |
|
|
|
* Trained for 320K of 520K steps (61%, 20B tokens) |
|
* Block size: 512 |
|
* Optimizer: Adam, lr 8e-4, beta1 0.9, beta2 0.98
|
* Warmup steps: 5000 |
|
* Weight decay: 0.01 |
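
In optax terms (the optimizer library used by the Flax example scripts), these hyper-parameters correspond roughly to the following sketch; the linear warmup-then-decay schedule shape is an assumption based on those scripts, not a recorded configuration:

```python
import optax

total_steps = 520_000   # planned training length from this card
warmup_steps = 5_000

# Linear warmup to the peak learning rate, then linear decay to zero;
# the decay shape is an assumption based on the Flax example scripts.
learning_rate = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 8e-4, transition_steps=warmup_steps),
        optax.linear_schedule(8e-4, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adam with the betas listed above; with weight decay this is AdamW in practice.
optimizer = optax.adamw(
    learning_rate=learning_rate,
    b1=0.9,
    b2=0.98,
    weight_decay=0.01,
)
```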
|
|
|
## Acknowledgements |
|
|
|
This project would not have been possible without compute generously provided by Google through the |
|
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also

instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM,

and getting an idea of what sensible hyper-parameters are for training gpt2 from scratch:
|
|
|
* [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp) |
|
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) |
|
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
|
* [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling) |
|
|