metadata
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
- uk
- multilingual
license: mit
EuroGPT2
NOTE: THIS IS THE ORIGINAL MEGATRON-DEEPSPEED CHECKPOINT INCLUDING OPTIMIZER STATES
A GPT2 language model for European languages (EU-24 + Ukrainian). The model follows the original architecture of OpenAI's GPT2, except that it uses rotary instead of learned positional embeddings.
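To illustrate what rotary positional embeddings do, here is a minimal pure-Python sketch. It is not the Megatron-DeepSpeed implementation. The base of 10000 and the pairing of adjacent vector elements are assumptions taken from the common formulation, and `rotary_embed` and `dot` are hypothetical helper names. The head dimension of 64 follows from the hidden size (768) and number of heads (12) listed under Model settings.

```python
import math

HIDDEN_SIZE = 768
NUM_HEADS = 12
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 64 dimensions per attention head


def rotary_embed(vec, position, base=10000.0):
    """Rotate adjacent pairs of a head vector by position-dependent angles.

    Assumption: base=10000 and adjacent-pair layout; the checkpoint's
    actual settings are not stated in this card.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors for a single head.
q = [0.01 * (i + 1) for i in range(HEAD_DIM)]
k = [0.02 * (HEAD_DIM - i) for i in range(HEAD_DIM)]

# Both pairs of positions are 2 apart, so the attention scores should agree.
score_a = dot(rotary_embed(q, 5), rotary_embed(k, 3))
score_b = dot(rotary_embed(q, 12), rotary_embed(k, 10))
print(math.isclose(score_a, score_b, rel_tol=1e-9, abs_tol=1e-9))
```

The attention score between a rotated query and a rotated key depends only on the distance between their positions, so no learned position table is needed.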
Model settings
- parameters: 124M
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test perplexity (PPL) after training: 23.6 (436,940 steps)
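As a rough cross-check of the settings above, the following sketch counts the parameters of the transformer blocks alone. It assumes the standard GPT2 block layout (fused QKV projection, 4x MLP expansion, biases on all linear layers, two layer norms per block plus a final one). Whether this checkpoint matches that layout exactly is an assumption, and the vocabulary size is not stated in this card, so token embeddings are left out.

```python
HIDDEN_SIZE = 768
NUM_LAYERS = 12


def block_params(hidden, mlp_ratio=4):
    """Parameters in one GPT2-style block (assumed layout, with biases)."""
    attention = 3 * hidden * hidden + 3 * hidden   # fused Q, K, V projection
    attention += hidden * hidden + hidden          # attention output projection
    inner = mlp_ratio * hidden
    mlp = hidden * inner + inner                   # MLP up-projection
    mlp += inner * hidden + hidden                 # MLP down-projection
    layer_norms = 2 * (2 * hidden)                 # two norms, weight + bias each
    return attention + mlp + layer_norms


final_layer_norm = 2 * HIDDEN_SIZE
non_embedding = NUM_LAYERS * block_params(HIDDEN_SIZE) + final_layer_norm
print(f"{non_embedding:,}")  # 85,056,000
```

Rotary embeddings add no learned parameters, so under these assumptions the rest of the 124M total would come from the token embedding table.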
Training data
- Wikimedia dumps (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- EUR-Lex
- OSCAR 2023.01
- Tokens: 75,167,662,080
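The token count can be reproduced from the model settings. The sketch below assumes the batch size counts sequences and that every sequence is packed to the full 1024 tokens. Under that assumption the product matches the reported figure exactly, which suggests the figure is the number of tokens processed during training.

```python
STEPS = 436_940          # training steps, from Model settings
BATCH_SIZE = 168         # assumed to be sequences per step
SEQ_LEN = 1024           # tokens per sequence
REPORTED_TOKENS = 75_167_662_080

tokens_per_step = BATCH_SIZE * SEQ_LEN
tokens_seen = STEPS * tokens_per_step

print(tokens_per_step)                 # 172032
print(tokens_seen == REPORTED_TOKENS)  # True
```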
Languages
Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.
| Language | Ratio |
|---|---|
| bg | 5.92% |
| cs | 4.77% |
| da | 2.19% |
| de | 7.36% |
| el | 8.60% |
| en | 10.11% |
| es | 6.57% |
| et | 1.67% |
| fi | 2.70% |
| fr | 7.18% |
| ga | 0.25% |
| hr | 1.09% |
| hu | 6.38% |
| it | 5.80% |
| lt | 2.01% |
| lv | 1.76% |
| mt | 1.49% |
| nl | 5.20% |
| pl | 4.82% |
| pt | 4.64% |
| ro | 2.93% |
| sk | 2.03% |
| sl | 1.54% |
| sv | 3.00% |
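The ratios can be sanity-checked with a short script. The values below are transcribed from the table with decimal points. The per-language token estimate assumes the ratios are shares of the 75,167,662,080 training tokens, which this card does not state explicitly, and `estimated_tokens` is a hypothetical helper.

```python
# Ratios in percent, transcribed from the table above.
RATIOS = {
    "bg": 5.92, "cs": 4.77, "da": 2.19, "de": 7.36, "el": 8.60, "en": 10.11,
    "es": 6.57, "et": 1.67, "fi": 2.70, "fr": 7.18, "ga": 0.25, "hr": 1.09,
    "hu": 6.38, "it": 5.80, "lt": 2.01, "lv": 1.76, "mt": 1.49, "nl": 5.20,
    "pl": 4.82, "pt": 4.64, "ro": 2.93, "sk": 2.03, "sl": 1.54, "sv": 3.00,
}

# Language codes declared in the card metadata (excluding "multilingual").
METADATA_LANGS = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv", "uk",
]

TOTAL_TOKENS = 75_167_662_080

total_percent = round(sum(RATIOS.values()), 2)
missing = sorted(set(METADATA_LANGS) - set(RATIOS))


def estimated_tokens(lang):
    """Approximate tokens for a language, ASSUMING ratios are token shares."""
    return int(TOTAL_TOKENS * RATIOS[lang] / 100)


print(total_percent)  # 100.01
print(missing)        # ['uk']
```

The listed ratios already sum to about 100% (the 0.01 excess is consistent with rounding), and Ukrainian has no row in the table.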
License
MIT