Update README.md
---
language: es
license: cc-by-4.0
tags:
- spanish
- roberta
pipeline_tag: fill-mask
widget:
- text: "Fui a la librería a comprar un <mask>."
---

This is a **RoBERTa-base** model trained from scratch in Spanish.

The training dataset is [mc4](https://huggingface.co/datasets/bertin-project/mc4-es-sampled), subsampling documents to a total of about 50 million examples. Sampling is biased towards average perplexity values (using a Gaussian function), more often discarding documents with very large values (poor quality) or very small values (short, repetitive texts).
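The Gaussian bias described above can be sketched as follows. This is a minimal illustration, not the project's actual sampling code: the perplexity values, mean, and standard deviation below are made-up assumptions.

```python
import math
import random

def gaussian_weight(perplexity: float, mean: float, std: float) -> float:
    """Weight a document by how close its perplexity is to the corpus average."""
    return math.exp(-((perplexity - mean) ** 2) / (2 * std ** 2))

# Hypothetical documents: one near-average, one repetitive (low perplexity),
# one noisy (high perplexity). Values are illustrative only.
docs = [("doc_average", 250.0), ("doc_repetitive", 30.0), ("doc_noisy", 2000.0)]
mean, std = 300.0, 150.0  # assumed corpus statistics

weights = [gaussian_weight(p, mean, std) for _, p in docs]
# Sampling with these weights keeps mostly average-perplexity documents
# and rarely selects the two extremes.
sample = random.choices([name for name, _ in docs], weights=weights, k=10)
```

The Gaussian gives its maximum weight at the mean, so documents at either extreme are discarded more often rather than being cut off by a hard threshold.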

This model takes the one trained with [sequence length 128](https://huggingface.co/bertin-project/bertin-base-gaussian) and continues training for 25,000 steps using sequence length 512.
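Since this is a fill-mask model, it can be loaded with the `transformers` pipeline. The model id below is a stand-in (the linked sequence-length-128 checkpoint), because this repository's own id is not stated here; substitute the actual id of this model on the Hub.

```python
from transformers import pipeline

# NOTE: placeholder model id — replace with this repository's actual Hub id.
fill_mask = pipeline("fill-mask", model="bertin-project/bertin-base-gaussian")

# Same prompt as the widget example in the model card metadata.
for pred in fill_mask("Fui a la librería a comprar un <mask>."):
    print(pred["token_str"], round(pred["score"], 4))
```

Each prediction is a dict with the filled-in token (`token_str`) and its probability (`score`), sorted from most to least likely.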

This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organised by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

## Team members

- Eduardo González ([edugp](https://huggingface.co/edugp))
- Javier de la Rosa ([versae](https://huggingface.co/versae))
- Manu Romero ([mrm8488](https://huggingface.co/))
- María Grandury ([mariagrandury](https://huggingface.co/))
- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
- Paulo Villegas ([paulo](https://huggingface.co/paulo))