Commit 8557e84 by Shijie Wu
1 Parent(s): 25fb6b2
Minor Fix
README.md CHANGED
@@ -17,6 +17,12 @@ datasets:

# An English-Arabic Bilingual Encoder

+```
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
+model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
+```
+
`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoder: a 24-layer Transformer (d\_model = 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. We additionally use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train on 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words/23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words/6.1B words). We build a joint English–Arabic vocabulary of size 128K using SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.
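
To make the corpus-size comparison above concrete, here is a minimal sketch that turns the quoted word counts into ratios. It is not part of the commit or the README. It assumes each "x/y" pair in the paragraph is ordered Arabic/English, and the names `corpora`, `ratio_to_ours`, and `arabic_share` are hypothetical helpers introduced only for illustration.

```python
# Word counts in billions, copied from the paragraph above.
# Assumption: each "x/y" pair in the text is ordered Arabic/English.
OURS = "roberta-large-eng-ara-128k"

corpora = {
    OURS: {"arabic": 9.2, "english": 26.8},
    "XLM-R": {"arabic": 2.9, "english": 23.6},
    "GigaBERT v4": {"arabic": 4.3, "english": 6.1},
}


def ratio_to_ours(name: str, language: str) -> float:
    """Return how many times more text this model uses than `name` for one language."""
    return corpora[OURS][language] / corpora[name][language]


def arabic_share(name: str) -> float:
    """Return the fraction of a model's bilingual training words that are Arabic."""
    counts = corpora[name]
    return counts["arabic"] / (counts["arabic"] + counts["english"])


if __name__ == "__main__":
    for name in ("XLM-R", "GigaBERT v4"):
        for language in ("arabic", "english"):
            print(f"{language:>7} vs {name}: {ratio_to_ours(name, language):.2f}x")
    print(f"Arabic share of our corpus: {arabic_share(OURS):.1%}")
```

Under that ordering assumption, every ratio comes out above 1, which matches the "more than either" claim in the text for both languages.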

## Pretraining Detail