Update config.json

Files changed:

- README.md (+62 −1)
- config.json (+2 −2)
- test.py (+7 −0)
README.md
CHANGED

@@ -128,6 +128,12 @@ license: cc-by-4.0
 1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
 
+## Model
+
+XLM-Roberta : [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+
+Paper : [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf)
+
 ## Demo: How to use in HuggingFace Transformers Pipeline
 
 Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```
@@ -145,13 +151,68 @@ print(res)
 Outputs:
 
 ```python
-[{'label': '
+[{'label': 'he-IL', 'score': 0.9998375177383423}]
 ```
 
 ## Training data
 
 [MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
 
+### Languages
+
+The model can distinguish 51 languages:
+
+- `Afrikaans - South Africa (af-ZA)`
+- `Amharic - Ethiopia (am-ET)`
+- `Arabic - Saudi Arabia (ar-SA)`
+- `Azeri - Azerbaijan (az-AZ)`
+- `Bengali - Bangladesh (bn-BD)`
+- `Chinese - China (zh-CN)`
+- `Chinese - Taiwan (zh-TW)`
+- `Danish - Denmark (da-DK)`
+- `German - Germany (de-DE)`
+- `Greek - Greece (el-GR)`
+- `English - United States (en-US)`
+- `Spanish - Spain (es-ES)`
+- `Farsi - Iran (fa-IR)`
+- `Finnish - Finland (fi-FI)`
+- `French - France (fr-FR)`
+- `Hebrew - Israel (he-IL)`
+- `Hungarian - Hungary (hu-HU)`
+- `Armenian - Armenia (hy-AM)`
+- `Indonesian - Indonesia (id-ID)`
+- `Icelandic - Iceland (is-IS)`
+- `Italian - Italy (it-IT)`
+- `Japanese - Japan (ja-JP)`
+- `Javanese - Indonesia (jv-ID)`
+- `Georgian - Georgia (ka-GE)`
+- `Khmer - Cambodia (km-KH)`
+- `Korean - Korea (ko-KR)`
+- `Latvian - Latvia (lv-LV)`
+- `Mongolian - Mongolia (mn-MN)`
+- `Malay - Malaysia (ms-MY)`
+- `Burmese - Myanmar (my-MM)`
+- `Norwegian - Norway (nb-NO)`
+- `Dutch - Netherlands (nl-NL)`
+- `Polish - Poland (pl-PL)`
+- `Portuguese - Portugal (pt-PT)`
+- `Romanian - Romania (ro-RO)`
+- `Russian - Russia (ru-RU)`
+- `Slovenian - Slovenia (sl-SL)`
+- `Albanian - Albania (sq-AL)`
+- `Swedish - Sweden (sv-SE)`
+- `Swahili - Kenya (sw-KE)`
+- `Hindi - India (hi-IN)`
+- `Kannada - India (kn-IN)`
+- `Malayalam - India (ml-IN)`
+- `Tamil - India (ta-IN)`
+- `Telugu - India (te-IN)`
+- `Thai - Thailand (th-TH)`
+- `Tagalog - Philippines (tl-PH)`
+- `Turkish - Turkey (tr-TR)`
+- `Urdu - Pakistan (ur-PK)`
+- `Vietnamese - Vietnam (vi-VN)`
+- `Welsh - United Kingdom (cy-GB)`
+
 ## Evaluation results
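The pipeline in the demo returns a list of `{'label', 'score'}` dicts, with labels in the locale format listed above. A minimal helper for splitting the predicted locale into language and region codes (a sketch, not part of this commit; `parse_prediction` is a made-up name):

```python
def parse_prediction(results):
    """Take TextClassificationPipeline output such as
    [{'label': 'he-IL', 'score': 0.99...}] and return
    (language, region, score) for the top-scoring prediction."""
    top = max(results, key=lambda r: r["score"])
    language, region = top["label"].split("-")
    return language, region, top["score"]

# With the output shown in the README:
print(parse_prediction([{'label': 'he-IL', 'score': 0.9998375177383423}]))
# → ('he', 'IL', 0.9998375177383423)
```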
config.json
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:cc8e70262f68a7555aed1c9836f1226de164e611212f23703995b6515127935d
+size 2626
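The `config.json` tracked here is a Git LFS pointer file rather than the JSON itself: it records only the blob's SHA-256 digest and byte size, per the LFS spec v1 named on its first line. A sketch of how such a pointer is derived from file contents (`lfs_pointer` is a hypothetical helper, not part of git-lfs tooling):

```python
import hashlib

def lfs_pointer(data: bytes) -> str:
    """Build a Git LFS pointer (spec v1) for a blob: the pointer
    stores the blob's sha256 digest and size in bytes, as in the
    config.json diff above."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

print(lfs_pointer(b"hello"))
```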
test.py
ADDED

@@ -0,0 +1,7 @@
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
+model_name = 'qanastek/51-languages-classifier'
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
+res = classifier("פרק הבא בפודקאסט בבקשה")  # Hebrew: "Next episode of the podcast, please"
+print(res)
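For a single-label model like this one, `TextClassificationPipeline` applies a softmax to the model's logits and reports the top label, which is how `test.py` arrives at a single `{'label', 'score'}` dict. A dependency-free sketch of that post-processing step (the two-entry `id2label` below is a made-up stand-in for the model's 51 labels):

```python
import math

def postprocess(logits, id2label):
    """Softmax over raw logits, then return the top label and its
    probability, mirroring the pipeline's default behaviour for
    single-label classification."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up two-label stand-in for the 51-language head:
print(postprocess([2.0, 0.5], {0: "he-IL", 1: "fr-FR"}))
```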