Canine for Language Identification
A Canine model trained on the WiLI-2018 dataset to identify the language of a text.
Preprocessing
- 10% of the training data held out as a validation set via stratified sampling
- max sequence length: 512
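The stratified validation split described above can be sketched in plain Python. This is a minimal illustration, not the code used to train this model: the function name `stratified_split` and the seed value are assumptions, and the card does not say which tool performed the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices), preserving per-label proportions."""
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # seed is an assumption, not from the model card
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every label.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

Because every label contributes the same fraction, a dataset with balanced classes keeps balanced classes in both the training and the validation part.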
Hyperparameters
- epochs: 4
- learning-rate: 3e-5
- batch size: 16
- gradient_accumulation: 4
- optimizer: AdamW with default settings
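With gradient accumulation, the optimizer updates once every 4 batches, so the effective batch size is larger than the per-step batch size. The sketch below works through that arithmetic. The single-device setup and the training set size (the 117,500-paragraph WiLI-2018 training split minus the 10% validation part) are assumptions that the card does not state.

```python
import math

# Values taken from the hyperparameter list above.
EPOCHS = 4
BATCH_SIZE = 16
GRADIENT_ACCUMULATION = 4

# Assumptions (not stated in the model card): one device, and the WiLI-2018
# training split of 117,500 paragraphs minus the 10% validation set.
NUM_DEVICES = 1
TRAIN_EXAMPLES = 117_500 - 11_750

# Number of examples that contribute to each optimizer update.
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch_size)
total_steps = steps_per_epoch * EPOCHS

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total: {total_steps}")
```

Under these assumptions each update sees 64 examples; the exact step count depends on how the training loop handles the final partial batch.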
Test Results
- Accuracy: 94.92%
- Macro F1-score: 94.91%
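For reference, the two metrics can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions, not the evaluation script behind the reported numbers; the function names are assumptions. Macro F1 averages the per-language F1 scores with equal weight, so a rare or difficult language counts as much as an easy one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```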
Inference
The following function builds a dictionary that maps each label id to the English name of its language:
import datasets
import pycountry


def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in iso-639-3 from wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # create dictionary from data set labels to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
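Once the model has produced one score (logit) per label, the mapping can turn the highest-scoring ids into readable names. The sketch below is a hedged illustration in plain Python: `top_languages` and the toy mapping are hypothetical, and it assumes the logits are ordered by the same label ids that `int_to_lang()` enumerates.

```python
import math


def top_languages(logits, lab_to_lang, k=3):
    """Return the k most probable (language name, probability) pairs."""
    # Numerically stable softmax over the raw scores.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank label ids from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(lab_to_lang[i], probs[i]) for i in ranked[:k]]


# Toy three-label mapping for illustration only; the real mapping
# comes from int_to_lang() and covers every label in the dataset.
toy_mapping = {0: "English", 1: "German", 2: "French"}
print(top_languages([0.2, 3.1, 1.0], toy_mapping, k=2))
```

In practice the list of logits would come from the classification head of the model, and `toy_mapping` would be replaced by the dictionary returned by `int_to_lang()`.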
Credit to
@article{clark-etal-2022-canine,
title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
author = "Clark, Jonathan H. and
Garrette, Dan and
Turc, Iulia and
Wieting, John",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.5",
doi = "10.1162/tacl_a_00448",
pages = "73--91",
abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
author = {Thoma, Martin},
title = {{WiLI-2018 - Wikipedia Language Identification
database}},
month = jan,
year = 2018,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.841984},
url = {https://doi.org/10.5281/zenodo.841984}
}