---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
- multilingual
license: apache-2.0
tags:
- Language Identification
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
---

# Canine for Language Identification

A Canine model fine-tuned on the WiLI-2018 dataset to identify the language of a text.
## Preprocessing

- 10% of the training data, sampled with stratification, held out as the validation set
- max sequence length: 512
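
A minimal sketch of these steps, assuming the Hugging Face `datasets` and `transformers` APIs and the `google/canine-s` base checkpoint (the base checkpoint, random seed, and the `sentence` column name are assumptions for illustration, not taken from this card):

```python
import datasets
from transformers import AutoTokenizer

# Load WiLI-2018 and hold out a stratified 10% of the training data for validation.
dataset = datasets.load_dataset('wili_2018')
split = dataset['train'].train_test_split(test_size=0.1, stratify_by_column='label', seed=42)
train_ds, val_ds = split['train'], split['test']

# CANINE operates on characters directly; truncate inputs to 512 characters.
tokenizer = AutoTokenizer.from_pretrained('google/canine-s')

def tokenize(batch):
    return tokenizer(batch['sentence'], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)
```
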
## Hyperparameters

- epochs: 4
- learning rate: 3e-5
- batch size: 16
- gradient accumulation steps: 4
- optimizer: AdamW with default settings
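
Assuming the Hugging Face `Trainer` was used, the settings above translate roughly into the following configuration (the output directory and evaluation schedule are placeholders, not taken from this card):

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the hyperparameters listed above.
# AdamW with default settings is already the Trainer's default optimizer.
training_args = TrainingArguments(
    output_dir='canine-wili-2018',      # placeholder output directory
    num_train_epochs=4,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,      # effective batch size 16 * 4 = 64
    evaluation_strategy='epoch',        # evaluation schedule is an assumption
)
```
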
## Test Results

- Accuracy: 94.92%
- Macro F1-score: 94.91%
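
For reference, a minimal illustration of how these two metrics are typically computed with scikit-learn (not necessarily the exact evaluation script used for this model):

```python
from sklearn.metrics import accuracy_score, f1_score

def report(labels, preds):
    """Accuracy and macro-averaged F1 from true and predicted label ids."""
    return {
        'accuracy': accuracy_score(labels, preds),
        'macro_f1': f1_score(labels, preds, average='macro'),
    }
```
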
## Inference

A helper that builds a dictionary mapping each label id to an English language name:
```python
import datasets
import pycountry


def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # Names for labels that are not ISO 639-3 codes, taken from Wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # Map each dataset label id to a language name
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
```
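
And a sketch of how that mapping might be combined with the model for prediction (the checkpoint path below is a placeholder for this model's repository id):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'path/to/this-checkpoint'  # placeholder: replace with this model's repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
lab_to_lang = int_to_lang()

text = 'Bonjour tout le monde'
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors='pt')
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1).item()
print(lab_to_lang[pred])  # e.g. 'French'
```
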
## Credits
```bibtex
@article{clark-etal-2022-canine,
title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
author = "Clark, Jonathan H. and
Garrette, Dan and
Turc, Iulia and
Wieting, John",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.5",
doi = "10.1162/tacl_a_00448",
pages = "73--91",
abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
```

```bibtex
@dataset{thoma_martin_2018_841984,
author = {Thoma, Martin},
title = {{WiLI-2018 - Wikipedia Language Identification
database}},
month = jan,
year = 2018,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.841984},
url = {https://doi.org/10.5281/zenodo.841984}
}
```