File size: 3,005 Bytes

d9e584d

---
language: 
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
---
# Model Description

The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) fine-tuned with the [conll2003](https://huggingface.co/datasets/conll2003) dataset in English.

- **Developed by:** See [associated paper](https://arxiv.org/abs/1911.02116)
- **Model type:** Multi-lingual language model
- **Language(s) (NLP) or Countries (images):** XLM-RoBERTa is a multilingual model trained on 100 different languages; see [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr) for full list; model is fine-tuned on a dataset in English
- **Related Models:** [RoBERTa](https://huggingface.co/roberta-base), [XLM](https://huggingface.co/docs/transformers/model_doc/xlm)
    - **Parent Model:** [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large)

Hong Kong Legal Information Institute [HKILL](https://www.hklii.hk/eng/) is a free, independent, non-profit document database providing the public with legal information relating to Hong Kong.  We finetune the XLM-RoBERTa on the HKILL datasets. It contains docments 

# Uses

The model is a pretrained-finetuned language model. The model can be used for document classification, Named Entity Recognition (NER), especially on legal domain.
```python
>>> from transformers import pipeline,AutoTokenizer,AutoModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("hklegal-xlm-r-large")
>>> model = AutoModelForTokenClassification.from_pretrained("hklegal-xlm-r-large")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")
```

# Citation

**BibTeX:**

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```