HiTZ Center’s Catalan-Basque machine translation model

Model description

This model was trained from scratch using Marian NMT on a combination of Catalan-Basque datasets totalling 11,224,976 sentence pairs. Of these, 1,531,980 sentence pairs were parallel data collected from the web, while the remaining 9,692,996 were synthetic parallel data created using the ES-CA translator from the Aina project. The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.

Developed by: HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
Model type: translation
Source Language: Catalan
Target Language: Basque
License: apache-2.0

Intended uses and limitations

You can use this model for machine translation from Catalan to Basque.

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModelForSeq2SeqLM, MarianTokenizer

# Sentences to translate from Catalan to Basque
src_text = ["això és una prova"]

model_name = "HiTZ/mt-hitz-ca-eu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the batch, generate the translations and decode them back to text
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])

The recommended environments include the following transformers versions: 4.12.3, 4.15.0, 4.26.1
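To check whether the installed release matches one of the versions listed above, something like the following sketch could be used. The names `is_recommended` and `check_environment` are hypothetical helpers introduced here for illustration, and the lookup assumes the library was installed under the distribution name `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed as recommended in this model card
RECOMMENDED_VERSIONS = {"4.12.3", "4.15.0", "4.26.1"}


def is_recommended(installed: str) -> bool:
    """Return True if the version string is one of the recommended ones."""
    return installed.strip() in RECOMMENDED_VERSIONS


def check_environment() -> str:
    """Describe the installed transformers version, if any."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    status = "recommended" if is_recommended(installed) else "not in the recommended list"
    return f"transformers {installed} ({status})"


if __name__ == "__main__":
    print(check_environment())
```

A version outside this list is not necessarily incompatible; the card only names these three as recommended.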

Training Details

Training Data

The Catalan-Basque data collected from the web was a combination of the following datasets:

Dataset	Sentences before cleaning
CCMatrix v1	1,083,677
XLENT	219,566
WikiMatrix	77,233
GNOME	14,828
KDE4	93,787
QED	6,554
TED2020 v1	4,469
OpenSubtitles	29,114
Ubuntu	2,752
Total	1,531,980

The 9,692,996 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into Catalan using the ES-CA translator from the Aina project.
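As a sanity check on the figures above, the per-dataset counts can be summed and combined with the synthetic data. This is only a small arithmetic sketch; the variable names are illustrative and the numbers are copied from this card.

```python
# Sentence counts before cleaning, copied from the table above
web_datasets = {
    "CCMatrix v1": 1_083_677,
    "XLENT": 219_566,
    "WikiMatrix": 77_233,
    "GNOME": 14_828,
    "KDE4": 93_787,
    "QED": 6_554,
    "TED2020 v1": 4_469,
    "OpenSubtitles": 29_114,
    "Ubuntu": 2_752,
}

# Synthetic pairs produced with the ES-CA translator, as stated in the card
SYNTHETIC_PAIRS = 9_692_996

web_total = sum(web_datasets.values())
grand_total = web_total + SYNTHETIC_PAIRS
synthetic_share = SYNTHETIC_PAIRS / grand_total

print(f"Web-collected pairs: {web_total:,}")    # 1,531,980
print(f"All training pairs:  {grand_total:,}")  # 11,224,976
print(f"Synthetic share:     {synthetic_share:.1%}")
```

The totals match the 1,531,980 and 11,224,976 sentence pairs reported in the model description.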

Training Procedure

Preprocessing

After concatenation, all datasets are cleaned and deduplicated using bifixer (Ramírez-Sánchez et al., 2020) to identify repetitions and fix encoding problems, and LaBSE embeddings are used to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score of less than 0.5 is removed. The filtered corpus is composed of 10,582,279 parallel sentences.
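The similarity filter described above can be sketched as follows. This is a minimal illustration and not the authors' actual pipeline: the bifixer step is not shown, `filter_pairs` and `cosine` are hypothetical helper names, and the embedding function is passed in as a parameter so the logic does not depend on any particular library. Cosine similarity is assumed as the similarity score.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(
    pairs: List[Pair],
    embed: Callable[[List[str]], Sequence[Vector]],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep only pairs whose source/target embeddings reach the threshold."""
    src_vecs = embed([src for src, _ in pairs])
    tgt_vecs = embed([tgt for _, tgt in pairs])
    return [
        pair
        for pair, u, v in zip(pairs, src_vecs, tgt_vecs)
        if cosine(u, v) >= threshold  # scores below 0.5 are removed
    ]
```

With the sentence-transformers package installed, `SentenceTransformer("sentence-transformers/LaBSE").encode` could plausibly be passed as `embed`; that choice of package is an assumption, since the card does not say how the LaBSE embeddings were computed.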

Tokenization

All data is tokenized using sentencepiece, with a 32,000-token sentencepiece model learned from the combination of all filtered training data. This sentencepiece model is included.

Evaluation

Variables and metrics

We use BLEU (higher is better) and TER (lower is better) scores for evaluation on the following test sets: Flores-200, TaCon and NTREX.

Evaluation results

Below are the evaluation results for machine translation from Catalan to Basque, compared with Google Translate, NLLB-200 3.3B and NLLB-200's distilled 1.3B variant:

BLEU scores

Test set	Google Translate	NLLB 1.3B	NLLB 3.3B	mt-hitz-ca-eu
Flores 200 devtest	18.0	13.2	12.9	17.2
TaCon	13.2	11.8	11.2	14.0
NTREX	13.8	11.1	10.5	14.0
Average	15.0	12.0	11.5	15.1

TER scores

Test set	Google Translate	NLLB 1.3B	NLLB 3.3B	mt-hitz-ca-eu
Flores 200 devtest	63.1	76.5	70.8	65.0
TaCon	65.0	76.5	72.1	48.4
NTREX	69.4	79.4	75.5	69.7
Average	65.8	77.5	72.8	61.0
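The "Average" rows can be reproduced from the per-test-set scores with a short script. This is only an arithmetic sketch over the numbers in the two tables; the helper name `column_averages` is illustrative, and the averages are assumed to be plain unweighted means rounded to one decimal.

```python
from statistics import mean
from typing import Dict, List

SYSTEMS = ["Google Translate", "NLLB 1.3B", "NLLB 3.3B", "mt-hitz-ca-eu"]

# Scores copied from the tables above, one list per test set, ordered as SYSTEMS
BLEU_SCORES = {
    "Flores 200 devtest": [18.0, 13.2, 12.9, 17.2],
    "TaCon": [13.2, 11.8, 11.2, 14.0],
    "NTREX": [13.8, 11.1, 10.5, 14.0],
}
TER_SCORES = {
    "Flores 200 devtest": [63.1, 76.5, 70.8, 65.0],
    "TaCon": [65.0, 76.5, 72.1, 48.4],
    "NTREX": [69.4, 79.4, 75.5, 69.7],
}


def column_averages(table: Dict[str, List[float]]) -> List[float]:
    """Unweighted mean per system across test sets, rounded to one decimal."""
    return [round(mean(column), 1) for column in zip(*table.values())]


bleu_avg = column_averages(BLEU_SCORES)
ter_avg = column_averages(TER_SCORES)

# BLEU: higher is better; TER: lower is better
best_bleu = SYSTEMS[bleu_avg.index(max(bleu_avg))]
best_ter = SYSTEMS[ter_avg.index(min(ter_avg))]

print(dict(zip(SYSTEMS, bleu_avg)))
print(dict(zip(SYSTEMS, ter_avg)))
print(best_bleu, best_ter)
```

On average, mt-hitz-ca-eu has the highest BLEU (15.1) and the lowest TER (61.0) of the four systems, although Google Translate scores better on Flores 200 devtest under both metrics.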

Additional information

Author

HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)

Contact information

For further information, send an email to [email protected]

Licensing information

This work is licensed under the Apache License, Version 2.0.

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (Funded by EU – NextGenerationEU) within the framework of the ILENIA project, with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.

Disclaimer

The models published in this repository are intended for general-purpose use and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence. In no event shall the owner and creator of the models (HiTZ Research Center) be liable for any results arising from the use made by third parties of these models.