# Model Card for TRIBBLE - Translating Iberian Languages Based on Limited E-resources
## Model Description
TRIBBLE is a machine translation model specifically fine-tuned for low-resource Iberian languages as part of the WMT24 Shared Task. It translates from Spanish (spa_Latn) to Aragonese (arg_Latn), Asturian (ast_Latn), and Aranese (arn_Latn), providing an essential tool for these endangered languages within the Romance language family.
The model builds on distilled NLLB-200 with 600M parameters, integrating additional tokens for Aragonese and Aranese to extend the multilingual translation capabilities of the original NLLB-200 model.
## Model Details
- Architecture: Distilled NLLB-200 (600M parameters)
- Training Data: Processed subsets of OPUS and PILAR corpora, alongside bilingual and monolingual data sources.
- Control Tokens: `arg_Latn` for Aragonese and `arn_Latn` for Aranese, initialized with the `spa_Latn` and `oci_Latn` embeddings, respectively, based on linguistic proximity.
- Optimization: Fine-tuned with the Adafactor optimizer and a custom data-processing pipeline.
## Intended Use
This model is intended for translation tasks involving low-resource Iberian languages:
- Translating from Spanish to Aragonese, Asturian, and Aranese.
- Applications in cultural preservation, language research, and digital inclusion for endangered languages.
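A hypothetical inference sketch, assuming the checkpoint is hosted as `igorktech/tribble-600m` and follows the standard NLLB interface in `transformers` (where `src_lang` sets the source language and `forced_bos_token_id` selects the target language token):

```python
# Mapping from target-language names to the model's NLLB-style control tokens.
TARGET_CODES = {
    "aragonese": "arg_Latn",
    "aranese": "arn_Latn",
    "asturian": "ast_Latn",
}

def target_code(language: str) -> str:
    """Return the NLLB control token for a supported target language."""
    return TARGET_CODES[language.lower()]

def translate(text: str, target: str) -> str:
    """Translate Spanish text to a target language (downloads the checkpoint on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("igorktech/tribble-600m", src_lang="spa_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained("igorktech/tribble-600m")
    inputs = tokenizer(text, return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_code(target)),
        max_length=256,
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

# Example usage:
# translate("Buenos días", "aragonese")
```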
## Evaluation
TRIBBLE was evaluated using BLEU and chrF metrics on the WMT24 devtest set:
| Language Direction | Baseline (Apertium) | TRIBBLE (Constrained) |
|---|---|---|
| Spanish → Aragonese (BLEU) | 61.1 | 49.2 |
| Spanish → Aragonese (chrF) | 79.3 | 73.6 |
| Spanish → Aranese (BLEU) | 28.8 | 23.9 |
| Spanish → Aranese (chrF) | 49.4 | 46.1 |
| Spanish → Asturian (BLEU) | 17.0 | 17.9 |
| Spanish → Asturian (chrF) | 50.8 | 50.5 |
While the Apertium baseline generally outperformed TRIBBLE, the model achieved a slightly higher BLEU score for Asturian and a comparable chrF score. The constrained setting highlights TRIBBLE's potential for low-resource translation with efficient use of limited data.
## Citation
If you use TRIBBLE in your work, please cite:
```bibtex
@InProceedings{kuzmin-EtAl:2024:WMT,
  author    = {Kuzmin, Igor and Przybyła, Piotr and McGill, Euan and Saggion, Horacio},
  title     = {TRIBBLE - TRanslating IBerian languages Based on Limited E-resources},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  month     = {November},
  year      = {2024},
  address   = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {955--959},
  abstract  = {In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn\_Latn, arg\_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.},
  url       = {https://www2.statmt.org/wmt24/pdf/2024.wmt-1.94.pdf}
}
```