opus-mt-tc-bible-big-mul-mul

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from multiple languages (mul) to multiple languages (mul). Note that many of the listed languages are not well supported by the model because training data is very limited for the majority of them. Translation quality varies widely, and for a large number of language pairs the model will not work at all.

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines use the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-08-17
License: Apache-2.0
Language(s):
- Source Language(s): aar abk ace ach acm ady afb afh afr aii ajp aka akl aln alt amh ami amu ang anp aoz apc ara arc arg arq arz asm ast atj ava avk awa ayl aze azz bak bal bam ban bar bas bcl bel bem ben bho bik bis bod bom bos bpy bre brx bua bug bul bvy byn bzt cak cat cay cbk ceb ces cha che chg chm chq chr chu chv chy cjk cjp cjy ckb cmn cnh cni cnr cop cor cos cre crh crk crs csb cym dag dan deu dik din diq div dje djk dng dop drt dsb dtp dty dws dyu dzo efi egl ell emx eng enm epo est eus evn ewe ext fao fas fij fil fin fkv fon fra frm fro frp frr fry fuc ful fur gag gbm gcf gil gla gle glg glk glv gor gos got grc grn gsw guc guj guw hat hau haw hbo hbs heb her hif hil hin hmn hne hnj hoc hrv hrx hsb hsn hun hus hye hyw iba ibo ido igs iii ike iku ile ilo ina ind inh ipk isl ita ixl izh jaa jak jam jav jbo jdt jpa jpn kaa kab kac kal kam kan kas kat kau kaz kbd kbp kea kek kha khm kik kin kir kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur kxi laa lad lah lao lat lav lbe ldn lez lfn lij lim lin lit liv lkt lld lmo lou lrc ltz lua lug luo lus lut luy lzz mad mag mah mai mal mam mar max mdf meh mfa mfe mgm mic mix mkd mlg mlt mnc mni mnr mnw moh mol mon mos mri mrj msa mvv mwl mww mya myv mzn nap nau nav nbl nch nde nds nep new ngt ngu nhg nhn nia niu nld nlv nnb nno nob nog non nov npi nqo nso nst nus nya oar oci ofs oji ood ori orm orv osp oss ota otk pag pai pal pam pan pap pau pcd pck pcm pdc pes pfl phn pih pli plt pms pmy pnt pol por pot ppk ppl prg prs pus quc qxq qya rap rhg rif rmy roh rom ron rue run rup rus sag sah san sat scn sco sdh ses sgs shi shn shs shy sin sjn skr slk slv sma sme sml smn smo sna snd som sot spa sqi srd srn srp ssw stq sun swa swc swe swg swh syc syl syr szl tah tam taq tat tcy tel tet tgk tgl tha thv tig tir tkl tlh tly tmh tmr tmw toi ton tpi tpw trs trv tsn tso tts tuk tum tur tvl twi tyj tyv tzl tzm udm uig ukr umb urd usp uzb vec ven vep vie vls vol vot vro wae wal war wln wol wuu xal xcl xho xmf yid yor 
yua yue zam zap zea zgh zha zlm zsm zul zza
- Target Language(s): aar abk ace ach acm ady afb afh_Latn afr aii_Syrc ajp aka akl_Latn aln alt amh ami ami_Latn amu_Latn ang_Latn anp aoz apc ara arc arg arq arz asm ast atj ava avk_Latn awa ayl aze_Cyrl aze_Latn azz azz_Latn bak bal bal_Latn bam_Latn ban bar bas bcl bel bem ben bho bik bis bod bom_Latn bos_Cyrl bos_Latn bpy bre brx bua bug bul bvy_Latn byn bzt_Latn cak cak_Latn cat cay cbk_Latn ceb ces cha che chg_Arab chg_Latn chm chq_Latn chr chu chv chy cjk cjk_Latn cjp_Latn cjy_Hans cjy_Hant ckb cmn cmn_Hans cmn_Hant cnh cnh_Latn cni_Latn cnr cnr_Latn cop cop_Copt cor cos cre cre_Latn crh crk crs csb csb_Latn cym dag_Latn dan deu dik din diq div dje djk djk_Latn dng dop_Latn drt_Latn dsb dtp dty dws_Latn dyu dzo efi egl ell emx_Latn eng enm_Latn epo est eus evn ewe ext fao fas fij fil fin fkv_Latn fon fra frm_Latn fro_Latn frp frr fry fuc ful fur gag gbm gcf gcf_Latn gil gla gle glg glk glv gor gos got got_Goth grc grc_Grek grn gsw guc guj guw guw_Latn hat hau_Latn haw hbo_Hebr hbs hbs_Cyrl hbs_Latn heb her hif_Latn hil hin hin_Latn hmn hne hnj hoc hoc_Wara hrv hrx_Latn hsb hsn hun hus hus_Latn hye hyw hyw_Armn hyw_Latn iba ibo ido_Latn igs_Latn iii ike_Latn iku_Latn ile ile_Latn ilo ina_Latn ind inh inh_Latn ipk isl ita ixl_Latn izh jaa jaa_Bopo jaa_Hira jaa_Kana jaa_Yiii jak_Latn jam jav jav_Java jbo jbo_Cyrl jbo_Latn jdt_Cyrl jpa_Hebr jpn kaa kab kac kal kam kan kas_Arab kas_Deva kat kau kaz kaz_Cyrl kbd kbp kbp_Cans kbp_Ethi kbp_Geor kbp_Grek kbp_Hang kbp_Latn kbp_Mlym kbp_Yiii kea kek kek_Latn kha khm kik kin kir_Cyrl kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur_Arab kur_Cyrl kur_Latn kxi_Latn laa_Latn lad lad_Latn lah lao lat lat_Latn lav lbe ldn_Latn lez lfn_Cyrl lfn_Latn lij lim lin lit liv_Latn lkt lld_Latn lmo lou_Latn lrc ltz lua lug luo lus lut_Latn luy lzz_Geor lzz_Latn mad mag mah mai mal mam mam_Latn mar max_Latn mdf meh_Latn mfa mfe mgm_Latn mic mix mix_Latn mkd mlg mlt mnc_Mong mni mnr_Latn mnw moh mol mon mos mri mrj 
msa_Arab msa_Latn mvv_Latn mwl mww mya myv mzn nap nau nav nbl nch nde nds nep new ngt_Latn ngu ngu_Latn nhg_Latn nhn_Latn nia niu nld nlv_Latn nnb_Latn nno nob nog non nov_Latn npi nqo nso nst_Latn nus nya oar_Hebr oar_Syrc oci ofs_Latn oji_Latn ood_Latn ori orm orv_Cyrl osp_Latn oss ota_Arab ota_Latn ota_Rohg ota_Syrc ota_Thaa ota_Yezi otk otk_Orkh pag pai_Latn pal pam pan pan_Guru pap pau pcd pck_Latn pcm pdc pes pfl phn_Phnx pih pih_Latn pli plt pms pmy_Latn pnt_Grek pol por pot_Latn ppk_Latn ppl_Latn prg_Latn prs pus quc qxq_Arab qxq_Latn qya qya_Latn rap rhg_Latn rif_Latn rmy roh rom rom_Cyrl ron rue run rup rus sag sah san san_Deva sat sat_Latn scn sco sdh ses sgs shi_Latn shn shs_Latn shy_Latn sin sjn_Latn skr slk slv sma sme sml_Latn smn smo sna snd_Arab som sot spa sqi srd srn srp_Cyrl ssw stq sun swa swc swe swg swh syc_Syrc syl_Sylo syr szl tah tam taq tat tcy tel tet tgk_Cyrl tgk_Latn tgl tgl_Latn tgl_Tglg tha thv tig tir tkl tlh tlh_Latn tly_Latn tmh tmr_Hebr tmw_Latn toi toi_Latn ton tpi tpw_Latn trs trs_Latn trv tsn tso tts tuk tuk_Cyrl tuk_Latn tum tur tvl twi tyj_Latn tyv tzl tzl_Latn tzm_Latn tzm_Tfng udm uig uig_Arab uig_Cyrl uig_Latn ukr umb urd usp_Latn uzb_Cyrl uzb_Latn vec ven vep vie vls vol_Latn vot vot_Latn vro wae wal war wln wol wuu xal xcl_Armn xcl_Latn xho xmf yid yor yua yue_Hans yue_Hant zam zap zea zgh zha zlm_Arab zlm_Latn zsm_Arab zsm_Latn zul zza
Original Model: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<<, where id is a valid target language ID (for example, >>aar<<).
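
The >>id<< prefix described above can be added programmatically. The following is a minimal sketch, not part of the OPUS-MT tooling: the helper name with_target_token and the format check are hypothetical, and the regular expression only assumes that IDs look like three lowercase letters with an optional four-letter script suffix, as in the target language list. It does not check whether the model actually supports the language.

```python
import re

# Assumed ID shape (taken from the language list, not from an official spec):
# three lowercase letters, optionally followed by "_" and a script code,
# e.g. "fra" or "aze_Cyrl".
_LANG_ID = re.compile(r"[a-z]{3}(_[A-Z][a-z]{3})?")


def with_target_token(text: str, lang_id: str) -> str:
    """Prefix `text` with the >>id<< token for the given target language."""
    if not _LANG_ID.fullmatch(lang_id):
        raise ValueError(f"unexpected language ID format: {lang_id!r}")
    return f">>{lang_id}<< {text.strip()}"


# Build a batch that sends the same sentence to two target languages.
batch = [
    with_target_token("How are you?", "ceb"),
    with_target_token("How are you?", "aze_Cyrl"),
]
print(batch)
```

The resulting strings can be passed to the tokenizer in place of hand-written inputs.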

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

Also note that many of the listed languages are not well supported by the model because training data is very limited for the majority of them. Translation quality varies widely, and for a large number of language pairs the model will not work at all.

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the target-language token (>>id<<).
src_text = [
    ">>rus<< You'd better not speak to Tom about that.",
    ">>ceb<< How are you?"
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-mul-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Лучше бы не поговорить с Томом об этом.
#     Sa unsang paagi ikaw?

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")
print(pipe(">>rus<< You'd better not speak to Tom about that."))

# expected output: Лучше бы не поговорить с Томом об этом.
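
Because many listed languages are poorly supported, it can help to confirm that a target token exists before translating. The sketch below is an illustration under one assumption: that the >>id<< tokens appear as entries in the tokenizer vocabulary returned by get_vocab(). The function name target_ids_in_vocab and the stub tokenizer are hypothetical and exist only so the example runs without downloading the model.

```python
import re

# Matches vocabulary entries of the form >>id<< and captures the ID.
_TOKEN = re.compile(r">>(.+)<<")


def target_ids_in_vocab(tokenizer) -> set:
    """Collect the IDs of all >>id<< entries found in the tokenizer vocabulary."""
    ids = set()
    for token in tokenizer.get_vocab():
        match = _TOKEN.fullmatch(token)
        if match:
            ids.add(match.group(1))
    return ids


class _StubTokenizer:
    """Stand-in for a real tokenizer so the sketch runs without a download."""

    def get_vocab(self):
        return {"</s>": 0, ">>rus<<": 1, ">>ceb<<": 2, "▁Tom": 3}


known = target_ids_in_vocab(_StubTokenizer())
print(sorted(known))
print("rus" in known, "xyz" in known)
```

With the real model, the tokenizer loaded by MarianTokenizer.from_pretrained would be passed in place of the stub. A token being present in the vocabulary does not imply good translation quality for that language.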

Training

Data: opusTCv20230926+bt+jhubc (source)
Pre-processing: SentencePiece (spm64k,spm64k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.test.txt
test set scores: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.51760	28.1	10000	73531
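
To compare scores across models, rows like the one above can be loaded into a structured form. This is a minimal sketch that assumes the score files use the same tab-separated columns as the table shown here; the function name read_scores and the output field names are hypothetical.

```python
import csv
import io

# Header and row are copied from the table above. The tab-separated layout
# is an assumption about how the linked score files are structured.
SCORES_TSV = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "multi-multi\ttatoeba-test-v2020-07-28-v2023-09-26\t0.51760\t28.1\t10000\t73531\n"
)


def read_scores(text: str) -> list:
    """Parse tab-separated benchmark rows into dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "langpair": row["langpair"],
            "testset": row["testset"],
            "chrF": float(row["chr-F"]),
            "BLEU": float(row["BLEU"]),
            "sentences": int(row["#sent"]),
            "words": int(row["#words"]),
        })
    return rows


scores = read_scores(SCORES_TSV)
print(scores[0]["langpair"], scores[0]["chrF"], scores[0]["BLEU"])
```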

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Wed Oct 9 19:20:34 EEST 2024
port machine: LM0-400-22516.local
