
opus-mt-tc-bible-big-mul-deu_eng_nld


Model Details

Neural machine translation model for translating from multiple languages (mul) to German, English and Dutch (deu+eng+nld).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-08-18
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): aai aar aau abi abk acd ace acf ach acm acn acr ade adj ady aeu aey afb afh afr agd agn agu ahk aia aka akh akl akp alj aln alp alq alt alz ame amh ami amk amu ang ann anp anv aoz apc apr apu ara arc arg arq arz asm aso ast atg atj atq aui auy ava avk avn avu awa awb awx aze azg azz bak bal bam ban bar bas bav bba bbo bbr bcl bcw bef beh bel bem ben bep bex bfa bfd bfo bgr bhl bho bhz bib bik bim bis biv bjr bjv bku bkv blh blt blz bmh bmk bmq bmu bmv bnp bod boj bom bos bov box bpr bps bpy bqc bqj bqp bre bru brx bss btd bth bto bts btt btx bua bud bug buk bul bus bvy bwq bwu byn bzd bzh bzj bzt caa cab cac cak cat cay cbk cce cco ceb ces cfm cgc cha che chf chm chq chr chu chv chy chz cjk cjo cjp cjv cjy ckb cko cle cme cmn cmo cmr cnh cni cnl cnr cnt cnw cok cop cor cos cot cpa cpu cre crh crn crs crx csb csk cso csy cta ctd ctp ctu cuc cui cuk cut cux cwe cwt cya cym czt daa dad dag dah dan ded deu dga dgi dig dik din diq div dje djk dng dni dnj dob dop drt dsb dsh dtp dty dug dws dww dyi dyo dyu dzo efi egl ell emi eng enm epo ess est eus ewe ext fai fal fao far fas fij fil fin fkv fon for fra frd frm frp frr fry fuc ful fur gag gah gaw gbm gcf gde gej gfk ghs gil gkn gla gle glg glk glv gnd gng gog gor gos got gqr grc grn gsw guc gud guh guj guo gur guw gux gvf gvl gwi gwr gym gyr hag hat hau haw hay hbo hbs hch heb heh her hif hig hil hin hla hlt hmn hne hnj hnn hns hoc hot hrv hrx hsb hsn hui hun hus hvn hwc hye hyw iba ibo icr ido ifa ifb ife ifk ifu ify ign iii ike iku ile ilo imo ina ind inh ino iou ipi ipk iri irk iry isl ita itv ium ixl izh izr jaa jac jak jam jav jbo jbu jdt jmc jpa jpn jun jvn kaa kab kac kal kam kan kao kas kat kau kaz kbd kbm kbp kdc kdj kdl kdn kea kek ken keo ker keu kew kez kgf kgk kha khm khz kia kik kin kir kjb kje kjh kjs kki kkj kle kma kmb kmg kmh kmo kmr kmu knc kne knj knk kno kog koi kok kom kon kpf kpg kpr kpv kpw kpz kqe kqf kqp kqw krc kri krj krl kru ksb ksh ksr ktb ktj kua kub kud kue kum kur kus kvn kwf kxc kxm kyc kyf kyg kyq kzf laa lac lad lah lao las lat lav law lbe lcm ldn lee lef lem leu lew lex lez lfn lgg lhu lia lid lif lij lim lin lip lit liv ljp lkt lld lln lme lmo lnd lob lok lon lou lrc lsi ltz lua luc lug luo lus lut luy lzz maa mad mag mah mai maj mak mal mam maq mar mau maw max maz mbb mbf mbt mcb mcp mcu mda mdf med mee meh mek men meq mfe mfh mfi mfk mfq mfy mgd mgm mgo mhi mhl mhx mhy mib mic mie mif mig mih mil mio mit mix miy miz mjc mkd mks mlg mlh mlp mlt mmo mmx mna mnb mnf mnh mni mnr mnw moa mog moh mol mon mop mor mos mox mpg mpm mpt mpx mqb mqj mri mrj mrw msa msm mta muh mux muy mva mvp mvv mwc mwl mwm mwv mww mxb mxt mya myb myk myu myv myw myx mzk mzm mzn mzw mzz naf nak nap nas nau nav nbl nca nch ncj ncl ncu nde ndo nds ndz neb nep new nfr ngt ngu nhe nhg nhi nhn nhu nhw nhx nhy nia nif nii nij nim nin niu njm nlc nld nlv nmz nnb nnh nno nnw nob nog non nop nor not nou nov npi npl npy nqo nsn nso nss nst nsu ntm ntp ntr nuj nus nuy nwb nwi nya nyf nyn nyo nyy nzi oar obo oci ofs oji oku okv old omw ood opm ori orm orv osp oss ota ote otm otn otq ozm pab pad pag pai pal pam pan pao pap pau pbi pbl pck pcm pdc pes pfl phn pib pih pio pis pkb pli pls plt plw pmf pms pmy pne pnt poe poh pol por pot ppk ppl prf prg prs ptp ptu pus pwg pww quc qya rai rap rav rej rhg rif rim rmy roh rom ron rop rro rue rug run rup rus rwo sab sag sah san sas sat sba sbd sbl scn sco sda sdh seh ses sgb sgs sgw sgz shi shk shn shs shy sig sil sin sjn skr sld slk sll slv sma sme smk sml smn smo sna snc snd 
snp snw som sot soy spa spl spp sps sqi srd srm srn srp srq ssd ssw ssx stn stp stq sue suk sun sur sus suz swa swc swe swg swh swp sxb sxn syc syl syr szb szl tab tac tah taj tam taq tat tbc tbl tbo tbz tcs tcy tel tem teo ter tet tfr tgk tgl tgo tgp tha thk tig tik tim tir tkl tlb tlf tlh tlj tlx tly tmc tmh tmr tmw toh toi toj ton tpa tpi tpm tpw tpz trc trn trq trs trv tsn tso tsw ttc tte ttr tts tuc tuf tuk tum tur tvl twb twi twu txa tyj tyv tzh tzj tzl tzm tzo ubr ubu udm udu uig ukr umb urd usa usp uvl uzb vag vec ven vie viv vls vmw vmy vol vot vro vun wae waj wal wap war wbm wbp wed wln wmt wmw wnc wnu wob wol wsk wuu wuv xal xcl xed xho xmf xog xon xrb xsb xsi xsm xsr xtd xtm xuo yal yam yaq yaz yby ycl ycn yid yli yml yon yor yua yue yut yuw zam zap zea zgh zha zia zlm zom zsm zul zyp zza
    • Target Language(s): deu eng nld
    • Valid Target Language Labels: >>deu<< >>eng<< >>nld<< >>xxx<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-18.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<<, where id is a valid target language ID, e.g. >>deu<<. Tagging an input is simply a matter of prepending the token to the raw sentence, as in the sketch below.
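A minimal illustrative sketch (the helper name tag_for_target is hypothetical, not part of the model or the transformers library):

def tag_for_target(sentence: str, target_lang: str) -> str:
    # Prepend the sentence-initial target-language token expected by the model.
    # Valid labels for this model are "deu", "eng" and "nld".
    return f">>{target_lang}<< {sentence}"

print(tag_for_target("I don't know if it is true.", "nld"))
# >>nld<< I don't know if it is true.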

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the token that selects its target language.
src_text = [
    ">>eng<< Jedes Mädchen, das ich sehe, gefällt mir.",
    ">>nld<< I don't know if it is true."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-mul-deu_eng_nld"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding and translate it in a single generate call.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     I like every girl I see.
#     Ik weet niet of het waar is.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-deu_eng_nld")
print(pipe(">>eng<< Jedes Mädchen, das ich sehe, gefällt mir."))

# expected output: [{'translation_text': 'I like every girl I see.'}]
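Because the target language is selected per sentence by the leading token, a single pipeline call can mix target languages in one batch. A minimal sketch along the lines of the example above:

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-deu_eng_nld")

# The same source sentence, translated into two different targets in one batch.
batch = [
    ">>deu<< I don't know if it is true.",
    ">>nld<< I don't know if it is true.",
]
for out in pipe(batch):
    print(out["translation_text"])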

Training

Evaluation

langpair      testset                                chr-F     BLEU   #sent   #words
multi-multi   tatoeba-test-v2020-07-28-v2023-09-26   0.61102   41.7   10000   78944
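Scores like these can be computed with the sacrebleu package, which implements both BLEU and chr-F. A minimal sketch, assuming sacrebleu is installed (pip install sacrebleu); the sentences are illustrative placeholders, not the actual test data:

import sacrebleu

# One list of hypotheses and one list of reference streams (a single stream here).
hypotheses = ["I like every girl I see."]
references = [["I like every girl I see."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU  = {bleu.score:.1f}")
print(f"chr-F = {chrf.score / 100:.5f}")  # the table above reports chr-F on a 0-1 scale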

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 12:27:24 EEST 2024
  • port machine: LM0-400-22516.local