NLLB-200 Distilled: English-Thai Bible Translation Model

Introduction

The NLLB-200 Distilled: English-Thai Bible Translation Model is a fine-tuned version of facebook/nllb-200-distilled-600M, designed for bidirectional Bible translation between English and Thai. It translates in both the English-to-Thai and Thai-to-English directions and is intended for applications involving religious texts, where preserving context and meaning matters.

Training Dataset

This model was fine-tuned on a combination of religious and general-domain datasets, so that it handles biblical text while retaining broader vocabulary coverage:

Bible-Specific English-Thai Datasets:
- Tsunnami/en-th-bible: Bible-focused dataset providing parallel English-Thai verses.
- Tsunnami/en-th-bible-splits: A split version optimized for training.
General English-Thai Datasets for Broader Linguistic Coverage:
- Tsunnami/who-en-th: General-purpose English-Thai data to improve the model's understanding of diverse vocabulary.
- scb10x/scb_mt_enth_2020_aqdf_1k: Additional English-Thai dataset to enhance linguistic robustness.

In total, the model was trained on 32,441 rows of English-Thai text pairs, with a focus on both biblical and general language use.

Training Methodology

The NLLB-200 Distilled: English-Thai Bible Translation Model uses the default NLLB-200 tokenizer to process both English and Thai text. Training for this bidirectional model, covering both English-to-Thai and Thai-to-English translation, was completed in 7 GPU hours on NVIDIA P100-16GB (250W TDP) hardware, using PyTorch Lightning to manage the training loop. Because both directions were included in training, the model accepts input in either language and generates output in the requested target language.
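The model card does not describe how the parallel rows were turned into training examples. One common way to train a single model for both directions is to emit every English-Thai pair twice, once per direction. The sketch below illustrates that idea only; the function name make_bidirectional_examples and the dictionary keys are hypothetical, and the source-language prefix simply mirrors the inference example in this document rather than a confirmed training format.

```python
from typing import Dict, Iterable, List, Tuple

# NLLB language codes used throughout this document.
ENGLISH = "eng_Latn"
THAI = "tha_Thai"


def make_bidirectional_examples(
    pairs: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Turn (english, thai) pairs into examples for both directions.

    Assumption: each source text is prefixed with its language code,
    mirroring the inference example in this document.
    """
    examples: List[Dict[str, str]] = []
    for english, thai in pairs:
        # English -> Thai example
        examples.append(
            {"source": f"{ENGLISH} {english}", "target": thai, "target_lang": THAI}
        )
        # Thai -> English example
        examples.append(
            {"source": f"{THAI} {thai}", "target": english, "target_lang": ENGLISH}
        )
    return examples
```

With this scheme, N parallel pairs yield 2N training examples, one per direction.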

How to Use This Model for Bidirectional Bible Translation

The model can translate Bible verses in both directions: from English to Thai and from Thai to English. The following example demonstrates Thai-to-English translation; to reverse the translation direction, switch the source_lang and target_lang variables accordingly.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model and tokenizer
model_name = "suchut/nllb-200-distilled-bible-en-th"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set source and target languages for bidirectional translation
# Example 1: Thai-to-English
source_lang = "tha_Thai"  # Thai as source language
target_lang = "eng_Latn"  # English as target language

# Example input text in Thai
input_text = "ในเริ่มแรกนั้นพระเจ้าทรงเนรมิตสร้างฟ้าและแผ่นดินโลก"

# Tokenize the input text with language prefix
inputs = tokenizer(f"{source_lang} {input_text}", return_tensors="pt")

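# Generate the translation and force output to target language.
# convert_tokens_to_ids looks up the language token by name, so the result
# does not depend on where the tokenizer places its own special tokens.
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
    max_length=128
)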

# Decode and print the translated text (the label follows the chosen direction)
decoded_translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(f"Translated Text ({source_lang} to {target_lang}):", decoded_translation)

# To translate from English to Thai, simply switch source_lang and target_lang:
# source_lang = "eng_Latn"
# target_lang = "tha_Thai"
# input_text = "In the beginning, God created the heavens and the earth."
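The steps above can be wrapped in a small helper so that either direction is a single call. This is a minimal sketch, not part of the published model: the function name translate and the constants ENGLISH and THAI are hypothetical. It keeps the source-language prefix used in the example above and, as a choice made here, looks up the target language token with the tokenizer's convert_tokens_to_ids method.

```python
from typing import Any

# NLLB language codes used in the example above.
ENGLISH = "eng_Latn"
THAI = "tha_Thai"


def translate(
    text: str,
    source_lang: str,
    target_lang: str,
    model: Any,
    tokenizer: Any,
    max_length: int = 128,
) -> str:
    """Translate text from source_lang to target_lang.

    model and tokenizer are expected to be the objects loaded with
    AutoModelForSeq2SeqLM and AutoTokenizer in the example above.
    """
    if source_lang == target_lang:
        raise ValueError("source_lang and target_lang must differ")
    # Prefix the input with the source language code, as in the example above.
    inputs = tokenizer(f"{source_lang} {text}", return_tensors="pt")
    # Look up the id of the target language token by name.
    target_id = tokenizer.convert_tokens_to_ids(target_lang)
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=target_id,
        max_length=max_length,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (assumes model and tokenizer were loaded as shown above):
# verse = "In the beginning, God created the heavens and the earth."
# print(translate(verse, ENGLISH, THAI, model, tokenizer))
```

Passing the model and tokenizer as arguments keeps the helper free of global state, so the same loaded objects can be reused for many verses in either direction.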

Key Points to Note

- Bidirectional Translation: The model translates both from English to Thai and from Thai to English. Set the source_lang and target_lang variables to choose the direction.
- Language Codes: Use "eng_Latn" for English and "tha_Thai" for Thai. Prefix the input text with the source_lang code so the model knows which language it is reading.
- Forced BOS Token for Target Language: Pass forced_bos_token_id to model.generate() so that decoding begins with the target language token, which keeps the output in the intended language.

This model is intended as a tool for bilingual Bible translation, with possible applications in religious studies and cross-linguistic analysis.