Bidirectional Vietnamese Nôm Transliteration

Vietnamese Nôm, or Chữ Nôm, is a historical writing system that was used in Vietnam before the 20th century. It evolved from Chinese characters and was adapted to Vietnamese sounds and vocabulary. Scholars used Nôm for literature and communication. Nôm characters combine semantic and phonetic components to express Vietnamese concepts, so the script differs visually from Chinese writing. Today, Nôm is a specialized field of study, and efforts are under way to preserve knowledge of it. Although modern Vietnamese is written in the Latin alphabet, Nôm remains an integral part of Vietnam's cultural heritage.

A state-of-the-art, lightweight, pretrained Transformer-based encoder-decoder model for Vietnamese Nôm translation.

The model was trained on a dataset drawn from the Luc-Van-Tien book, the Tale of Kieu, the “History of Greater Vietnam” book, the “Chinh Phu Ngam Khuc” poems, the “Ho Xuan Huong” poems, corpus documents from chunom.org, and sample texts from 130 different books (Tu-Dien-Chu-Nom-Dan Giai).

The model supports bidirectional translation: it converts Nôm script to Vietnamese Latin script and Vietnamese Latin script to Nôm.
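Because one model handles both directions, a caller may want to know which direction a given input will take, for example to label results or to reject mixed-script text. The sketch below is one possible approach, not part of the model or its API: the helper names `is_han_ideograph` and `detect_direction` are hypothetical, and the check relies only on coarse Unicode block ranges for CJK ideographs, which is where Nôm characters are encoded.

```python
def is_han_ideograph(ch: str) -> bool:
    """Return True if ch lies in a CJK ideograph block (hypothetical helper)."""
    cp = ord(ch)
    return (
        0x3400 <= cp <= 0x4DBF         # CJK Unified Ideographs Extension A
        or 0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0xF900 <= cp <= 0xFAFF      # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x323AF    # supplementary-plane extensions, as one coarse range
    )


def detect_direction(text: str) -> str:
    """Guess the translation direction from the dominant script of the input."""
    han = sum(1 for ch in text if is_han_ideograph(ch))
    # Count alphabetic characters that are not ideographs (Latin letters with diacritics).
    latin = sum(1 for ch in text if ch.isalpha() and not is_han_ideograph(ch))
    if han == 0 and latin == 0:
        raise ValueError("input contains no letters or ideographs")
    return "nom_to_latin" if han > latin else "latin_to_nom"


print(detect_direction("如梅早杏遲管"))
print(detect_direction("như mai tảo hạnh trì quán"))
```

The model itself does not need this step, since the usage examples in this card call it the same way for both directions. The check is only a convenience for code that wraps the model.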

How to use

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-nom").to(device)
model.eval()

# Nôm script -> Vietnamese Latin script
src = "如梅早杏遲管"
tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device)
translate_ids = model.generate(tokenized_text, max_length=48)
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output

'như mai tảo hạnh trì quán'

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-nom").to(device)
model.eval()

# Vietnamese Latin script -> Nôm script
src = "như mai tảo hạnh trì quán"
tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device)
translate_ids = model.generate(tokenized_text, max_length=48)
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output

'如梅早杏遲舘'
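Taken together, the two examples form a round trip, and the reconstructed Nôm string differs from the original input in its last character (管 versus 舘). This is not necessarily an error, presumably because more than one character can correspond to the same Latin-script syllable. A minimal sketch for locating such differences is shown below. The function name `round_trip_diff` is hypothetical, and the strings are copied from the examples above rather than produced by calling the model.

```python
from itertools import zip_longest


def round_trip_diff(original: str, reconstructed: str) -> list[tuple[int, str, str]]:
    """List (index, original_char, reconstructed_char) for every differing position.

    If the strings differ in length, the shorter side is padded with "".
    """
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(original, reconstructed, fillvalue=""))
        if a != b
    ]


nom_in = "如梅早杏遲管"    # input of the first example
nom_out = "如梅早杏遲舘"   # output of the second example
print(round_trip_diff(nom_in, nom_out))  # [(5, '管', '舘')]
```

An empty list means the round trip reproduced the input exactly.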

Author

Phan Minh Toan