---
license: mit
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: " আমি সে বাবুর মামু বাড়ি গিছিলাম।"
---

# Regional Bengali text to IPA transcription - byT5-small

## A word of caution: the model is constantly being updated! You may see jumps in performance!

This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) for the task of generating IPA transcriptions from regional Bengali text. It was fine-tuned on the dataset of the competition ["ভাষামূল: মুখের ভাষার খোঁজে"](https://www.kaggle.com/competitions/regipa/overview) ("Bhashamul: In Search of the Spoken Language") by Bengali.AI.

Test-set scores achieved so far:

- **Word error rate (WER)**: 0.01732
- **Character error rate (CER)**: 0.01491

Supported district tokens:

- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail

## Loading & using the model

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("smji/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("smji/ben2ipa-byt5small")

# The format of the input text MUST BE:
text = " bengali_text_here"

# Tokenize, generate, and decode the IPA transcription
text_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(text_ids, max_length=1024)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

## Using the pipeline

```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="smji/ben2ipa-byt5small", device=device)

# `texts` must be in the format of:
texts = [" bengali_text_here"]

outputs = pipe(texts, max_length=1024, batch_size=8)
```

## Credits

Done by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://github.com/sadia-ahmmed), [Sahid Hossain Mustakim](https://github.com/sratul35)
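
## Evaluation metrics

The WER and CER figures reported above are standard sequence-comparison metrics. As a rough illustration only (not the competition's official scoring script, which may normalize text differently), both can be computed from a Levenshtein edit distance — a minimal sketch assuming whitespace tokenization for words:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance over whitespace-split tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: edit distance over characters."""
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, a hypothesis that gets one of three words wrong scores a WER of 1/3; a perfect match scores 0 on both metrics.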