---
license: mit
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: " আমি সে বাবুর মামু বাড়ি গিছিলাম।"
  example_title: "Narail Text"
- text: " এখন এই কুলো তার শেষ অই কুলো তার শেষ।"
  example_title: "Rangpur Text"
- text: " খয়দে সিআরের এইল্লা কি অবস্থা!"
  example_title: "Chittagong Text"
- text: " আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।"
  example_title: "Kishoreganj Text"
- text: " তারা তো ওই খারাপ খেইলাই আসে না।"
  example_title: "Narsingdi Text"
- text: " আর সব থেকে ফানি কথা হইতেছে দেখ?"
  example_title: "Tangail Text"
---

# Regional Bengali text to IPA transcription - byT5-small

## A word of caution: the model is constantly being updated! You may see jumps in performance!

This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) for generating IPA transcriptions from regional Bengali text. It was fine-tuned on the dataset of the competition [“ভাষামূল: মুখের ভাষার খোঁজে“](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

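This card lists word error rate (WER) and character error rate (CER) as its metrics. As a rough illustration of what they measure, the sketch below computes both from a Levenshtein edit distance in plain Python. The helper names `edit_distance`, `wer`, and `cer` are hypothetical, the reference is assumed to be non-empty, and the competition's official scoring may normalize or aggregate differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an element of ref
                curr[j - 1] + 1,     # insert an element of hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both metrics, and a score of 0 means the predicted transcription matches the reference exactly.
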
Scores achieved so far (test set):
- **Word error rate (WER)**: 0.0158544536679983
- **Character error rate (CER)**: 0.0066563929431140

Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail

District-wise accuracy:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63bdb5add7dea2e13e588fb0/hxvsiMJQR78QO2M9SbibZ.png)

---

## Loading & using the model
```python
# Load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("smji/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("smji/ben2ipa-byt5small")

"""
The format of the input text MUST BE: 
"""
text = " bengali_text_here"
text_ids = tokenizer(text, return_tensors='pt').input_ids

# A seq2seq model needs generate() to produce output; a bare forward pass
# without decoder inputs raises an error.
output_ids = model.generate(text_ids, max_length=1024)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

## Using the pipeline
```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="smji/ben2ipa-byt5small", device=device)

"""
`texts` must be in the format of: 
"""
texts = [" bengali_text_here"]  # placeholder input list
batch_size = 8  # illustrative value; adjust to your hardware
outputs = pipe(texts, max_length=1024, batch_size=batch_size)
```

## Credits
Done by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), [Sahid Hossain Mustakim](https://huggingface.co/rhsm15)