---
license: mit
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: "<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"
  example_title: "Narail Text"
- text: "<Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।"
  example_title: "Rangpur Text"
- text: "<Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!"
  example_title: "Chittagong Text"
- text: "<Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।"
  example_title: "Kishoreganj Text"
- text: "<Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।"
  example_title: "Narsingdi Text"
- text: "<Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?"
  example_title: "Tangail Text"
---


# Regional Bengali text to IPA transcription - ByT5-small

## A word of caution: the model is constantly being updated! You may see jumps in performance!

This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) that generates IPA transcriptions from regional Bengali text.
It was trained on the dataset of the competition [“ভাষামূল: মুখের ভাষার খোঁজে“](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

Scores achieved so far (on the test set):
- **Word error rate (wer)**: 0.0158544536679983
- **Char error rate (cer)**: 0.0066563929431140
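
For readers unfamiliar with these metrics, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and prediction, divided by the reference length in words or characters. The helper names `edit_distance`, `wer`, and `cer` are illustrative, and the competition's exact scoring code may differ (for example in text normalization), so treat this as an assumption rather than the official metric.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference string."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Both functions return a fraction of the reference length, so lower is better and 0 means an exact match.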

Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
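
Because the model expects every input to begin with one of these district tokens, it can help to build the prompt through a small helper. The sketch below is only an illustration: `SUPPORTED_DISTRICTS` and `format_input` are hypothetical names that are not part of the model or of any library, and rejecting unknown districts is a design assumption.

```python
# Hypothetical helper: the names below are illustrative only.
SUPPORTED_DISTRICTS = {
    "Kishoreganj", "Narail", "Narsingdi",
    "Chittagong", "Rangpur", "Tangail",
}


def format_input(district, text):
    """Build a model input of the form '<district> <bengali_text>'."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; "
            f"expected one of {sorted(SUPPORTED_DISTRICTS)}"
        )
    return f"<{district}> {text.strip()}"
```

For example, `format_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।")` yields a string in the same format as the widget examples of this model card.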

District-wise accuracy:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63bdb5add7dea2e13e588fb0/hxvsiMJQR78QO2M9SbibZ.png)

---

## Loading & using the model
```python
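# Load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("smji/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("smji/ben2ipa-byt5small")

# The input text MUST follow the format: <district> <bengali_text>
text = "<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# A seq2seq model needs generate() to produce output tokens; a bare
# forward pass has no decoder inputs and would raise an error.
output_ids = model.generate(text_ids, max_length=1024)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)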
```


## Using the pipeline
```python
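# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline("text2text-generation", model="smji/ben2ipa-byt5small", device=device)

# Each entry of `texts` must follow the format: <district> <contents>
texts = [
    "<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।",
    "<Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?",
]
batch_size = 8  # example value; tune to your hardware

outputs = pipe(texts, max_length=1024, batch_size=batch_size)
# Each result is a dict with a "generated_text" key
ipa_transcriptions = [out["generated_text"] for out in outputs]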
```

## Credits
Developed by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).