---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
  example_title: Narail Text
- text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
  example_title: Rangpur Text
- text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
  example_title: Chittagong Text
- text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
  example_title: Kishoreganj Text
- text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
  example_title: Narsingdi Text
- text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
  example_title: Tangail Text
---
# Regional Bengali text to IPA transcription - ByT5-small
This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) for generating IPA transcriptions from regional Bengali text.
It was fine-tuned on the dataset of the Bengali.AI competition [“ভাষামূল: মুখের ভাষার খোঁজে”](https://www.kaggle.com/competitions/regipa/overview).
Model performance (lower is better):
- **Word error rate (WER)**: 0.0124279344454407
- **Character error rate (CER)**: 0.00427635805681347
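
For readers unfamiliar with these metrics, here is a minimal sketch of the standard definitions: WER and CER are the Levenshtein edit distance between reference and prediction, computed over words and characters respectively, divided by the reference length. This card does not state which implementation produced the figures above, so the helper names `edit_distance`, `wer`, and `cer` are illustrative assumptions and not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```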
Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
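
The model expects each input to start with one of the district tokens above in angle brackets, as in the widget examples (`<Narail> ...`). A small helper along the following lines could build and validate that prefix. The name `format_input` and the constant `SUPPORTED_DISTRICTS` are hypothetical and not part of the model or the `transformers` API; the function only assembles the `<district> <bengali_text>` string, which can then be passed to the tokenizer or pipeline.

```python
# District tokens listed on this card (assumed to be case-sensitive).
SUPPORTED_DISTRICTS = {
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
}


def format_input(district, text):
    """Return the model input string '<district> text' for a supported district."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; "
            f"expected one of {sorted(SUPPORTED_DISTRICTS)}"
        )
    return f"<{district}> {text.strip()}"


prompt = format_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।")
```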
---
## Loading & using the model
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-byt5small")
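# The input text MUST have the format: <district> <bengali_text>
text = "<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"
text_ids = tokenizer(text, return_tensors="pt").input_ids
# A seq2seq model needs decoder inputs for a plain forward pass,
# so use generate() to produce the IPA transcription.
output_ids = model.generate(text_ids, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))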
```
## Using the pipeline
```python
# Use a pipeline as a high-level helper
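import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-byt5small", device=device)

# Each entry of `texts` must have the format: <district> <contents>
texts = ["<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"]
batch_size = 8  # example value; adjust to the available memory
outputs = pipe(texts, max_length=1024, batch_size=batch_size)
print(outputs[0]["generated_text"])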
```
## Credits
Done by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).