A newer version of this model is available: AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19

General-purpose Latvian ASR model

This is a fine-tuned whisper-large-v3 model for Latvian, trained by AiLab.lv on two general-purpose speech datasets: the Latvian part of Common Voice 17.0 and LATE-Media, a Latvian broadcast dataset.
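As a rough illustration of how a fine-tuned Whisper checkpoint like this one is typically used, the sketch below loads it through the Transformers `pipeline` API. The repository id is the one shown on this card; the helper names `build_transcriber` and `transcribe`, the 30-second chunking, and the input file name are assumptions made for the example, not part of the authors' setup.

```python
"""Minimal transcription sketch using the Transformers ASR pipeline."""

# Repository id as it appears on this model card.
MODEL_ID = "AiLab-IMCS-UL/whisper-large-v3-lv-late-cv17"


def build_transcriber(device: str = "cpu"):
    """Create an ASR pipeline for the fine-tuned Latvian model.

    The import is deferred so this module can be loaded without
    Transformers installed; the weights are downloaded on first use.
    """
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        device=device,
        chunk_length_s=30,  # assumption: process long audio in 30 s windows
    )


def transcribe(audio_path: str, device: str = "cpu") -> str:
    """Transcribe one audio file and return the recognized text."""
    asr = build_transcriber(device)
    # Pin the language and task so Whisper does not try to auto-detect them.
    result = asr(
        audio_path,
        generate_kwargs={"language": "latvian", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample.wav")` with a hypothetical local recording would return the transcript as a single string; passing `device="cuda:0"` should move inference to a GPU if one is available.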

We also provide 4-bit, 5-bit, and 8-bit quantized versions of the model in the GGML format for use with whisper.cpp, as well as an 8-bit quantized version for use with CTranslate2.
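One possible way to run the 8-bit CTranslate2 version is through the faster-whisper package, which wraps CTranslate2 for Whisper models. This is only a sketch: the card does not say which front end the authors use, `CT2_MODEL_DIR` is a placeholder for wherever the converted model has been downloaded, and `format_segments` and `transcribe_ct2` are hypothetical helper names.

```python
from typing import Iterable, List

# Placeholder: local directory holding the CTranslate2-converted model.
# The actual location is not given in this card.
CT2_MODEL_DIR = "path/to/ct2-model"


def format_segments(segments: Iterable) -> List[str]:
    """Render segments (objects with start, end, text) as timestamped lines."""
    return [f"[{s.start:.2f} -> {s.end:.2f}] {s.text.strip()}" for s in segments]


def transcribe_ct2(audio_path: str, model_dir: str = CT2_MODEL_DIR) -> List[str]:
    """Transcribe with faster-whisper using 8-bit weights on the CPU."""
    # Deferred import: faster-whisper is an optional dependency here.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    # transcribe() returns a lazy segment generator plus run metadata.
    segments, _info = model.transcribe(audio_path, language="lv")
    return format_segments(segments)
```

The segment generator is consumed inside `format_segments`, so decoding actually happens there rather than in the `transcribe` call itself.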

NB! This model is superseded by a newer version: whisper-large-v3-lv-late-cv19

Training

Fine-tuning was done using the Hugging Face Transformers library with a modified seq2seq script.

| Training data | Hours |
| --- | --- |
| Latvian Common Voice 17.0 train set (the V1 split) | 167 |
| LATE-Media 1.0 train set | 42 |
| Total | 209 |

Evaluation

| Testing data | WER (%) | CER (%) |
| --- | --- | --- |
| Latvian Common Voice 17.0 test set (V1) - formatted | 5.0 | 1.6 |
| Latvian Common Voice 17.0 test set (V1) - normalized | 3.4 | 1.0 |
| LATE-Media 1.0 test set - formatted | 20.8 | 8.2 |
| LATE-Media 1.0 test set - normalized | 14.1 | 5.9 |

The Latvian CV 17.0 test set is available here. The LATE-Media 1.0 test set is available here.
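For readers unfamiliar with the metrics, WER and CER are edit distances between reference and hypothesis, counted over words and characters respectively and divided by the reference length. The sketch below shows that computation, along with one plausible reading of the "formatted" versus "normalized" distinction. The `normalize` function (lowercasing and stripping punctuation) is an assumption for illustration; the card does not specify the normalization used for the reported scores.

```python
import re
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate as a fraction; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Illustrative normalization only: lowercase, drop punctuation."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())
```

Scoring the raw strings corresponds to the stricter "formatted" setting, where casing and punctuation errors count; scoring `normalize(ref)` against `normalize(hyp)` ignores them, which is why normalized scores are lower. Multiply the returned fractions by 100 to compare with percentage figures.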

Citation

Please cite this paper if you use this model in your research:

@inproceedings{dargis-etal-2024-balsutalka-lv,
  author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
  title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
  booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  publisher = {ELRA and ICCL},
  year = {2024},
  pages = {2080--2085},
  url = {https://aclanthology.org/2024.lrec-main.187}
}

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project LATE (VPP-LETONIKA-2021/1-0006). We are grateful to all the participants of the national initiative BalsuTalka.lv for helping to make the Latvian Common Voice dataset much larger and more diverse.

Model size: 1.61B params (Safetensors, tensor type FP16)