You can test this model online with the Space for Romanian Speech Recognition.

The model ranked first on Romanian Speech Recognition during Hugging Face's Robust Speech Challenge.

Romanian Wav2Vec2

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Romanian subset of the Common Voice 8.0 dataset, with extra training data from the Romanian Speech Synthesis dataset.

Without the 5-gram Language Model optimization, it achieves the following results on the evaluation set (Common Voice 8.0, Romanian subset, test split):

Loss: 0.1553
Wer: 0.1174
Cer: 0.0294
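
For readers unfamiliar with these metrics: WER and CER are edit distances between the reference and the predicted transcript, divided by the reference length in words or characters. The sketch below is a minimal pure-Python illustration, not the evaluation script used for this model. The function names are hypothetical, and counting spaces as characters in the CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three, one substituted character out of twelve.
print(f"{wer('ana are mere', 'ana are pere'):.3f}")  # 0.333
print(f"{cer('ana are mere', 'ana are pere'):.3f}")  # 0.083
```

Read this way, a Wer of 0.1174 means roughly one word-level edit for every eight to nine reference words.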

Model description

The architecture is based on facebook/wav2vec2-xls-r-300m with a speech recognition CTC head and an added 5-gram language model (using pyctcdecode and kenlm) trained on the Romanian Corpora Parliament dataset. Both libraries must be installed for the language model-boosted decoder to work.
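
To make the role of the CTC head more concrete, here is a minimal sketch of the simplest CTC decoding rule: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. This is not the decoder shipped with the model, which uses beam search with the 5-gram language model through pyctcdecode. The toy vocabulary, the `blank_id` value, and the function name are all hypothetical.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "", 1: "a", 2: "n", 3: " "}

# Frames "a a _ n n _ a" decode to "ana".
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 1], toy_vocab))
# A blank between two identical tokens keeps both of them: "aa".
print(ctc_greedy_decode([1, 0, 1], toy_vocab))
```

A language model-boosted decoder replaces the per-frame argmax with a beam search whose candidates are rescored by the n-gram model, which is why the results with the LM differ from those without it.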

Intended uses & limitations

The model is made for speech recognition in Romanian from audio clips sampled at 16kHz. The predicted text is lowercased and does not contain any punctuation.
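
Because the output is lowercased and unpunctuated, reference transcripts usually need the same treatment before they are compared with predictions. The following is a minimal sketch under stated assumptions: it is not the preprocessing used to train this model, the function name is hypothetical, and keeping the hyphen (as in "într-o") is an assumption controlled by the `keep` parameter.

```python
import unicodedata


def normalize_transcript(text, keep="-"):
    """Lowercase the text and drop Unicode punctuation, except characters in `keep`."""
    text = unicodedata.normalize("NFC", text).lower()
    cleaned = "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(cleaned.split())


print(normalize_transcript("Bună ziua, ce mai faceți?"))
print(normalize_transcript("Într-o zi!"))
```

Romanian diacritics are letters rather than punctuation, so they pass through unchanged.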

How to use

Make sure the dependencies of the language model-boosted version are installed. The following command installs the kenlm and pyctcdecode libraries :

pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode

With the transformers framework, you can load the model with the following code :

from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")

model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")

Or, if you want to test the model, you can load the automatic speech recognition pipeline from transformers with :

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")

Example use with the `datasets` library

First, you need to load your data

We will use the Romanian Speech Synthesis dataset in this example.

from datasets import load_dataset

dataset = load_dataset("gigant/romanian_speech_synthesis_0_8_1")

You can listen to the samples with the IPython.display library :

from IPython.display import Audio

i = 0
sample = dataset["train"][i]
Audio(sample["audio"]["array"], rate = sample["audio"]["sampling_rate"])

The model is trained to work with audio sampled at 16kHz, so if the sampling rate of the audio in the dataset is different, we will have to resample it.

In the example, the audio is sampled at 48kHz. We can see this by checking dataset["train"][0]["audio"]["sampling_rate"].

The following code resamples the audio using the torchaudio library :

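import torch
import torchaudio

# Index of the sample used throughout this example.
i = 0
# Reuse the sample loaded above and read its native sampling rate.
audio = sample["audio"]["array"]
rate = sample["audio"]["sampling_rate"]
# Resample from the native rate to the 16kHz the model expects.
resampler = torchaudio.transforms.Resample(rate, 16_000)
audio_16 = resampler(torch.Tensor(audio)).numpy()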
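
As a sanity check on the resampling above, the arithmetic can be worked out by hand: going from 48kHz to 16kHz keeps one sample out of three, so the clip duration is unchanged while the number of samples drops by a factor of three. The sketch below is a hypothetical helper, not part of torchaudio, and the expected length is approximate since the exact value depends on how a given resampler rounds.

```python
from math import gcd


def resample_plan(num_samples, orig_rate, target_rate=16_000):
    """Return (up, down, duration in seconds, approximate output length)."""
    g = gcd(orig_rate, target_rate)
    up, down = target_rate // g, orig_rate // g
    duration_s = num_samples / orig_rate
    # Ceiling division, done in integers to avoid floating-point rounding.
    expected_len = (num_samples * up + down - 1) // down
    return up, down, duration_s, expected_len


# A 2-second clip at 48kHz: ratio 1/3, about 32000 samples at 16kHz.
print(resample_plan(96_000, 48_000))  # (1, 3, 2.0, 32000)
```

Comparing `len(audio_16)` with the last value is a quick way to confirm that the resampled array matches the expected duration.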

To listen to the resampled sample :

Audio(audio_16, rate=16000)

Now you can get the model prediction by running :

# The pipeline returns a dict; the transcription is stored under the "text" key.
predicted_text = asr(audio_16)["text"]
ground_truth = dataset["train"][i]["sentence"]

print(f"Predicted text : {predicted_text}")
print(f"Ground truth : {ground_truth}")

Training and evaluation data

Training data :

Common Voice 8.0 - Romanian subset : train + validation + other splits
Romanian Speech Synthesis : train + test splits

Evaluation data :

Common Voice 8.0 - Romanian subset : test split

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.003
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 3
total_train_batch_size: 48
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 50.0
mixed_precision_training: Native AMP
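
Two of these values interact in ways that are easy to miss. The total batch size is the per-device batch size times the gradient accumulation steps (16 × 3 = 48, assuming a single device), and a linear scheduler with warmup ramps the learning rate from 0 to 0.003 over the first 500 steps before decaying it linearly to 0. The sketch below illustrates that schedule in plain Python; the function name is hypothetical, and `TOTAL_STEPS` is an estimate inferred from the training results table (about 644 steps per epoch over 50 epochs), not a value stated in this card.

```python
def linear_schedule_lr(step, total_steps, base_lr=0.003, warmup_steps=500):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Per-device batch size times accumulation steps, assuming a single device.
effective_batch_size = 16 * 3

# Assumption: estimated from the results table, not reported by the author.
TOTAL_STEPS = 32_200

for step in (0, 250, 500, 16_350, TOTAL_STEPS):
    print(step, f"{linear_schedule_lr(step, TOTAL_STEPS):.6f}")
```

Under this estimate the learning rate peaks at step 500, well before the end of the first epoch, and falls to half its peak around step 16,350.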

Training results

Training Loss	Epoch	Step	Validation Loss	Wer	Cer
2.9272	0.78	500	0.7603	0.7734	0.2355
0.6157	1.55	1000	0.4003	0.4866	0.1247
0.4452	2.33	1500	0.2960	0.3689	0.0910
0.3631	3.11	2000	0.2580	0.3205	0.0796
0.3153	3.88	2500	0.2465	0.2977	0.0747
0.2795	4.66	3000	0.2274	0.2789	0.0694
0.2615	5.43	3500	0.2277	0.2685	0.0675
0.2389	6.21	4000	0.2135	0.2518	0.0627
0.2229	6.99	4500	0.2054	0.2449	0.0614
0.2067	7.76	5000	0.2096	0.2378	0.0597
0.1977	8.54	5500	0.2042	0.2387	0.0600
0.1896	9.32	6000	0.2110	0.2383	0.0595
0.1801	10.09	6500	0.1909	0.2165	0.0548
0.174	10.87	7000	0.1883	0.2206	0.0559
0.1685	11.65	7500	0.1848	0.2097	0.0528
0.1591	12.42	8000	0.1851	0.2039	0.0514
0.1537	13.2	8500	0.1881	0.2065	0.0518
0.1504	13.97	9000	0.1840	0.1972	0.0499
0.145	14.75	9500	0.1845	0.2029	0.0517
0.1417	15.53	10000	0.1884	0.2003	0.0507
0.1364	16.3	10500	0.2010	0.2037	0.0517
0.1331	17.08	11000	0.1838	0.1923	0.0483
0.129	17.86	11500	0.1818	0.1922	0.0489
0.1198	18.63	12000	0.1760	0.1861	0.0465
0.1203	19.41	12500	0.1686	0.1839	0.0465
0.1225	20.19	13000	0.1828	0.1920	0.0479
0.1145	20.96	13500	0.1673	0.1784	0.0446
0.1053	21.74	14000	0.1802	0.1810	0.0456
0.1071	22.51	14500	0.1769	0.1775	0.0444
0.1053	23.29	15000	0.1920	0.1783	0.0457
0.1024	24.07	15500	0.1904	0.1775	0.0446
0.0987	24.84	16000	0.1793	0.1762	0.0446
0.0949	25.62	16500	0.1801	0.1766	0.0443
0.0942	26.4	17000	0.1731	0.1659	0.0423
0.0906	27.17	17500	0.1776	0.1698	0.0424
0.0861	27.95	18000	0.1716	0.1600	0.0406
0.0851	28.73	18500	0.1662	0.1630	0.0410
0.0844	29.5	19000	0.1671	0.1572	0.0393
0.0792	30.28	19500	0.1768	0.1599	0.0407
0.0798	31.06	20000	0.1732	0.1558	0.0394
0.0779	31.83	20500	0.1694	0.1544	0.0388
0.0718	32.61	21000	0.1709	0.1578	0.0399
0.0732	33.38	21500	0.1697	0.1523	0.0391
0.0708	34.16	22000	0.1616	0.1474	0.0375
0.0678	34.94	22500	0.1698	0.1474	0.0375
0.0642	35.71	23000	0.1681	0.1459	0.0369
0.0661	36.49	23500	0.1612	0.1411	0.0357
0.0629	37.27	24000	0.1662	0.1414	0.0355
0.0587	38.04	24500	0.1659	0.1408	0.0351
0.0581	38.82	25000	0.1612	0.1382	0.0352
0.0556	39.6	25500	0.1647	0.1376	0.0345
0.0543	40.37	26000	0.1658	0.1335	0.0337
0.052	41.15	26500	0.1716	0.1369	0.0343
0.0513	41.92	27000	0.1600	0.1317	0.0330
0.0491	42.7	27500	0.1671	0.1311	0.0328
0.0463	43.48	28000	0.1613	0.1289	0.0324
0.0468	44.25	28500	0.1599	0.1260	0.0315
0.0435	45.03	29000	0.1556	0.1232	0.0308
0.043	45.81	29500	0.1588	0.1240	0.0309
0.0421	46.58	30000	0.1567	0.1217	0.0308
0.04	47.36	30500	0.1533	0.1198	0.0302
0.0389	48.14	31000	0.1582	0.1185	0.0297
0.0387	48.91	31500	0.1576	0.1187	0.0297
0.0376	49.69	32000	0.1560	0.1182	0.0295

Framework versions

Transformers 4.16.2
Pytorch 1.10.0+cu111
Tokenizers 0.11.0
pyctcdecode 0.3.0
kenlm

Safetensors

Model size: 315M params
Tensor type: F32

Evaluation results

All results below are self-reported.

Dev WER (without LM) on Robust Speech Event: 46.990
Dev CER (without LM) on Robust Speech Event: 16.040
Dev WER (with LM) on Robust Speech Event: 38.630
Dev CER (with LM) on Robust Speech Event: 14.520
Test WER (without LM) on Common Voice: 11.730
Test CER (without LM) on Common Voice: 2.930
Test WER (with LM) on Common Voice: 7.310
Test CER (with LM) on Common Voice: 2.170
Test WER on Robust Speech Event - Test Data: 43.230
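
The effect of the language model is easier to judge as a relative error reduction, computed as (without LM − with LM) / without LM. The short sketch below works this out from the self-reported figures above; the helper name is hypothetical and nothing here is re-measured.

```python
def relative_reduction(without_lm, with_lm):
    """Fraction of the error removed by adding the language model."""
    return (without_lm - with_lm) / without_lm


# (without LM, with LM) pairs copied from the evaluation results.
results = {
    "Common Voice test WER": (11.73, 7.31),
    "Common Voice test CER": (2.93, 2.17),
    "Robust Speech Event dev WER": (46.99, 38.63),
    "Robust Speech Event dev CER": (16.04, 14.52),
}

for name, (base, boosted) in results.items():
    print(f"{name}: {relative_reduction(base, boosted):.1%} relative reduction")
```

On Common Voice the language model removes roughly 38% of the word errors, against roughly 18% on the Robust Speech Event dev set.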