# Wav2Vec2-XLS-R-300M-Japanese-Hiragana
Fine-tuned `facebook/wav2vec2-xls-r-300m` on Japanese hiragana characters using Common Voice and JSUT. The output transcriptions are hiragana strings without word boundaries. Audio input should be sampled at 16 kHz.
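If word boundaries are needed downstream, one option is to re-segment the decoded hiragana with MeCab's wakati mode. This is a rough sketch, not part of the model itself; segmentation of all-hiragana text with a general-purpose dictionary is approximate.

```python
import MeCab  # requires mecab-python3 and unidic-lite

# Re-insert spaces into a boundary-free hiragana transcription.
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("このりょうりはかていでつくれます").strip())
```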
## Usage
The model can be used directly as follows:
```python
!pip install mecab-python3
!pip install unidic-lite
!pip install pykakasi
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import pykakasi
import MeCab
import re
# load datasets, processor, and model
test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")
cer = load_metric("cer")
PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
print("PTM:", PTM)
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
device = "cuda"
model.to(device)
# preprocess datasets
wakati = MeCab.Tagger("-Owakati")
kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"
def speech_file_to_array_fn_hiragana_nospace(batch):
    # remove punctuation and convert the reference sentence to boundary-free hiragana
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
    batch["sentence"] = ''.join([d['hira'] for d in kakasi.convert(batch["sentence"])])
    # load the audio and resample to 16 kHz
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    batch["speech"] = resampler(speech_array).squeeze()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn_hiragana_nospace)
#evaluate
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
for i in range(10):
    print("="*20)
    print("Prd:", result[i]["pred_strings"])
    print("Ref:", result[i]["sentence"])
print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:.2f}%".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
| Original Text | Prediction |
|---|---|
| この料理は家庭で作れます。 | このりょうりはかていでつくれます |
| 日本人は、決して、ユーモアと無縁な人種ではなかった。 | にっぽんじんはけしてゆうもあどむえんなじんしゅではなかった |
| 木村さんに電話を貸してもらいました。 | きむらさんにでんわおかしてもらいました |
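To transcribe a single local recording instead of the Common Voice test split, a minimal sketch is shown below; the path `sample.wav` is a placeholder for your own mono recording.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
model.eval()

# "sample.wav" is a placeholder; use any mono recording.
speech_array, sampling_rate = torchaudio.load("sample.wav")
speech = torchaudio.transforms.Resample(sampling_rate, 16000)(speech_array).squeeze()

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])  # boundary-free hiragana transcription
```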
## Test Results
WER: 90.66%, CER: 19.35%. Because the transcripts contain no word boundaries, each sentence is scored as a single word, so the WER effectively reflects sentence-level errors; CER is the more informative metric here.
## Training
Trained on JSUT and the train+validation splits of Common Voice Japanese; tested on the Common Voice Japanese test split.
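The reference text is boundary-free hiragana, produced the same way as in the preprocessing above. A standalone sketch of that conversion follows; the punctuation set mirrors the Usage example and is illustrative, not the exact training recipe.

```python
import re
import pykakasi

kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"  # same punctuation set as the Usage example

def to_hiragana_nospace(sentence):
    # strip punctuation, then convert kanji/katakana to hiragana with pykakasi
    sentence = re.sub(chars_to_ignore_regex, "", sentence).strip()
    return "".join(d["hira"] for d in kakasi.convert(sentence))

print(to_hiragana_nospace("この料理は家庭で作れます。"))  # このりょうりはかていでつくれます
```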