---
language: ja
datasets:
- common_voice
metrics:
- wer
- cer
model-index:
- name: >-
    wav2vec2-xls-r-300m finetuned on Japanese Hiragana with no word boundaries
    by Hyungshin Ryu of SLPlab
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Japanese
      type: common_voice
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 90.66
    - name: Test CER
      type: cer
      value: 19.35
---
# Wav2Vec2-XLS-R-300M-Japanese-Hiragana
Fine-tuned [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Japanese hiragana characters using Common Voice and JSUT. The transcription outputs do not contain word boundaries. Audio input should be sampled at 16 kHz.
## Usage
The model can be used directly as follows:
```python
!pip install mecab-python3
!pip install unidic-lite
!pip install pykakasi

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import pykakasi
import MeCab
import re

# load dataset, metrics, processor, and model
test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")
cer = load_metric("cer")

PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
print("PTM:", PTM)
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
device = "cuda"
model.to(device)

# preprocess dataset: strip punctuation, convert the reference text to hiragana,
# and resample the audio to 16 kHz
wakati = MeCab.Tagger("-Owakati")
kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"

def speech_file_to_array_fn_hiragana_nospace(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
    batch["sentence"] = ''.join([d['hira'] for d in kakasi.convert(batch["sentence"])])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    batch["speech"] = resampler(speech_array).squeeze()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn_hiragana_nospace)

# evaluate: run CTC inference and greedy-decode the predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# show a few predictions against the references
for i in range(10):
    print("=" * 20)
    print("Prd:", result[i]["pred_strings"])
    print("Ref:", result[i]["sentence"])

print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:.2f}%".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
| Original Text | Prediction |
| --- | --- |
| この料理は家庭で作れます。 | このりょうりはかていでつくれます |
| 日本人は、決して、ユーモアと無縁な人種ではなかった。 | にっぽんじんはけしてゆうもあどむえんなじんしゅではなかった |
| 木村さんに電話を貸してもらいました。 | きむらさんにでんわおかしてもらいました |
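Because the model emits hiragana with no word boundaries, the MeCab tagger constructed in the usage script (`wakati`, otherwise unused there) could in principle re-insert spaces into the predictions. This is only a rough sketch; segmentation quality on kana-only text with the unidic-lite dictionary may be limited.

```python
import MeCab

wakati = MeCab.Tagger("-Owakati")

# re-segment a boundary-free hiragana prediction into space-separated tokens;
# accuracy on kana-only input is approximate
pred = "このりょうりはかていでつくれます"
print(wakati.parse(pred).strip())
```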
## Test Results

WER: 90.66%, CER: 19.35%
## Training

Trained on JSUT and the train+validation splits of Common Voice Japanese. Tested on the test split of Common Voice Japanese.
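As a rough illustration, the Common Voice portion of this split could be assembled with the `datasets` library as sketched below; JSUT is distributed separately by its authors, so its loading from local files is left as a placeholder rather than shown.

```python
from datasets import load_dataset, concatenate_datasets

# Common Voice Japanese train + validation splits used for fine-tuning
cv_train = load_dataset("common_voice", "ja", split="train+validation")

# JSUT must be downloaded and loaded from local files (placeholder):
# jsut = ...
# train_data = concatenate_datasets([cv_train, jsut])

# held-out evaluation split
cv_test = load_dataset("common_voice", "ja", split="test")
```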