---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
pipeline_tag: feature-extraction
inference: false
tags:
- wav2vec2
- speech
---

# `rinna/japanese-wav2vec2-base`

![rinna-icon](./rinna.png)

# Overview

This is a Japanese wav2vec 2.0 Base model trained by [rinna Co., Ltd.](https://rinna.co.jp/)

* **Model summary**

The model architecture is the same as the [original wav2vec 2.0 Base model](https://huggingface.co/facebook/wav2vec2-base), which contains 12 transformer layers, each with 12 attention heads.

The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec). The detailed training configuration can be found in that repository and in the [original paper](https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html).

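As a rough illustration of what sharing the Base architecture implies, the sketch below computes how many output frames the model would produce for a given number of input samples. It assumes the convolutional feature encoder of the original wav2vec 2.0 Base configuration (seven layers with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), which this card does not state explicitly. `CONV_LAYERS`, `num_frames`, and `total_stride` are illustrative names, not part of any library.

```python
# Assumed (kernel, stride) pairs of the wav2vec 2.0 Base feature encoder.
# These values come from the original Base configuration, not from this card.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1,
        # clamped at zero for inputs shorter than the receptive field.
        length = max(0, (length - kernel) // stride + 1)
    return length


def total_stride() -> int:
    """Hop between consecutive frames, in input samples."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride
    return hop


print(num_frames(16000))  # frames for one second of 16 kHz audio
print(total_stride())     # samples between consecutive frames
```

Under these assumptions, one second of 16 kHz audio maps to 49 frames, and consecutive frames are 320 samples (20 ms) apart.
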
* **Training**

The model was trained on approximately 19,000 hours of Japanese speech from the ReazonSpeech v1 corpus.

- [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Contributors**

- [Yukiya Hono](https://huggingface.co/yky-h)
- [Kentaro Mitsui](https://huggingface.co/Kentaro321)
- [Kei Sawada](https://huggingface.co/keisawada)

---

# How to use the model

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Placeholder path: replace with your own 16 kHz mono audio file.
audio_file = "path/to/audio.wav"
raw_speech_16kHz, sr = sf.read(audio_file)
inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)

# Gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(**inputs)

print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
```
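
The example above names its input `raw_speech_16kHz`, so it seems reasonable to check a file before passing it to the feature extractor. The following is a minimal sketch using only the standard-library `wave` module. It assumes uncompressed PCM WAV input and a single-channel, 16 kHz target; `EXPECTED_RATE` and `find_input_problems` are hypothetical names introduced here for illustration.

```python
import wave
from typing import List

# Assumption: the model expects 16 kHz audio, as the variable name
# `raw_speech_16kHz` in the usage example suggests.
EXPECTED_RATE = 16000


def find_input_problems(path: str) -> List[str]:
    """Return human-readable problems found in a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sampling rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != 1:
        problems.append(f"file has {channels} channels, expected mono")
    return problems
```

An empty list means the file passed both checks. Otherwise, the audio would need to be resampled or downmixed with a tool of your choice before feature extraction.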

A fairseq checkpoint file is also available [here](https://huggingface.co/rinna/japanese-wav2vec2-base/tree/main/fairseq).

---

# How to cite

```bibtex
@misc{rinna-japanese-wav2vec2-base,
    title = {rinna/japanese-wav2vec2-base},
    author = {Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-wav2vec2-base}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```

---

# References

```bibtex
@inproceedings{baevski2020wav2vec,
    title = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
    author = {Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
    booktitle = {Advances in Neural Information Processing Systems},
    year = {2020},
    volume = {33},
    pages = {12449--12460},
    url = {https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html}
}
```

---

# License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)