khanhld committed dde0efe (parent: 75f5473): update readme

Files changed:
- README.md (+104, -59)
- examples/VIVOSDEV02_R005.wav (added)
- examples/common_voice_vi_30519757.mp3 (added)

README.md (updated file):
---
language: vi
datasets:
- vivos
- common_voice
- fpt
- vlsp 100h
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- Transformer
- wav2vec2
- automatic-speech-recognition
license: cc-by-nc-4.0
widget:
- example_title: common_voice example
  src: examples/common_voice_vi_30519757.mp3
- example_title: vivos example
  src: examples/VIVOSDEV02_R005.wav
model-index:
- name: Wav2vec2 Base Vietnamese 160h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice vi
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 10.78
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 15.05
---

# Vietnamese Speech Recognition using Wav2vec 2.0
### Table of contents
1. [Model Description](#description)
2. [Benchmark Result](#benchmark)
3. [Example Usage](#example)
4. [Evaluation](#evaluation)
5. [Contact](#contact)

<a name = "description" ></a>
### Model Description
We fine-tuned the wav2vec2 base model on about 160 hours of Vietnamese speech data collected from several sources: [VIVOS](https://huggingface.co/datasets/vivos), [COMMON VOICE](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0), [FPT](https://data.mendeley.com/datasets/k9sxg2twv4/4), and [VLSP 100h](https://drive.google.com/file/d/1vUSxdORDxk-ePUt-bUVDahpoXiqKchMx/view). We have not yet incorporated a language model into our ASR system (planned as future work; an illustrative sketch of decode-time LM fusion follows the benchmark table below), yet the results are already promising.
<br>
We also provide code for pre-training and fine-tuning the wav2vec2 model (not yet public, but to be released soon). If you wish to train on your own dataset, check it out here:
1. [Pretrain](https://github.com/khanld/ASR-Wav2vec-Pretrain)
2. [Finetune](https://github.com/khanld/ASR-Wa2vec-Finetune)
<br>

<a name = "benchmark" ></a>
### Benchmark WER Result
|            | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM    | in progress | in progress |

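The "with LM" row is still in progress. Purely as an illustration of what decode-time LM fusion could look like (this is not this repo's method), the sketch below attaches an n-gram language model with [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode); the KenLM file `vi_lm.arpa` is hypothetical and is not shipped with this repo:

```python
# Illustrative sketch only: beam-search CTC decoding rescored by an n-gram LM.
# `vi_lm.arpa` is a hypothetical KenLM model; this repo does not provide one yet.
import torch
import librosa
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")

# order labels by token id so they line up with the logit columns, and map
# wav2vec2's word delimiter "|" to a space so the LM sees real word boundaries
sorted_vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [" " if tok == "|" else tok for tok, _ in sorted_vocab]
decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm.arpa")

wav, _ = librosa.load("path/to/your/audio/file", sr=16000)
input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits[0].numpy()
print(decoder.decode(logits))  # beam search scored by the n-gram LM
```
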
<a name = "example" ></a>
### Example Usage
```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the pretrained processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

def transcribe(wav):
    # featurize the raw waveform, then greedy-decode the CTC logits
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

# the model expects 16 kHz mono audio
wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
```

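Continuing from the snippet above (so `librosa` and `transcribe` are already defined), the two clips added under `examples/` in this commit make a quick smoke test; the relative paths assume you run from a local clone of this repo:

```python
# smoke test on the example clips bundled with this repo
for clip in ["examples/common_voice_vi_30519757.mp3", "examples/VIVOSDEV02_R005.wav"]:
    wav, _ = librosa.load(clip, sr=16000)
    print(clip, "->", transcribe(wav))
```
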
<a name = "evaluation"></a>
### Evaluation
```python
import re
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio

wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# load the Common Voice Vietnamese test set and resample it to 16 kHz
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]'  # special characters stripped from references

# preprocess data: keep the raw waveform and a normalized reference transcript
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch

# run inference with greedy CTC decoding
def inference(batch):
    input_values = processor(batch["input_values"],
                             sampling_rate=16000,
                             return_tensors="pt").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch

test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
```
**Test Result**: 10.78%

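For intuition about the number this prints: WER is the word-level edit distance divided by the number of reference words. A toy check with made-up strings, where one of four words loses its diacritics and counts as a substitution:

```python
# toy WER check: one substituted word over a four-word reference -> 0.25
from datasets import load_metric

wer = load_metric("wer")
print(wer.compute(predictions=["tôi yêu viet nam"],
                  references=["tôi yêu việt nam"]))  # 0.25
```
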
<a name = "contact"></a>
### Contact

<br>
[![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/)<br>
[![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)

examples/VIVOSDEV02_R005.wav: added (binary file, 84 kB)
examples/common_voice_vi_30519757.mp3: added (binary file, 27.7 kB)