---
language:
- pl
tags:
- audio
- automatic-speech-recognition
- transformers.js
pipeline_tag: automatic-speech-recognition
license: mit
library_name: transformers
---
# Polish W2v-BERT 2.0 speech encoder

We are open-sourcing our Conformer-based [W2v-BERT 2.0 speech encoder](#w2v-bert-20-speech-encoder) as described in Section 3.2.1 of the [paper](https://arxiv.org/pdf/2312.05187.pdf), which is at the core of our Seamless models.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning before it can be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.

| Model Name   | #params | checkpoint |
| ------------ | ------- | ---------- |
| W2v-BERT 2.0 | 600M    | [checkpoint](https://huggingface.co/reach-vb/conformer-shaw/resolve/main/conformer_shaw.pt) |

**This model and its training are supported by 🤗 Transformers; see the [docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert) for more details.**

# 🤗 Transformers usage

This is a bare checkpoint without any modeling head, and thus requires fine-tuning to be used for downstream tasks such as ASR. You can, however, use it to extract audio embeddings from the top layer with this code snippet:

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
import torch
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

feature_extractor = AutoFeatureExtractor.from_pretrained("Aspik101/distil-whisper-large-v3-pl")
model = Wav2Vec2BertModel.from_pretrained("Aspik101/distil-whisper-large-v3-pl")

# the audio file is decoded on the fly and converted to input features
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
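
The forward pass above returns frame-level features in `outputs.last_hidden_state`, with shape `(batch, num_frames, hidden_size)`. If you need a single vector per utterance, a minimal sketch is to average over the time axis (mean pooling is an illustrative choice, not something prescribed by this model card):

```python
# Average the frame-level features over time to obtain one embedding per utterance.
# Mean pooling is just one simple option; other pooling strategies work as well.
utterance_embeddings = outputs.last_hidden_state.mean(dim=1)  # (batch, hidden_size)
```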

To learn more about using the model, refer to the following resources; a minimal fine-tuning sketch follows the list:
- [its docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert)
- [a blog post showing how to fine-tune it on Mongolian ASR](https://huggingface.co/blog/fine-tune-w2v2-bert)
- [a training script example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)
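
As a rough illustration of what that fine-tuning involves, here is a minimal sketch of attaching a randomly initialised CTC head on top of this checkpoint. The `vocab.json` file and the tokenizer settings are assumptions you would build from your own Polish transcripts; see the blog post above for a complete recipe:

```python
from transformers import Wav2Vec2BertForCTC, Wav2Vec2CTCTokenizer

# Hypothetical character-level vocabulary built from your own training transcripts.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# Load the pre-trained encoder and add an (untrained) CTC head sized to the vocabulary.
model = Wav2Vec2BertForCTC.from_pretrained(
    "Aspik101/distil-whisper-large-v3-pl",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
```

From there, training follows the standard 🤗 `Trainer`-based CTC recipe shown in the linked blog post and example script.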

# Seamless Communication usage

This model can be used in [Seamless Communication](https://github.com/facebookresearch/seamless_communication), where it was released.

Here's how to make a forward pass through the voice encoder, after having completed the [installation steps](https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#installation):

```python
import torch

from fairseq2.data import Collater
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


audio_wav_path, device, dtype = ...
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

# read the raw wav file and decode it into a waveform
with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)

# convert the waveform to 80-dim filterbank features and batch them
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# run the encoder frontend followed by the Conformer encoder stack
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
```
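
After the last line, `seqs` holds the encoder's frame-level representations (one vector per feature frame), and `padding_mask` marks which frames are valid when sequences in a batch have different lengths. These are the embeddings you would feed into a downstream head or a pooling step, as in the 🤗 Transformers example above.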