---
inference: false
tags:
- SeamlessM4T
- seamless_m4t
license: cc-by-nc-4.0
library_name: transformers
---

# SeamlessM4T Large

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T Large covers:
- 📥 101 languages for speech input
- ⌨️ [96 languages](https://huggingface.co/ylacombe/hf-seamless-m4t-large/blob/main/generation_config.json#L48-L145) for text input/output
- 🗣️ [35 languages](https://huggingface.co/ylacombe/hf-seamless-m4t-large/blob/main/generation_config.json#L149-L184) for speech output. 

This is the "large" variant of the unified model, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

You can perform all of the above tasks with a single model, [`SeamlessM4TModel`](https://huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), but each task also has its own dedicated sub-model.


## 🤗 Usage

First, load the processor and a checkpoint of the model:

```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-large")
```
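If you have a GPU, you can move the model onto it to speed up inference; any `inputs` created below then need to be moved to the same device (a minimal sketch):

```python
import torch

# Optional: run inference on GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Later calls then need inputs on the same device, e.g. `inputs.to(device)`.
```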

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

### Speech

You can easily generate translated speech with [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate). Here is an example that generates Russian speech from English text.

```python
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```

You can also translate directly from a speech waveform. Here is an example from Arabic to English:

```python
from datasets import load_dataset, Audio

dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
# The model expects 16 kHz audio, so resample the corpus on the fly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
audio_sample = dataset["audio"][0]["array"]
inputs = processor(audios=audio_sample, return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="eng")
audio_array = audio_array[0].cpu().numpy().squeeze()
```

You can listen to the speech samples either in a Jupyter notebook:

```python
from IPython.display import Audio

sampling_rate = model.config.sampling_rate
Audio(audio_array, rate=sampling_rate)
```

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

```python
import scipy.io.wavfile

sampling_rate = model.config.sampling_rate
scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sampling_rate, data=audio_array)
```
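Alternatively, the `soundfile` library writes float arrays directly (a sketch, assuming `soundfile` is installed):

```python
import soundfile as sf

# soundfile infers the format from the file extension and accepts float32 data.
sf.write("seamless_m4t_out.wav", audio_array, sampling_rate)
```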

#### Tips

[`SeamlessM4TModel`](https://huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) is the Transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the previous snippet with the model dedicated to the S2ST task:

```python
from transformers import SeamlessM4TForSpeechToSpeech
model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-large")
```
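The call pattern is otherwise unchanged; reusing `inputs` from the Arabic example above:

```python
# Same generate() call as before, now on the lighter S2ST-only model.
audio_array = model.generate(**inputs, tgt_lang="eng")
audio_array = audio_array[0].cpu().numpy().squeeze()
```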


### Text

Similarly, you can generate translated text from text or from audio. This time, let's use the dedicated models as an example.

```python
from transformers import SeamlessM4TForSpeechToText
model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-large")

audio_sample = dataset["audio"][0]["array"]  # reuses the dataset loaded above
inputs = processor(audios=audio_sample, return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```
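`generate` returns one sequence of token ids per input, so if you batch several audio samples you can decode them all at once with `processor.batch_decode`:

```python
# Decode every sequence in the batch in a single call.
translated_texts = processor.batch_decode(output_tokens, skip_special_tokens=True)
```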

And from text:

```python
from transformers import SeamlessM4TForTextToText
model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-large")
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```
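The text models also accept batched inputs; a sketch with hypothetical sentences, using padding so the sequences fit in one tensor:

```python
sentences = ["Hello, my dog is cute", "How are you today?"]  # hypothetical inputs
inputs = processor(text=sentences, src_lang="eng", padding=True, return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_texts = processor.batch_decode(output_tokens, skip_special_tokens=True)
```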

#### Tips

Three final tips:

1. [`SeamlessM4TModel`](https://huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/pr_25693/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate) to generate text only. You can also pass `return_intermediate_token_ids=True` to get both the text token ids and the generated speech.
2. You can change the speaker used for speech synthesis with the `spkr_id` argument.
3. You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which successively performs beam-search decoding on the text model and multinomial sampling on the speech model, as sketched below.
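Putting these together, a minimal sketch that reuses the `processor` and the `SeamlessM4TModel` checkpoint loaded earlier (the speaker index is an arbitrary choice):

```python
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

# Tip 1 + tip 3: text-only generation, with beam search on the text decoder.
output_tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False, text_num_beams=4)

# Tip 2 + tip 3: a different speaker, with sampling on the speech model.
audio_array = model.generate(**inputs, tgt_lang="fra", spkr_id=2, speech_do_sample=True)  # spkr_id=2 is arbitrary
audio_array = audio_array[0].cpu().numpy().squeeze()
```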