wetdog martillopartbsc committed on
Commit
1c00c50
1 Parent(s): c5e2c91

Alex changes (#9)

- Alex changes (0f00fe24acfdc5543db161d6c2e048e876ce3ca3)


Co-authored-by: Martí Llopart Font <[email protected]>

Files changed (1)
  1. about.md +11 -7
about.md CHANGED
@@ -207,13 +207,17 @@ Together, these technologies form a comprehensive TTS solution tailored to the n
 
  ## The model in detail
 
- **Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
- On the one hand, the encoder part is based on a text encoder and a phoneme duration prediction. Together, they predict averaged acoustic features.
- On the other hand, the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
- In the latter, by replacing 2D CNNs by 1D CNNs, a large reduction in memory consumption and fast synthesis is achieved.
-
- **Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
- This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
+ **Matcha-TTS** is a non-autoregressive encoder-decoder model designed for fast acoustic modelling in TTS.
+ The encoder processes input sequences of phonemes and, together with a phoneme duration predictor, outputs averaged acoustic features. The decoder,
+ which is essentially a U-Net backbone based on the Transformer architecture, predicts the refined spectrogram.
+ The model is trained with optimal-transport conditional flow matching (OT-CFM).
+ This yields an ODE-based decoder capable of generating high-quality output in fewer synthesis steps than models trained using score matching.
+
+ **Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
+ Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
+ Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through the inverse Fourier transform.
+ The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.
+ This version is tailored for the Catalan language, as it was trained only on Catalan speech datasets.
 
  ## Adaptation to Catalan
 
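
Reading aid for the OT-CFM sentences in the changed text: the sketch below is a toy illustration of why straight optimal-transport paths can need few ODE steps. It is not Matcha-TTS code. Plain Python lists stand in for spectrogram frames, the exact conditional velocity stands in for the trained decoder, and `SIGMA_MIN`, `ot_cfm_point`, `ot_cfm_velocity`, `euler_decode`, `noise` and `target` are all hypothetical names and values chosen for illustration.

```python
# Toy illustration of optimal-transport conditional flow matching (OT-CFM).
# Assumption: plain floats stand in for acoustic features, and the exact
# conditional velocity stands in for the trained U-Net decoder.

SIGMA_MIN = 1e-4  # assumed small noise floor; illustrative value only


def ot_cfm_point(x0, x1, t):
    """Point at time t on the straight path from noise x0 to data x1."""
    return [(1 - (1 - SIGMA_MIN) * t) * a + t * b for a, b in zip(x0, x1)]


def ot_cfm_velocity(x0, x1):
    """Velocity of that path; it does not depend on t (a straight line)."""
    return [b - (1 - SIGMA_MIN) * a for a, b in zip(x0, x1)]


def euler_decode(x0, velocity_fn, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.2, 0.3]   # hypothetical sample from the prior
target = [1.0, 2.0, -0.5]  # hypothetical "clean" acoustic features


def field(x, t):
    # Stand-in for the decoder: returns the exact conditional velocity.
    return ot_cfm_velocity(noise, target)


few = euler_decode(noise, field, n_steps=2)
many = euler_decode(noise, field, n_steps=50)
```

Because the velocity along a straight path is constant, `few` and `many` end at the same point up to floating-point error. A trained decoder only approximates such a field, so real synthesis quality still varies with the number of steps.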