Spaces: projecte-aina/matxa-alvocat-tts-ca
AlexK-PL committed on Apr 19

Commit 135c7c3 • 1 Parent(s): e46d17d

Update General Model Description (#5)

- Update General Model Description (6b18a933ca2062ba97d70cbf1474ffb07770bf66)

Co-authored-by: Alex Peiró Lilja <[email protected]>

Files changed (1)

about.md CHANGED

@@ -18,13 +18,18 @@ Here you'll be able to find all the information regarding our model, which has b
 ## General Model Description
-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
-On the one hand, the encoder part is based on a text encoder and a phoneme duration prediction. Together, they predict averaged acoustic features.
-On the other hand, the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
-In the latter, by replacing 2D CNNs by 1D CNNs, a large reduction in memory consumption and fast synthesis is achieved.
-**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
-This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
 ## Adaptation to Catalan

 ## General Model Description
+**Matcha-TTS** is a non-autoregressive encoder-decoder model designed for fast acoustic modelling in TTS.
+The encoder processes input phoneme sequences and, together with a phoneme duration predictor, outputs averaged acoustic features.
+The decoder, which is essentially a U-Net backbone based on the Transformer architecture, predicts the refined spectrogram.
+The model is trained with optimal-transport conditional flow matching.
+This yields an ODE-based decoder capable of generating high-quality output in few synthesis steps.
+**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
+Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
+Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.
+The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.
+This version is tailored for the Catalan language, as it was trained only on Catalan speech datasets.
 ## Adaptation to Catalan
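
The updated description says that training with optimal-transport conditional flow matching yields an ODE-based decoder that needs few synthesis steps. A minimal sketch of why that can hold is given here, assuming a toy setting that is not the Space's actual code: a single known target vector stands in for the spectrogram, and a closed-form vector field stands in for the learned U-Net. The function names `conditional_vector_field` and `euler_sample` are hypothetical. Because the optimal-transport conditional path is a straight line from the noise sample to the target, a fixed-step Euler solver follows it exactly whether it takes one step or many.

```python
import random


def conditional_vector_field(x, t, target):
    """Velocity of the straight path x_t = (1 - t) * x0 + t * target.

    Expressed in terms of the current state x, this is (target - x) / (1 - t).
    In a real model a neural network would approximate this field; here it is
    known in closed form because the toy problem has a single target.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(x0, target, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # always < 1, so the division above is safe
        v = conditional_vector_field(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # starting point x0
    target = [0.5, -1.0, 2.0, 0.0]                   # toy stand-in for acoustic features
    for n in (1, 2, 10):
        print(n, euler_sample(noise, target, n))
```

In this toy case every step count lands on the target up to floating-point error. A learned vector field is only approximately straight, so a real decoder would still typically use several steps.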