---
license: mit
tags:
- speech
- text
- cross-modal
- unified model
- self-supervised learning
- SpeechT5
- Text-to-Speech
datasets:
- LibriTTS
pipeline_tag: text-to-speech
---
## SpeechT5 TTS Manifest
| [**Github**](https://github.com/microsoft/SpeechT5) | [**Huggingface**](https://huggingface.co/mechanicalsea/speecht5-tts) |
This manifest is an attempt to recreate the Text-to-Speech recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from the [LibriTTS](http://www.openslr.org/60/) clean subsets: train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation. The test-clean-200 subset contains the IDs of 200 utterances used for the mean opinion score (MOS) and comparative mean opinion score (CMOS) evaluations.
### Requirements
- [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings
- [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for the vocoder
### Tools
- `manifest/utils` is used to downsample waveform, extract speaker embedding, generate manifest, and apply vocoder.
- `pretrained_vocoder` provides the pre-trained vocoder.
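The downsampling step in `manifest/utils` can be illustrated with a short sketch. LibriTTS audio is released at 24 kHz, while the 16 kHz target rate and the use of `scipy` polyphase resampling here are assumptions made for this example, not the project's actual implementation.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def downsample(wav: np.ndarray, orig_sr: int = 24000, target_sr: int = 16000) -> np.ndarray:
    """Resample a mono waveform with polyphase filtering.

    For 24 kHz -> 16 kHz this reduces to upsampling by 2 and
    downsampling by 3 after dividing out the common factor.
    """
    g = gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# One second of 24 kHz audio becomes one second of 16 kHz audio.
one_second = np.zeros(24000, dtype=np.float32)
print(downsample(one_second).shape)  # (16000,)
```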
### Model and Samples
- [`speecht5_tts.pt`](./speecht5_tts.pt) is a reimplementation of the Text-to-Speech fine-tuning on the released manifest, **but with a smaller batch size or fewer max updates** (to verify that the manifest is correct).
- `samples` are created by the released fine-tuned model and vocoder.
### Reference
If you find our work useful in your research, please cite the following paper:
```bibtex
@inproceedings{ao-etal-2022-speecht5,
  title     = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
  author    = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {May},
  year      = {2022},
  pages     = {5723--5738},
}
```