anhnct/audioldm2_gigaspeech

Diffusers

AudioLDM2Pipeline

anhnct committed on Oct 17, 2023

Commit 21f37ca • 1 Parent(s): df84f99

Create README.md

README.md +100 -0

README.md ADDED Viewed

	@@ -0,0 +1,100 @@

+---
+license: creativeml-openrail-m
+---
+---
+license: cc-by-nc-nd-4.0
+---
+# AudioLDM 2
+AudioLDM 2 is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input.
+It is available in the 🧨 Diffusers library from v0.21.0 onwards.
+# Model Details
+AudioLDM 2 was proposed in the paper [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al.
+AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects,
+human speech and music.
+# Checkpoint Details
+This is the **audioldm2-gigaspeech** checkpoint, a text-to-speech version of the AudioLDM 2 model.
+There are three official AudioLDM 2 checkpoints. Two of them are applicable to the general task of text-to-audio
+generation, and the third is trained exclusively on text-to-music generation. Two further checkpoints, including this one,
+target text-to-speech. All checkpoints share the same model size for the text encoders and VAE. The official checkpoints
+differ in the size and depth of the UNet. See the table below for details on all five checkpoints:
+| Checkpoint                                                                 | Task           | UNet Model Size | Total Model Size | Training Data (hours) |
+|----------------------------------------------------------------------------|----------------|-----------------|------------------|-----------------------|
+| [audioldm2](https://huggingface.co/cvssp/audioldm2)                        | Text-to-audio  | 350M            | 1.1B             | 1150k                 |
+| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large)            | Text-to-audio  | 750M            | 1.5B             | 1150k                 |
+| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music)            | Text-to-music  | 350M            | 1.1B             | 665k                  |
+| [audioldm2-gigaspeech](https://huggingface.co/anhnct/audioldm2_gigaspeech) | Text-to-speech | 350M            | 1.1B             | 10k                   |
+| [audioldm2-ljspeech](https://huggingface.co/anhnct/audioldm2_ljspeech)     | Text-to-speech | 350M            | 1.1B             |                       |
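As a quick arithmetic check on the table, subtracting the UNet size from the total size should give the same remainder for every checkpoint if the non-UNet components are shared. The sketch below uses the rounded figures from the table, expressed in millions of parameters; the dictionary and variable names are illustrative only, and the remainder covers everything outside the UNet, not just the text encoders and VAE.

```python
# (UNet parameters, total parameters) in millions, copied from the table.
# The values are rounded, so the result is approximate.
checkpoints = {
    "audioldm2": (350, 1100),
    "audioldm2-large": (750, 1500),
    "audioldm2-music": (350, 1100),
    "audioldm2-gigaspeech": (350, 1100),
    "audioldm2-ljspeech": (350, 1100),
}

# Parameters outside the UNet = total size minus UNet size.
non_unet = {name: total - unet for name, (unet, total) in checkpoints.items()}

for name, millions in non_unet.items():
    print(f"{name}: ~{millions}M parameters outside the UNet")
```

Every row comes out at roughly 750M, which is consistent with the statement that only the UNet changes between checkpoints.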
+## Model Sources
+- [**Original Repository**](https://github.com/haoheliu/audioldm2)
+- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2)
+- [**Paper**](https://arxiv.org/abs/2308.05734)
+- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
+# Usage
+First, install the required packages:
+```
+pip install --upgrade diffusers transformers accelerate
+```
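Since the README gives v0.21.0 as the first Diffusers release with AudioLDM 2, it can help to confirm the installed version after upgrading. The following is a minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, and the parsing deliberately ignores pre-release suffixes such as `.dev0`. It only checks the minimum the README mentions; whether the text-to-speech arguments need a newer release is not stated there.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.21.0' or '0.22.0.dev0' into a comparable tuple such as (0, 21, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '1.0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum="0.21.0"):
    """Return True if `installed` is at least `minimum` (suffixes ignored)."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("diffusers")
    status = "ok" if meets_minimum(installed) else "too old for AudioLDM 2"
    print(f"diffusers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("diffusers is not installed")
```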
+## Text-to-Speech
+For text-to-speech generation, the [AudioLDM2Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2) can be
+used to load pre-trained weights and generate text-conditional audio outputs:
+```python
+import scipy.io.wavfile
+import torch
+from diffusers import AudioLDM2Pipeline
+
+# load the text-to-speech checkpoint in half precision and move it to the GPU
+repo_id = "anhnct/audioldm2_gigaspeech"
+pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+# define the prompts: `prompt` describes the voice, `transcript` is the text to be spoken
+prompt = "A female actor speaking with an angry voice"
+transcript = "wish you have a good day, i hope you never forget me"
+negative_prompt = "low quality"
+
+# set the seed for generator
+generator = torch.Generator("cuda").manual_seed(1)
+
+# run the generation
+audio = pipe(
+    prompt,
+    negative_prompt=negative_prompt,
+    transcription=transcript,
+    num_inference_steps=200,
+    audio_length_in_s=8.0,
+    num_waveforms_per_prompt=1,
+    generator=generator,
+    max_new_tokens=512
+).audios
+
+# save the first (and only) generated waveform as a 16 kHz .wav file
+scipy.io.wavfile.write("speech.wav", rate=16000, data=audio[0])
+```
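The snippet above passes the waveform straight to `scipy.io.wavfile.write`, which stores floating-point arrays as floating-point WAV data; some players and tools may expect 16-bit PCM instead. The sketch below shows one way to write 16-bit PCM using only the standard library. The helper name `write_pcm16_wav` and the output filename are hypothetical, the input is assumed to be mono samples in the range [-1.0, 1.0], and a synthetic tone stands in for the pipeline output so the example runs without a GPU.

```python
import math
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for `audio[0]`: one second of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
write_pcm16_wav("speech_pcm16.wav", tone)
```

With the pipeline output, the call would be `write_pcm16_wav("speech_pcm16.wav", audio[0])`, assuming `audio[0]` is a one-dimensional array of floats.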
+# Citation
+**BibTeX:**
+```
+@article{liu2023audioldm2,
+  title={AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining},
+  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
+  journal={arXiv preprint arXiv:2308.05734},
+  year={2023}
+}
+```