---
license: cc-by-nc-nd-4.0
---

# AudioLDM 2

AudioLDM 2 is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.

# Model Details

AudioLDM 2 was proposed in the paper [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al.

AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects,
human speech and music.

# Checkpoint Details

This is the **text-to-speech** checkpoint of AudioLDM 2, referred to as **audioldm2-gigaspeech** and fine-tuned on the GigaSpeech dataset.

There are three official AudioLDM 2 checkpoints: two are applicable to the general task of text-to-audio generation, and the
third is trained exclusively on text-to-music generation. In addition, there are two text-to-speech checkpoints fine-tuned
for speech generation. All checkpoints share the same model size for the text encoders and VAE; they differ in the size and
depth of the UNet. See the table below for details on the five checkpoints:

| Checkpoint                                                                 | Task           | UNet Model Size | Total Model Size | Training Data / h |
|----------------------------------------------------------------------------|----------------|-----------------|------------------|-------------------|
| [audioldm2](https://huggingface.co/cvssp/audioldm2)                        | Text-to-audio  | 350M            | 1.1B             | 1150k             |
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large)            | Text-to-audio  | 750M            | 1.5B             | 1150k             |
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music)            | Text-to-music  | 350M            | 1.1B             | 665k              |
| [audioldm2-gigaspeech](https://huggingface.co/anhnct/audioldm2_gigaspeech) | Text-to-speech | 350M            | 1.1B             | 10k               |
| [audioldm2-ljspeech](https://huggingface.co/anhnct/audioldm2_ljspeech)     | Text-to-speech | 350M            | 1.1B             |                   |

## Model Sources

- [**Original Repository**](https://github.com/haoheliu/audioldm2)
- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2)
- [**Paper**](https://arxiv.org/abs/2308.05734)
- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)

# Usage

First, install the required packages:

```
pip install --upgrade diffusers transformers accelerate
```
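
You can confirm that the installed version is recent enough directly from Python (as noted above, the pipeline is available from v0.21.0 onwards):

```python
import diffusers

# AudioLDM2Pipeline is available in 🧨 Diffusers from v0.21.0 onwards
print(diffusers.__version__)
```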

## Text-to-Speech

For text-to-speech generation, the [AudioLDM2Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2) can be
used to load pre-trained weights and generate text-conditional audio outputs:
```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

# load the pipeline in half precision and move it to the GPU
repo_id = "anhnct/audioldm2_gigaspeech"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts: `prompt` describes the speaking style,
# `transcript` is the text to be spoken
prompt = "A female actor speaking with an angry voice"
transcript = "wish you have a good day, i hope you never forget me"
negative_prompt = "low quality"

# fix the seed for reproducible generation
generator = torch.Generator("cuda").manual_seed(1)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    transcription=transcript,
    num_inference_steps=200,
    audio_length_in_s=8.0,
    num_waveforms_per_prompt=1,
    generator=generator,
    max_new_tokens=512,
).audios

# save the generated audio sample (index 0) as a 16 kHz .wav file
scipy.io.wavfile.write("tts_output.wav", rate=16000, data=audio[0])
```
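
As a minimal sketch of one way to improve output quality (continuing from the snippet above): per the 🧨 Diffusers documentation, setting `num_waveforms_per_prompt` greater than 1 makes the pipeline automatically score the candidate waveforms against the text prompt using its CLAP text encoder and return them best-first, so index 0 is the highest-ranked sample:

```python
# continuing from the snippet above: generate several candidates per prompt;
# the pipeline ranks them by similarity to the text prompt, best first
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    transcription=transcript,
    num_inference_steps=200,
    audio_length_in_s=8.0,
    num_waveforms_per_prompt=3,  # generate 3 candidates and rank them
    generator=generator,
    max_new_tokens=512,
).audios

# save the top-ranked candidate
scipy.io.wavfile.write("tts_best_of_3.wav", rate=16000, data=audio[0])
```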

# Citation

**BibTeX:**
```
@article{liu2023audioldm2,
  title={AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining},
  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
  journal={arXiv preprint arXiv:2308.05734},
  year={2023}
}
```