AudioLDM 2
AudioLDM 2 was proposed in AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
Inspired by Stable Diffusion, AudioLDM 2 is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of CLAP and the encoder of Flan-T5. These text embeddings are then projected to a shared embedding space by an AudioLDM2ProjectionModel. A GPT2 language model (LM) is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The UNet of AudioLDM 2 is unique in the sense that it takes two cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs.
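The conditioning path described above can be traced with simple sequence-length bookkeeping. This is an illustrative sketch only: the helper name and the example lengths are hypothetical, and it assumes the projection model adds one learned SOS and one learned EOS vector to each encoder sequence, as the `projection_model` parameter description on this page suggests. It performs no model computation.

```python
def conditioning_lengths(clap_len, t5_len, new_tokens=8):
    """Trace sequence lengths through the AudioLDM 2 conditioning path.

    Hypothetical helper for illustration; not part of diffusers.
    """
    # Assumption: each projected sequence is wrapped in learned SOS/EOS vectors.
    projected_clap = clap_len + 2
    projected_t5 = t5_len + 2
    # The two projected sequences are concatenated to form the GPT2 input.
    lm_input = projected_clap + projected_t5
    return {
        "lm_input": lm_input,
        # GPT2 auto-regressively predicts `new_tokens` embedding vectors.
        "generated": new_tokens,
        # The UNet receives two cross-attention inputs:
        # the generated vectors and the Flan-T5 text embeddings.
        "cross_attention": (new_tokens, t5_len),
    }


lengths = conditioning_lengths(clap_len=1, t5_len=12)
print(lengths)
```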
The abstract of the paper is the following:
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at this https URL.
This pipeline was contributed by sanchit-gandhi and Nguyễn Công Tú Anh. The original codebase can be found at haoheliu/audioldm2.
Tips
Choosing a checkpoint
AudioLDM2 comes in several variants. Two of these checkpoints are applicable to the general task of text-to-audio generation, one is trained exclusively on text-to-music generation, and two are trained for text-to-speech.
All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. See the table below for details on the checkpoints:
Checkpoint | Task | UNet Model Size | Total Model Size | Training Data (hours) |
---|---|---|---|---|
audioldm2 | Text-to-audio | 350M | 1.1B | 1150k |
audioldm2-large | Text-to-audio | 750M | 1.5B | 1150k |
audioldm2-music | Text-to-music | 350M | 1.1B | 665k |
audioldm2-gigaspeech | Text-to-speech | 350M | 1.1B | 10k |
audioldm2-ljspeech | Text-to-speech | 350M | 1.1B | |
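As a small convenience, the table can be kept as data and filtered by task. The sketch below is hypothetical: the names are the table labels, not necessarily the full Hub repository IDs, and `None` marks the training-data value that the table leaves blank.

```python
# Rows copied from the table above; hours are in thousands.
CHECKPOINTS = [
    {"name": "audioldm2", "task": "Text-to-audio", "unet": "350M", "hours_k": 1150},
    {"name": "audioldm2-large", "task": "Text-to-audio", "unet": "750M", "hours_k": 1150},
    {"name": "audioldm2-music", "task": "Text-to-music", "unet": "350M", "hours_k": 665},
    {"name": "audioldm2-gigaspeech", "task": "Text-to-speech", "unet": "350M", "hours_k": 10},
    {"name": "audioldm2-ljspeech", "task": "Text-to-speech", "unet": "350M", "hours_k": None},
]


def checkpoints_for(task):
    """Return the table labels of all checkpoints trained for `task`."""
    return [row["name"] for row in CHECKPOINTS if row["task"].lower() == task.lower()]


print(checkpoints_for("text-to-audio"))
```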
Constructing a prompt
- Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. “high quality” or “clear”) and make the prompt context specific (e.g. “water stream in a forest” instead of “stream”).
- It’s best to use general terms like “cat” or “dog” instead of specific names or abstract objects the model may not be familiar with.
- Using a negative prompt can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of “Low quality.”
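The prompt tips above can be sketched as a small string helper. `build_prompt` is a hypothetical name introduced here for illustration; the pipeline call in the trailing comment assumes a pipeline object `pipe` that has already been loaded.

```python
def build_prompt(subject, adjectives=(), context=""):
    """Combine a general subject, a specific context and quality adjectives."""
    text = f"{subject} {context}".strip()
    if adjectives:
        text = f"{text}, {', '.join(adjectives)}"
    return text


prompt = build_prompt(
    "water stream",
    adjectives=("high quality", "clear"),
    context="in a forest",
)
negative_prompt = "Low quality."
print(prompt)

# With a loaded pipeline (not run here):
# audio = pipe(prompt, negative_prompt=negative_prompt).audios[0]
```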
Controlling inference
- The quality of the predicted audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
- The length of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
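To get a feel for what `audio_length_in_s` means for the output array, the sketch below estimates the number of samples in the generated waveform. It assumes a 16 kHz output, which is the rate used when writing `.wav` files in the examples on this page; the helper name is hypothetical.

```python
def expected_num_samples(audio_length_in_s, sampling_rate=16000):
    """Approximate waveform length for a requested duration.

    Assumes the output sampling rate is 16 kHz, as in the examples on this page.
    """
    return int(audio_length_in_s * sampling_rate)


print(expected_num_samples(10.0))

# With a loaded pipeline (not run here), fewer steps trade quality for speed:
# draft = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]
# final = pipe(prompt, num_inference_steps=200, audio_length_in_s=5.0).audios[0]
```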
Evaluating generated waveforms:
- The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
- Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
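The automatic scoring is described on this page as a cosine-similarity ranking in the joint text-audio embedding space. The toy sketch below shows only the ranking step, using hand-written vectors in place of real CLAP embeddings; the function names are hypothetical and this is not the pipeline's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_waveforms(text_embedding, audio_embeddings):
    """Return candidate indices ordered from best to worst match."""
    scores = [cosine_similarity(text_embedding, emb) for emb in audio_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for one text prompt and three generated waveforms.
text = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
order = rank_waveforms(text, candidates)
print(order)
```

After ranking, index 0 of the returned audios corresponds to the best-scoring candidate, which is why the examples on this page save `audio[0]`.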
The following example demonstrates how to construct good music and speech generation using the aforementioned tips: example.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
AudioLDM2Pipeline
class diffusers.AudioLDM2Pipeline
< source >( vae: AutoencoderKL text_encoder: ClapModel text_encoder_2: typing.Union[transformers.models.t5.modeling_t5.T5EncoderModel, transformers.models.vits.modeling_vits.VitsModel] projection_model: AudioLDM2ProjectionModel language_model: GPT2Model tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast] tokenizer_2: typing.Union[transformers.models.t5.tokenization_t5.T5Tokenizer, transformers.models.t5.tokenization_t5_fast.T5TokenizerFast, transformers.models.vits.tokenization_vits.VitsTokenizer] feature_extractor: ClapFeatureExtractor unet: AudioLDM2UNet2DConditionModel scheduler: KarrasDiffusionSchedulers vocoder: SpeechT5HifiGan )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- text_encoder (ClapModel) — First frozen text-encoder. AudioLDM2 uses the joint audio-text embedding model CLAP, specifically the laion/clap-htsat-unfused variant. The text branch is used to encode the text prompt to a prompt embedding. The full audio-text model is used to rank generated waveforms against the text prompt by computing similarity scores.
- text_encoder_2 ([`~transformers.T5EncoderModel`, `~transformers.VitsModel`]) — Second frozen text-encoder. AudioLDM2 uses the encoder of T5, specifically the google/flan-t5-large variant. For text-to-speech (TTS), the second frozen text-encoder is the encoder of Vits.
- projection_model (AudioLDM2ProjectionModel) — A trained model used to linearly project the hidden-states from the first and second text encoder models and insert learned SOS and EOS token embeddings. The projected hidden-states from the two text encoders are concatenated to give the input to the language model. A learned position embedding is used for the Vits hidden-states.
- language_model (GPT2Model) — An auto-regressive language model used to generate a sequence of hidden-states conditioned on the projected outputs from the two text encoders.
- tokenizer (RobertaTokenizer) — Tokenizer to tokenize text for the first frozen text-encoder.
- tokenizer_2 ([`~transformers.T5Tokenizer`, `~transformers.VitsTokenizer`]) — Tokenizer to tokenize text for the second frozen text-encoder.
- feature_extractor (ClapFeatureExtractor) — Feature extractor to pre-process generated audio waveforms to log-mel spectrograms for automatic scoring.
- unet (AudioLDM2UNet2DConditionModel) — An `AudioLDM2UNet2DConditionModel` to denoise the encoded audio latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
- vocoder (SpeechT5HifiGan) — Vocoder of class `SpeechT5HifiGan` to convert the mel-spectrogram latents to the final audio waveform.
Pipeline for text-to-audio generation using AudioLDM2.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
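Because the scheduler is a regular pipeline component, it can be swapped after loading. The sketch below wraps the usual `from_config` pattern in a hypothetical helper, `swap_scheduler`, so that it can be exercised without downloading weights; the commented usage assumes `diffusers` is installed and uses the `cvssp/audioldm2` checkpoint from the examples on this page.

```python
def swap_scheduler(pipe, scheduler_cls):
    """Replace `pipe.scheduler` with `scheduler_cls`, reusing the current config."""
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    return pipe


# Usage with a real pipeline (not run here; downloads model weights):
# from diffusers import AudioLDM2Pipeline, DDIMScheduler
# pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
# pipe = swap_scheduler(pipe, DDIMScheduler)
```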
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None transcription: typing.Union[str, typing.List[str]] = None audio_length_in_s: typing.Optional[float] = None num_inference_steps: int = 200 guidance_scale: float = 3.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_waveforms_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None generated_prompt_embeds: typing.Optional[torch.Tensor] = None negative_generated_prompt_embeds: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.LongTensor] = None negative_attention_mask: typing.Optional[torch.LongTensor] = None max_new_tokens: typing.Optional[int] = None return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: typing.Optional[int] = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None output_type: typing.Optional[str] = 'np' ) → AudioPipelineOutput or tuple
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide audio generation. If not defined, you need to pass `prompt_embeds`.
- transcription (`str` or `List[str]`, optional) — The transcript for text to speech.
- audio_length_in_s (`float`, optional, defaults to 10.24) — The length of the generated audio sample in seconds.
- num_inference_steps (`int`, optional, defaults to 200) — The number of denoising steps. More denoising steps usually lead to higher quality audio at the expense of slower inference.
- guidance_scale (`float`, optional, defaults to 3.5) — A higher guidance scale value encourages the model to generate audio that is closely linked to the text `prompt` at the expense of lower sound quality. Guidance scale is enabled when `guidance_scale > 1`.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide what to not include in audio generation. If not defined, you need to pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale <= 1`).
- num_waveforms_per_prompt (`int`, optional, defaults to 1) — The number of waveforms to generate per prompt. If `num_waveforms_per_prompt > 1`, then automatic scoring is performed between the generated outputs and the text prompt. This scoring ranks the generated waveforms based on their cosine similarity with the text input in the joint text-audio embedding space.
- eta (`float`, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — A `torch.Generator` to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for spectrogram generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
- generated_prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings from the GPT2 language model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_generated_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings from the GPT2 language model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings will be generated from the `negative_prompt` input argument.
- attention_mask (`torch.LongTensor`, optional) — Pre-computed attention mask to be applied to the `prompt_embeds`. If not provided, the attention mask will be computed from the `prompt` input argument.
- negative_attention_mask (`torch.LongTensor`, optional) — Pre-computed attention mask to be applied to the `negative_prompt_embeds`. If not provided, the attention mask will be computed from the `negative_prompt` input argument.
- max_new_tokens (`int`, optional, defaults to None) — Number of new tokens to generate with the GPT2 language model. If not provided, the number of tokens will be taken from the config of the model.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return an AudioPipelineOutput instead of a plain tuple.
- callback (`Callable`, optional) — A function that is called every `callback_steps` steps during inference. The function is called with the following arguments: `callback(step: int, timestep: int, latents: torch.Tensor)`.
- callback_steps (`int`, optional, defaults to 1) — The frequency at which the `callback` function is called. If not specified, the callback is called at every step.
- cross_attention_kwargs (`dict`, optional) — A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined in `self.processor`.
- output_type (`str`, optional, defaults to `"np"`) — The output format of the generated audio. Choose between `"np"` to return a NumPy `np.ndarray` or `"pt"` to return a PyTorch `torch.Tensor` object. Set to `"latent"` to return the latent diffusion model (LDM) output.
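The `callback` and `callback_steps` arguments can be used to monitor progress. The sketch below builds a callback with the documented `(step, timestep, latents)` signature that records which steps were reported; `make_progress_callback` is a hypothetical helper, and the pipeline call in the trailing comment assumes an already loaded `pipe`.

```python
def make_progress_callback(log):
    """Return a callback that appends (step, timestep) pairs to `log`."""

    def callback(step, timestep, latents):
        # `latents` is ignored here; it could be inspected or saved instead.
        log.append((step, int(timestep)))

    return callback


progress = []
callback = make_progress_callback(progress)

# With a loaded pipeline (not run here), report every 10th step:
# audio = pipe(prompt, callback=callback, callback_steps=10).audios[0]
```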
Returns
AudioPipelineOutput or tuple
If `return_dict` is `True`, an AudioPipelineOutput is returned, otherwise a `tuple` is returned where the first element is a list with the generated audio.
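Code that must work with either return style can normalise the output first. This is a minimal sketch with a hypothetical helper name; it relies only on what this page states, namely that the output object exposes `audios` (as used in the examples) and that the tuple form carries the generated audio as its first element.

```python
def first_audio(output):
    """Return the first generated waveform from either return style."""
    # Tuple form (return_dict=False): element 0 holds the generated audio.
    audios = output[0] if isinstance(output, tuple) else output.audios
    return audios[0]


# With a loaded pipeline (not run here):
# audio = first_audio(pipe(prompt, return_dict=False))
```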
The call function to the pipeline for generation.
Examples:
>>> import scipy
>>> import torch
>>> from diffusers import AudioLDM2Pipeline
>>> repo_id = "cvssp/audioldm2"
>>> pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> # define the prompts
>>> prompt = "The sound of a hammer hitting a wooden surface."
>>> negative_prompt = "Low quality."
>>> # set the seed for generator
>>> generator = torch.Generator("cuda").manual_seed(0)
>>> # run the generation
>>> audio = pipe(
...     prompt,
...     negative_prompt=negative_prompt,
...     num_inference_steps=200,
...     audio_length_in_s=10.0,
...     num_waveforms_per_prompt=3,
...     generator=generator,
... ).audios
>>> # save the best audio sample (index 0) as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
>>> # Using AudioLDM2 for text-to-speech
>>> import scipy
>>> import torch
>>> from diffusers import AudioLDM2Pipeline
>>> repo_id = "anhnct/audioldm2_gigaspeech"
>>> pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> # define the prompts
>>> prompt = "A female reporter is speaking"
>>> transcript = "wish you have a good day"
>>> # set the seed for generator
>>> generator = torch.Generator("cuda").manual_seed(0)
>>> # run the generation
>>> audio = pipe(
...     prompt,
...     transcription=transcript,
...     num_inference_steps=200,
...     audio_length_in_s=10.0,
...     num_waveforms_per_prompt=2,
...     generator=generator,
...     max_new_tokens=512,  # max_new_tokens must be set to 512 for TTS
... ).audios
>>> # save the best audio sample (index 0) as a .wav file
>>> scipy.io.wavfile.write("tts.wav", rate=16000, data=audio[0])
disable_vae_slicing
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step.
enable_model_cpu_offload
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its forward method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`.
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
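The memory-saving options described above can be toggled together. The sketch below assumes they are exposed on the pipeline as `enable_vae_slicing` and `enable_model_cpu_offload`; the wrapper `apply_memory_savers` is a hypothetical helper, written so that it can be tried against any object providing those two methods.

```python
def apply_memory_savers(pipe, slicing=True, cpu_offload=False):
    """Enable the requested memory-saving options on `pipe` and return it."""
    if slicing:
        # Decode the VAE output in slices to lower peak memory.
        pipe.enable_vae_slicing()
    if cpu_offload:
        # Keep only the currently running sub-model on the GPU.
        pipe.enable_model_cpu_offload()
    return pipe


# With a loaded pipeline (not run here):
# pipe = apply_memory_savers(pipe, slicing=True, cpu_offload=True)
```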
encode_prompt
< source >( prompt device num_waveforms_per_prompt do_classifier_free_guidance transcription = None negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None generated_prompt_embeds: typing.Optional[torch.Tensor] = None negative_generated_prompt_embeds: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.LongTensor] = None negative_attention_mask: typing.Optional[torch.LongTensor] = None max_new_tokens: typing.Optional[int] = None ) → prompt_embeds (`torch.Tensor`)
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt to be encoded.
- transcription (`str` or `List[str]`) — The transcription for text to speech.
- device (`torch.device`) — The torch device.
- num_waveforms_per_prompt (`int`) — The number of waveforms that should be generated per prompt.
- do_classifier_free_guidance (`bool`) — Whether to use classifier-free guidance or not.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the audio generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is not greater than `1`).
- prompt_embeds (`torch.Tensor`, optional) — Pre-computed text embeddings from the Flan T5 model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be computed from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-computed negative text embeddings from the Flan T5 model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be computed from the `negative_prompt` input argument.
- generated_prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings from the GPT2 language model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_generated_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings from the GPT2 language model. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings will be generated from the `negative_prompt` input argument.
- attention_mask (`torch.LongTensor`, optional) — Pre-computed attention mask to be applied to the `prompt_embeds`. If not provided, the attention mask will be computed from the `prompt` input argument.
- negative_attention_mask (`torch.LongTensor`, optional) — Pre-computed attention mask to be applied to the `negative_prompt_embeds`. If not provided, the attention mask will be computed from the `negative_prompt` input argument.
- max_new_tokens (`int`, optional, defaults to None) — The number of new tokens to generate with the GPT2 language model.
Returns
prompt_embeds (`torch.Tensor`): Text embeddings from the Flan T5 model.
attention_mask (`torch.LongTensor`): Attention mask to be applied to the `prompt_embeds`.
generated_prompt_embeds (`torch.Tensor`): Text embeddings generated from the GPT2 language model.
Encodes the prompt into text encoder hidden states.
Example:
>>> import scipy
>>> import torch
>>> from diffusers import AudioLDM2Pipeline
>>> repo_id = "cvssp/audioldm2"
>>> pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> # Get text embedding vectors
>>> prompt_embeds, attention_mask, generated_prompt_embeds = pipe.encode_prompt(
...     prompt="Techno music with a strong, upbeat tempo and high melodic riffs",
...     device="cuda",
...     num_waveforms_per_prompt=1,  # required: the signature gives it no default
...     do_classifier_free_guidance=True,
... )
>>> # Pass text embeddings to pipeline for text-conditional audio generation
>>> audio = pipe(
...     prompt_embeds=prompt_embeds,
...     attention_mask=attention_mask,
...     generated_prompt_embeds=generated_prompt_embeds,
...     num_inference_steps=200,
...     audio_length_in_s=10.0,
... ).audios[0]
>>> # save generated audio sample
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
generate_language_model
< source >( inputs_embeds: Tensor = None max_new_tokens: int = 8 **model_kwargs ) → inputs_embeds (`torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`)
Parameters
- inputs_embeds (`torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`) — The sequence used as a prompt for the generation.
- max_new_tokens (`int`) — Number of new tokens to generate.
- model_kwargs (`Dict[str, Any]`, optional) — Ad hoc parametrization of additional model-specific kwargs that will be forwarded to the `forward` function of the model.
Returns
inputs_embeds (`torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`)
The sequence of generated hidden-states.
Generates a sequence of hidden-states from the language model, conditioned on the embedding inputs.
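The auto-regressive loop behind this kind of generation can be illustrated with plain Python lists. This is a toy sketch, not the library's implementation: `toy_step` is a stand-in for the GPT2 forward pass, and it is an assumption here that only the newly generated vectors are returned rather than the full sequence.

```python
def generate_hidden_states(inputs_embeds, step_fn, max_new_tokens=8):
    """Append `max_new_tokens` predicted vectors and return only the new ones."""
    if max_new_tokens <= 0:
        return []
    sequence = list(inputs_embeds)
    for _ in range(max_new_tokens):
        # Each new vector is predicted from everything generated so far.
        sequence.append(step_fn(sequence))
    return sequence[-max_new_tokens:]


def toy_step(sequence):
    """Stand-in for the language model: the element-wise mean of the sequence."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


generated = generate_hidden_states([[1.0, 0.0], [0.0, 1.0]], toy_step, max_new_tokens=3)
print(len(generated))
```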
AudioLDM2ProjectionModel
class diffusers.AudioLDM2ProjectionModel
< source >( text_encoder_dim text_encoder_1_dim langauge_model_dim use_learned_position_embedding = None max_seq_length = None )
Parameters
- text_encoder_dim (`int`) — Dimensionality of the text embeddings from the first text encoder (CLAP).
- text_encoder_1_dim (`int`) — Dimensionality of the text embeddings from the second text encoder (T5 or VITS).
- langauge_model_dim (`int`) — Dimensionality of the text embeddings from the language model (GPT2).
A simple linear projection model to map two text embeddings to a shared latent space. It also inserts learned embedding vectors at the start and end of each text embedding sequence respectively. Each variable appended with `_1` refers to that corresponding to the second text encoder. Otherwise, it is from the first.
forward
< source >( hidden_states: typing.Optional[torch.Tensor] = None hidden_states_1: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.LongTensor] = None attention_mask_1: typing.Optional[torch.LongTensor] = None )
AudioLDM2UNet2DConditionModel
class diffusers.AudioLDM2UNet2DConditionModel
< source >( sample_size: typing.Optional[int] = None in_channels: int = 4 out_channels: int = 4 flip_sin_to_cos: bool = True freq_shift: int = 0 down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D') mid_block_type: typing.Optional[str] = 'UNetMidBlock2DCrossAttn' up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D') only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280) layers_per_block: typing.Union[int, typing.Tuple[int]] = 2 downsample_padding: int = 1 mid_block_scale_factor: float = 1 act_fn: str = 'silu' norm_num_groups: typing.Optional[int] = 32 norm_eps: float = 1e-05 cross_attention_dim: typing.Union[int, typing.Tuple[int]] = 1280 transformer_layers_per_block: typing.Union[int, typing.Tuple[int]] = 1 attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8 num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None use_linear_projection: bool = False class_embed_type: typing.Optional[str] = None num_class_embeds: typing.Optional[int] = None upcast_attention: bool = False resnet_time_scale_shift: str = 'default' time_embedding_type: str = 'positional' time_embedding_dim: typing.Optional[int] = None time_embedding_act_fn: typing.Optional[str] = None timestep_post_act: typing.Optional[str] = None time_cond_proj_dim: typing.Optional[int] = None conv_in_kernel: int = 3 conv_out_kernel: int = 3 projection_class_embeddings_input_dim: typing.Optional[int] = None class_embeddings_concat: bool = False )
Parameters
- sample_size (`int` or `Tuple[int, int]`, optional, defaults to `None`) — Height and width of input/output sample.
- in_channels (`int`, optional, defaults to 4) — Number of channels in the input sample.
- out_channels (`int`, optional, defaults to 4) — Number of channels in the output.
- flip_sin_to_cos (`bool`, optional, defaults to `True`) — Whether to flip the sin to cos in the time embedding.
- freq_shift (`int`, optional, defaults to 0) — The frequency shift to apply to the time embedding.
- down_block_types (`Tuple[str]`, optional, defaults to `("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")`) — The tuple of downsample blocks to use.
- mid_block_type (`str`, optional, defaults to `"UNetMidBlock2DCrossAttn"`) — Block type for the middle of the UNet; it can only be `UNetMidBlock2DCrossAttn` for AudioLDM2.
- up_block_types (`Tuple[str]`, optional, defaults to `("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D")`) — The tuple of upsample blocks to use.
- only_cross_attention (`bool` or `Tuple[bool]`, optional, defaults to `False`) — Whether to include self-attention in the basic transformer blocks, see `BasicTransformerBlock`.
- block_out_channels (`Tuple[int]`, optional, defaults to `(320, 640, 1280, 1280)`) — The tuple of output channels for each block.
- layers_per_block (`int`, optional, defaults to 2) — The number of layers per block.
- downsample_padding (`int`, optional, defaults to 1) — The padding to use for the downsampling convolution.
- mid_block_scale_factor (`float`, optional, defaults to 1.0) — The scale factor to use for the mid block.
- act_fn (`str`, optional, defaults to `"silu"`) — The activation function to use.
- norm_num_groups (`int`, optional, defaults to 32) — The number of groups to use for the normalization. If `None`, normalization and activation layers are skipped in post-processing.
- norm_eps (`float`, optional, defaults to 1e-5) — The epsilon to use for the normalization.
- cross_attention_dim (`int` or `Tuple[int]`, optional, defaults to 1280) — The dimension of the cross attention features.
- transformer_layers_per_block (`int` or `Tuple[int]`, optional, defaults to 1) — The number of transformer blocks of type `BasicTransformerBlock`. Only relevant for `~models.unet_2d_blocks.CrossAttnDownBlock2D`, `~models.unet_2d_blocks.CrossAttnUpBlock2D`, `~models.unet_2d_blocks.UNetMidBlock2DCrossAttn`.
- attention_head_dim (`int`, optional, defaults to 8) — The dimension of the attention heads.
- num_attention_heads (`int`, optional) — The number of attention heads. If not defined, defaults to `attention_head_dim`.
- resnet_time_scale_shift (`str`, optional, defaults to `"default"`) — Time scale shift config for ResNet blocks (see `ResnetBlock2D`). Choose from `default` or `scale_shift`.
- class_embed_type (`str`, optional, defaults to `None`) — The type of class embedding to use which is ultimately summed with the time embeddings. Choose from `None`, `"timestep"`, `"identity"`, `"projection"`, or `"simple_projection"`.
- num_class_embeds (`int`, optional, defaults to `None`) — Input dimension of the learnable embedding matrix to be projected to `time_embed_dim`, when performing class conditioning with `class_embed_type` equal to `None`.
- time_embedding_type (`str`, optional, defaults to `positional`) — The type of position embedding to use for timesteps. Choose from `positional` or `fourier`.
- time_embedding_dim (`int`, optional, defaults to `None`) — An optional override for the dimension of the projected time embedding.
- time_embedding_act_fn (`str`, optional, defaults to `None`) — Optional activation function to use only once on the time embeddings before they are passed to the rest of the UNet. Choose from `silu`, `mish`, `gelu`, and `swish`.
- timestep_post_act (`str`, optional, defaults to `None`) — The second activation function to use in timestep embedding. Choose from `silu`, `mish` and `gelu`.
- time_cond_proj_dim (`int`, optional, defaults to `None`) — The dimension of the `cond_proj` layer in the timestep embedding.
- conv_in_kernel (`int`, optional, defaults to `3`) — The kernel size of the `conv_in` layer.
- conv_out_kernel (`int`, optional, defaults to `3`) — The kernel size of the `conv_out` layer.
- projection_class_embeddings_input_dim (`int`, optional) — The dimension of the `class_labels` input when `class_embed_type="projection"`. Required when `class_embed_type="projection"`.
- class_embeddings_concat (`bool`, optional, defaults to `False`) — Whether to concatenate the time embeddings with the class embeddings.
A conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample shaped output. Compared to the vanilla UNet2DConditionModel, this variant optionally includes an additional self-attention layer in each Transformer block, as well as multiple cross-attention layers. It also allows for up to two cross-attention embeddings, `encoder_hidden_states` and `encoder_hidden_states_1`.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
forward
< source >( sample: Tensor timestep: typing.Union[torch.Tensor, float, int] encoder_hidden_states: Tensor class_labels: typing.Optional[torch.Tensor] = None timestep_cond: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True encoder_hidden_states_1: typing.Optional[torch.Tensor] = None encoder_attention_mask_1: typing.Optional[torch.Tensor] = None ) → UNet2DConditionOutput or tuple
Parameters
- sample (`torch.Tensor`) — The noisy input tensor with the following shape `(batch, channel, height, width)`.
- timestep (`torch.Tensor` or `float` or `int`) — The number of timesteps to denoise an input.
- encoder_hidden_states (`torch.Tensor`) — The encoder hidden states with shape `(batch, sequence_length, feature_dim)`.
- encoder_attention_mask (`torch.Tensor`) — A cross-attention mask of shape `(batch, sequence_length)` is applied to `encoder_hidden_states`. If `True` the mask is kept, otherwise if `False` it is discarded. Mask will be converted into a bias, which adds large negative values to the attention scores corresponding to “discard” tokens.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a UNet2DConditionOutput instead of a plain tuple.
- cross_attention_kwargs (`dict`, optional) — A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
- encoder_hidden_states_1 (`torch.Tensor`, optional) — A second set of encoder hidden states with shape `(batch, sequence_length_2, feature_dim_2)`. Can be used to condition the model on a different set of embeddings to `encoder_hidden_states`.
- encoder_attention_mask_1 (`torch.Tensor`, optional) — A cross-attention mask of shape `(batch, sequence_length_2)` is applied to `encoder_hidden_states_1`. If `True` the mask is kept, otherwise if `False` it is discarded. Mask will be converted into a bias, which adds large negative values to the attention scores corresponding to “discard” tokens.
Returns
UNet2DConditionOutput or tuple
If `return_dict` is `True`, a UNet2DConditionOutput is returned, otherwise a `tuple` is returned where the first element is the sample tensor.
The AudioLDM2UNet2DConditionModel forward method.
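The mask-to-bias conversion mentioned for `encoder_attention_mask` can be illustrated on a single row of attention scores. This is a minimal pure-Python sketch: the function names are hypothetical, and the magnitude of the negative value (`-10000.0`) is an assumption chosen for illustration rather than a value stated on this page.

```python
import math


def mask_to_bias(mask, negative_value=-10000.0):
    """Map keep/discard flags to an additive bias (0 for keep, large negative for discard)."""
    return [0.0 if keep else negative_value for keep in mask]


def masked_softmax(scores, mask):
    """Softmax over `scores` after adding the bias derived from `mask`."""
    biased = [s + b for s, b in zip(scores, mask_to_bias(mask))]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([1.0, 1.0, 1.0], [True, True, False])
print(weights)
```

The discarded position ends up with an attention weight that is effectively zero, while the kept positions share the remaining weight.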
AudioPipelineOutput
class diffusers.AudioPipelineOutput
< source >( audios: ndarray )
Output class for audio pipelines.