Audio Diffusion
Overview
Audio Diffusion by Robert Dargavel Smith.
Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to and from mel spectrogram images.
The original codebase of this implementation can be found here, including training scripts and example notebooks.
Available Pipelines:
Pipeline | Tasks | Colab |
---|---|---|
pipeline_audio_diffusion.py | Unconditional Audio Generation | |
Examples:
Audio Diffusion
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

# Generate a mel spectrogram image and the corresponding audio.
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
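To keep a result, the generated waveform can be written to disk. A minimal sketch using scipy (an extra dependency, not required by the pipeline), assuming the audio comes back as a float array in [-1, 1]; "generated.wav" is just a placeholder path:

import numpy as np
from scipy.io import wavfile

# Continuing from the example above: output.audios has shape (batch, channels, samples),
# so take the first (mono) track of the first sample.
audio = output.audios[0, 0]
wavfile.write("generated.wav", pipe.mel.get_sample_rate(), audio.astype(np.float32))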
Latent Audio Diffusion
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)

# The latent checkpoint runs the diffusion in a VAE latent space before decoding to a spectrogram.
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
Audio Diffusion with DDIM (faster)
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)

# The DDIM checkpoint needs far fewer de-noising steps, so inference is faster.
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
Variations, in-painting, out-painting, etc.
# Continuing from the previous example: start de-noising halfway through the schedule so
# the result stays close to the input audio, and leave one second of audio untouched
# (masked) at the start and at the end.
output = pipe(
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
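Since __call__ also accepts a pre-drawn noise tensor of shape (batch_size, 1, height, width) and the pipeline exposes slerp, two generations can be interpolated. A rough sketch continuing the session above (pipe, device, Audio and display already defined); the spectrogram size is read from the UNet config and is assumed to be square (256 for the audio-diffusion-256 checkpoint):

import torch

size = pipe.unet.config.sample_size  # assumed square, e.g. 256 for audio-diffusion-256
shape = (1, 1, size, size)
noise_a = torch.randn(shape, generator=torch.Generator().manual_seed(1)).to(device)
noise_b = torch.randn(shape, generator=torch.Generator().manual_seed(2)).to(device)

# Interpolate halfway between the two noise tensors and decode the result to audio.
output = pipe(noise=pipe.slerp(noise_a, noise_b, 0.5))
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))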
AudioDiffusionPipeline
class diffusers.AudioDiffusionPipeline
( vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler] )
Parameters
- vqvae (AutoencoderKL) — Variational Autoencoder for latent audio diffusion, or None
- unet (UNet2DConditionModel) — UNet model
- mel (Mel) — transforms audio <-> spectrogram
- scheduler (DDIMScheduler or DDPMScheduler) — de-noising scheduler
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
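Because the scheduler is an ordinary pipeline component, it can be replaced after loading, for example swapping the DDPM scheduler of a checkpoint for DDIM to cut the number of de-noising steps. A sketch, assuming the two schedulers share a compatible config:

import torch
from diffusers import DiffusionPipeline, DDIMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

# Swap the de-noising scheduler; DDIM typically needs far fewer steps than DDPM.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
output = pipe(steps=50)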
__call__
( batch_size: int = 1, audio_file: str = None, raw_audio: ndarray = None, slice: int = 0, start_step: int = 0, steps: int = None, generator: Generator = None, mask_start_secs: float = 0, mask_end_secs: float = 0, step_generator: Generator = None, eta: float = 0, noise: Tensor = None, encoding: Tensor = None, return_dict = True ) → List[PIL Image]
Parameters
- batch_size (int) — number of samples to generate
- audio_file (str) — must be a file on disk due to Librosa limitation, or
- raw_audio (np.ndarray) — audio as numpy array
- slice (int) — slice number of audio to convert
- start_step (int) — step to start from
- steps (int) — number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
- generator (torch.Generator) — random number generator or None
- mask_start_secs (float) — number of seconds of audio to mask (not generate) at start
- mask_end_secs (float) — number of seconds of audio to mask (not generate) at end
- step_generator (torch.Generator) — random number generator used to de-noise or None
- eta (float) — parameter between 0 and 1 used with DDIM scheduler
- noise (torch.Tensor) — noise tensor of shape (batch_size, 1, height, width) or None
- encoding (torch.Tensor) — for UNet2DConditionModel, of shape (batch_size, seq_length, cross_attention_dim)
- return_dict (bool) — if True, return AudioPipelineOutput and ImagePipelineOutput; otherwise a tuple
Returns
List[PIL Image] — mel spectrograms, plus (float, List[np.ndarray]) — sample rate and raw audios
Generate random mel spectrogram from audio input and convert to audio.
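For example, fixing generator makes a run reproducible, while steps and eta control the DDIM trade-off between speed and quality. A sketch assuming the DDIM checkpoint used in the examples above:

import torch
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)

# Seed the generator so the same spectrograms and audio are produced on every run.
generator = torch.Generator(device=device).manual_seed(42)
output = pipe(batch_size=2, steps=50, eta=0, generator=generator)

for image, audio in zip(output.images, output.audios):
    print(image.size, audio.shape)  # 256x256 spectrogram and (channels, samples) audio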
encode
( images: typing.List[PIL.Image.Image], steps: int = 50 ) → np.ndarray
Reverse step process: recover noisy image from generated image.
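Because encode reverses the de-noising steps, a generated spectrogram can be mapped back to noise and then re-synthesised, or blended with other noise via slerp. A rough sketch, assuming a DDIM checkpoint and that the returned array already has the (batch_size, 1, height, width) layout expected by the noise argument of __call__:

import torch

# Continuing from a session where `pipe` (a DDIM checkpoint), `device` and `output` exist.
noise = pipe.encode(output.images, steps=50)  # np.ndarray recovered from the images
noise = torch.from_numpy(noise).float().to(device)

# Re-running the diffusion from the recovered noise should reproduce similar audio.
reconstruction = pipe(noise=noise, steps=50)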
get_default_steps
( ) → int
Returns the default number of steps recommended for inference.
slerp
( x0: Tensor, x1: Tensor, alpha: float ) → torch.Tensor
Spherical Linear intERPolation
Mel
class diffusers.Mel
( x_res: int = 256, y_res: int = 256, sample_rate: int = 22050, n_fft: int = 2048, hop_length: int = 512, top_db: int = 80, n_iter: int = 32 )
Parameters
- x_res (int) — x resolution of spectrogram (time)
- y_res (int) — y resolution of spectrogram (frequency bins)
- sample_rate (int) — sample rate of audio
- n_fft (int) — number of Fast Fourier Transforms
- hop_length (int) — hop length (a higher number is recommended when y_res is below 256; see the sketch after this list)
- top_db (int) — loudest decibel value
- n_iter (int) — number of iterations for Griffin-Lim mel inversion
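For instance, a lower frequency resolution can be paired with a larger hop length, as the hop_length description above suggests. A sketch, assuming a diffusers version that still exports Mel at the top level:

from diffusers import Mel

# 64 mel bins instead of 256, with a larger hop_length as recommended for low y_res.
mel = Mel(x_res=256, y_res=64, sample_rate=22050, n_fft=2048, hop_length=1024)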
audio_slice_to_image
( slice: int ) → PIL Image
Convert slice of audio to spectrogram.
get_audio_slice
( slice: int = 0 ) → np.ndarray
Get slice of audio.
get_number_of_slices
( ) → int
Returns
int — number of spectrograms the audio can be sliced into
Get number of slices in audio.
get_sample_rate
( ) → int
Get sample rate.
image_to_audio
( image: Image ) → audio (np.ndarray)
Converts spectrogram to audio.
load_audio
( audio_file: str = None, raw_audio: ndarray = None )
Load audio.
set_resolution
( x_res: int, y_res: int )
Set resolution.
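Putting the Mel helpers together, audio can be sliced, rendered to a spectrogram image, and converted back to a waveform via Griffin-Lim. A sketch where "song.wav" is only a placeholder path:

from diffusers import Mel

mel = Mel()  # defaults: 256x256 spectrograms at 22050 Hz

# Load a file (or pass raw_audio=...) and inspect how many fixed-length slices it yields.
mel.load_audio(audio_file="song.wav")
print(mel.get_number_of_slices(), "slices at", mel.get_sample_rate(), "Hz")

# Convert the first slice to a spectrogram image, then invert it back to a waveform.
image = mel.audio_slice_to_image(0)
audio = mel.image_to_audio(image)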