Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization

🎵 We developed Tango 2 building upon Tango for text-to-audio generation. Tango 2 was initialized with the Tango-full-ft checkpoint and underwent alignment training using DPO on audio-alpaca, a pairwise text-to-audio preference dataset. Tango-2-full was trained on an extended version of Audio-alpaca 🎶

Read the paper

Code

Our code is released here: https://github.com/declare-lab/tango

Please follow the instructions in the repository for installation, usage and experiments.

Quickstart Guide

Download the Tango 2 model and generate audio from a text prompt:

import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2-full")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

The model will be automatically downloaded and saved in cache. Subsequent runs will load the model directly from cache.

The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for generating better quality audios. This comes at the cost of increased run-time.

prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)

Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:

prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)

This will generate two samples for each of the three text prompts.

declare-lab
/

tango2-full

Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization

Code

Quickstart Guide

Dataset used to train declare-lab/tango2-full

Space using declare-lab/tango2-full 1