metadata

title: WavJourney
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: cc-by-nc-nd-4.0

🎵 WavJourney: Compositional Audio Creation with LLMs

This repository contains the official implementation of "WavJourney: Compositional Audio Creation with Large Language Models".

Starting with a text prompt, WavJourney can create audio content with engaging storylines encompassing personalized speakers, lifelike speech in context, emotionally resonant music compositions, and impactful sound effects that enhance the auditory experience. Check out the audio examples on the Project Page!

Preliminaries

Install the environment:

bash ./scripts/EnvsSetup.sh

Activate the conda environment:

conda activate WavJourney

Set your OpenAI-Key in config.yaml to access the GPT-4 API [Guidance]. Please make sure the port set as 'Service-Port' is not occupied. You can also modify other settings; see the details described in the configuration file.

Pre-download the models (this might take some time):

python scripts/download_models.py

Start the Python API services (e.g., Text-to-Speech, Text-to-Audio):

bash scripts/start_services.sh
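
If the services fail to start, one thing worth checking is whether the port set as 'Service-Port' is already in use. The following is a minimal sketch, not part of this repository, that tests whether a local TCP port can be bound. The helper name `port_is_free` and the port number `8021` are placeholders chosen for illustration; substitute the value from your own config.yaml.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port), else False."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
        return True


if __name__ == "__main__":
    # Placeholder port: replace with the 'Service-Port' value from config.yaml.
    service_port = 8021
    state = "free" if port_is_free(service_port) else "occupied"
    print(f"Port {service_port} is {state}")
```

The check only reflects the moment it runs, so another process could still claim the port before the services start.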

Web APP

bash scripts/start_ui.sh

Commandline Usage

python wavjourney_cli.py -f --input-text "Generate a one-minute introduction to quantum mechanics"

Kill the services

You can kill the running services via this command:

python scripts/kill_services.py

(Advanced features) Speaker customization

You can add voice presets to WavJourney to customize the voice actors. Simply provide the voice id, the description and a sample wav file, and WavJourney will pick the voice automatically based on the audio script. Predefined system voice presets are in data/voice_presets.

You can manage voice presets via the UI. To add a voice to the voice presets from the command line instead, run the script below:

python add_voice_preset.py --id "id" --desc "description" --wav-path path/to/wav --session-id ''

What makes for a good voice prompt? See detailed instructions here.
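
To register several voices in one go, the command above can be wrapped in a small script. This is a sketch under stated assumptions: it uses only the flags shown above (`--id`, `--desc`, `--wav-path`, `--session-id`), while the helper names and the example preset values are hypothetical and not part of WavJourney. By default it only builds the commands (a dry run) so you can inspect them before executing anything.

```python
import subprocess
from pathlib import Path


def build_add_voice_command(voice_id, description, wav_path, session_id=""):
    """Build the argument list for add_voice_preset.py with the documented flags."""
    return [
        "python", "add_voice_preset.py",
        "--id", voice_id,
        "--desc", description,
        "--wav-path", str(wav_path),
        "--session-id", session_id,
    ]


def add_voice_presets(presets, dry_run=True):
    """Build (and, unless dry_run is set, execute) one command per preset.

    `presets` is an iterable of (voice_id, description, wav_path) tuples.
    """
    commands = []
    for voice_id, description, wav_path in presets:
        if not dry_run and not Path(wav_path).is_file():
            raise FileNotFoundError(f"Sample wav file not found: {wav_path}")
        command = build_add_voice_command(voice_id, description, wav_path)
        commands.append(command)
        if not dry_run:
            # Run from the repository root so add_voice_preset.py is found.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical presets, for illustration only.
    example_presets = [
        ("narrator", "A calm, neutral narrator voice", "voices/narrator.wav"),
        ("host", "An energetic radio host voice", "voices/host.wav"),
    ]
    for command in add_voice_presets(example_presets, dry_run=True):
        print(command)
```

Passing `dry_run=False` runs each command in turn and stops at the first failure, since `check=True` raises on a non-zero exit code.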

Hardware requirement

The VRAM of the GPU in the default configuration should be greater than 16 GB.
Operating system: Linux.
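
As a rough way to check the VRAM requirement on an NVIDIA GPU, the sketch below queries total memory through `nvidia-smi` and compares it with the 16 GB figure above. It is not part of this repository. It assumes `nvidia-smi` is on the PATH and reads "16 GB" as 16 × 1024 MiB, which is an interpretation; the function names are illustrative.

```python
import subprocess
from typing import List

# Assumption: "16 GB" is interpreted as 16 * 1024 MiB (nvidia-smi reports MiB).
REQUIRED_VRAM_MIB = 16 * 1024


def parse_total_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_vram_mib() -> List[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_vram_mib(result.stdout)


def meets_requirement(vram_mib: List[int], required_mib: int = REQUIRED_VRAM_MIB) -> bool:
    """Return True if at least one GPU has strictly more than the required VRAM."""
    return any(total > required_mib for total in vram_mib)


if __name__ == "__main__":
    try:
        totals = query_total_vram_mib()
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("Could not query nvidia-smi; is an NVIDIA driver installed?")
    else:
        print(f"Total VRAM per GPU (MiB): {totals}")
        print("Requirement met" if meets_requirement(totals) else "Requirement not met")
```

Total memory does not account for memory already used by other processes, so treat a passing result as a necessary condition rather than a guarantee.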

Citation

If you find this work useful, you can cite the paper below:

@article{liu2023wavjourney,
    title   = {WavJourney: Compositional Audio Creation with Large Language Models},
    author  = {Liu, Xubo and Zhu, Zhongkai and Liu, Haohe and Yuan, Yi and Huang, Qiushi and Liang, Jinhua and Cao, Yin and Kong, Qiuqiang and Plumbley, Mark D and Wang, Wenwu},
    journal = {arXiv preprint arXiv:2307.14335},
    year    = {2023}
}

Appreciation

Bark for a zero-shot text-to-speech synthesis model.
AudioCraft for state-of-the-art audio generation models.

Disclaimer

We are not responsible for audio generated using semantics created by this model. Just don't use it for illegal purposes.