SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound
Rishit Dagli¹ · Shivesh Prakash¹ · Rupert Wu¹ · Houman Khosravani¹,²,³
¹University of Toronto · ²Temerty Centre for Artificial Intelligence Research and Education in Medicine · ³Sunnybrook Research Institute
This work presents SEE-2-SOUND, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our website to view some results of this work.
These checkpoints are meant to be used with our code: SEE-2-SOUND.
Installation
First, install the pip package and download these checkpoints (needs Git LFS):
pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound
git clone https://huggingface.co/rishitdagli/see-2-sound
cd see-2-sound
View the full installation instructions, as well as tips on dependencies, in the repository README.
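If Git LFS is not set up when cloning, the `.pth` files arrive as small text pointers instead of real checkpoints. A minimal sketch for spotting that case is below; it relies only on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and not part of see2sound.

```python
from pathlib import Path

# First line of every Git LFS pointer file (Git LFS spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like an un-fetched Git LFS pointer file."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(repo_dir: Path, pattern: str = "*.pth") -> list[Path]:
    """List files under `repo_dir` matching `pattern` that are still LFS pointers."""
    return sorted(
        p for p in repo_dir.rglob(pattern) if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers(Path("see-2-sound"))` from the directory where you cloned the checkpoints should return an empty list. If it does not, running `git lfs install` followed by `git lfs pull` inside the clone usually fetches the real files.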
Running the Models
Start by creating a configuration file called config.yaml:
codi_encoder: 'codi/codi_encoder.pth'
codi_text: 'codi/codi_text.pth'
codi_audio: 'codi/codi_audio.pth'
codi_video: 'codi/codi_video.pth'
sam: 'sam/sam.pth'
# H, L or B in decreasing performance
sam_size: 'H'
depth: 'depth/depth.pth'
# L, B, or S in decreasing performance
depth_size: 'L'
download: False
# Change to True if your GPU has < 40 GB VRAM
low_mem: False
fp16: False
gpu: True
steps: 500
num_audios: 3
prompt: ''
verbose: True
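Before running the model, it can help to confirm that every checkpoint path in config.yaml points at a real file. The sketch below is one way to do that with the standard library only. It assumes the file stays flat (one `key: value` per line, as above) and that relative paths are resolved against the directory you run from; `read_flat_config` and `missing_checkpoints` are hypothetical helpers, not part of see2sound, and the parser is deliberately not a general YAML parser.

```python
from pathlib import Path


def read_flat_config(config_path: Path) -> dict[str, str]:
    """Parse a flat `key: value` file such as the config.yaml above.

    Only handles one-level `key: value` lines, full-line `#` comments,
    and optionally quoted values. All values are returned as strings.
    """
    config = {}
    for raw_line in config_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'\"")
    return config


def missing_checkpoints(config: dict[str, str], base_dir: Path) -> list[str]:
    """Return config keys whose `.pth` file does not exist under `base_dir`."""
    return [
        key
        for key, value in config.items()
        if value.endswith(".pth") and not (base_dir / value).is_file()
    ]
```

For example, `missing_checkpoints(read_flat_config(Path("config.yaml")), Path("."))` returns the names of any entries whose files are absent. Note that a value starting with `/` is treated as an absolute path and is not joined onto `base_dir`.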
With the configuration file in place, run inference:
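import see2sound
config_file_path = "config.yaml"
# Initialize the pipeline from the configuration file
model = see2sound.See2Sound(config_path=config_file_path)
model.setup()
# Generate audio for test.png and save it to test.wav
model.run(path="test.png", output_path="test.wav")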
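To process several inputs with one loaded model, the `run` call above can be wrapped in a loop. The following is a minimal sketch, assuming only the `run(path=..., output_path=...)` signature shown in the snippet; `run_batch` is a hypothetical helper and does not exist in see2sound.

```python
from pathlib import Path


def run_batch(model, input_paths, output_dir):
    """Call `model.run` once per input and return the requested output paths.

    `model` is any object exposing `run(path=..., output_path=...)`.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for input_path in map(Path, input_paths):
        # Map e.g. test.png to <output_dir>/test.wav
        output_path = output_dir / input_path.with_suffix(".wav").name
        model.run(path=str(input_path), output_path=str(output_path))
        outputs.append(output_path)
    return outputs
```

With the `model` object from the snippet above, `run_batch(model, ["a.png", "b.png"], "outputs")` would request `outputs/a.wav` and `outputs/b.wav`. Inputs that share a file stem map to the same output name, so rename them first if that matters.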
More Information
Feel free to take a look at the full documentation for extra information and tips on running the model.
Evaluation results
- AViTAR Marginal Scene Guidance - Mel-Frequency Cepstral Coefficient - Dynamic Time Warping on SEE-2-SOUND Evaluation Dataset (arXiv): 0.03 × 10^-3
- AViTAR Marginal Scene Guidance - Zero Crossing Rate on SEE-2-SOUND Evaluation Dataset (arXiv): 0.950
- Chroma Feature on SEE-2-SOUND Evaluation Dataset (arXiv): 0.770
- AViTAR Marginal Scene Guidance - Spectral Score on SEE-2-SOUND Evaluation Dataset (arXiv): 0.950