# AudioCraft objective metrics

In addition to training losses, AudioCraft provides a set of objective metrics
for audio synthesis and audio generation. As these metrics may require
extra dependencies and can be costly to compute, they are often disabled by default.
This section provides guidance for setting up and using these metrics in
the AudioCraft training pipelines.
## Available metrics

### Audio synthesis quality metrics
#### SI-SNR

We provide an implementation of the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) in PyTorch.
This metric has no extra requirements.

**Warning:** We report the opposite of the SI-SNR, i.e. the SI-SNR multiplied by `-1`. This is because
the SI-SNR score can also be used as a training loss function, where lower
values should indicate better reconstruction. Negative values are therefore expected and a good sign!
Multiply them by `-1` again before publication :)

Activate the metric at the evaluation stage with the appropriate flag:

```shell
dora run <...> evaluate.metrics.sisnr=true
```
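
To make the sign convention concrete, here is a minimal plain-Python sketch of the standard SI-SNR definition. It is not AudioCraft's PyTorch implementation: the function names `si_snr`, `reported_value` and `published_value` are hypothetical, and the zero-mean step and `eps` stabiliser are assumptions taken from the usual formulation of the metric.

```python
import math


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length lists of samples."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    n = len(reference)
    # Remove the mean of each signal (assumed here, as in the common definition).
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]
    # Project the estimate onto the reference to obtain the scaled target.
    ref_energy = sum(x * x for x in ref) + eps
    scale = sum(r * e for r, e in zip(ref, est)) / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def reported_value(reference, estimate):
    """Negated SI-SNR: lower is better, so it can double as a loss."""
    return -si_snr(reference, estimate)


def published_value(reported):
    """Flip the sign back before reporting results."""
    return -reported
```

With this convention, a good reconstruction yields a large positive SI-SNR and therefore a strongly negative reported value.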
#### ViSQOL

We provide a Python wrapper around the ViSQOL [official implementation](https://github.com/google/visqol)
to conveniently run ViSQOL within the training pipelines.

To enable ViSQOL computations in AudioCraft, specify the path to the ViSQOL installation
through the configuration:

```shell
# the first parameter activates the ViSQOL computation while the second specifies
# the path to the ViSQOL library used by our Python wrapper
dora run <...> evaluate.metrics.visqol=true metrics.visqol.bin=<path_to_visqol>
```

See an example grid: [Compression with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)

To learn more about ViSQOL and how to build the ViSQOL binary using Bazel, please refer to the
instructions available in the [open source repository](https://github.com/google/visqol).
### Audio generation metrics

#### Frechet Audio Distance

Similarly to ViSQOL, we use a Python wrapper around the Frechet Audio Distance
[official implementation](https://github.com/google-research/google-research/tree/master/frechet_audio_distance)
in TensorFlow.

Note that we had to make several changes to the original code in order to make it work.
Please refer to the [FrechetAudioDistanceMetric](../audiocraft/metrics/fad.py) class documentation
for more details. We do not plan to provide further support in obtaining a working setup for the
Frechet Audio Distance at this stage.

```shell
# the first parameter activates the FAD metric computation while the second specifies
# the path to the FAD library used by our Python wrapper
dora run <...> evaluate.metrics.fad=true metrics.fad.bin=<path_to_google_research_repository>
```

See an example grid: [Evaluation with FAD](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)
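
For intuition about what the Frechet Audio Distance measures, the sketch below computes the Frechet distance between two Gaussians fitted to embedding sets, `||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))`. It is a deliberately simplified illustration and not the wrapped TensorFlow code: it assumes diagonal covariances so that only the standard library is needed, whereas the actual metric uses full covariance matrices over embeddings from a pretrained audio model. The function names are hypothetical.

```python
import math


def mean_and_variance(embeddings):
    """Per-dimension mean and population variance of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(v[d] for v in embeddings) / n for d in range(dim)]
    variances = [
        sum((v[d] - means[d]) ** 2 for v in embeddings) / n for d in range(dim)
    ]
    return means, variances


def frechet_distance_diagonal(mu_a, var_a, mu_b, var_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    With diagonal covariances the trace term reduces to a per-dimension sum
    of var_a + var_b - 2 * sqrt(var_a * var_b).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term
```

The distance is zero when both sets share the same statistics and grows as the means or spreads of the two embedding distributions drift apart, which is why lower values are better.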
#### Kullback-Leibler Divergence

We provide a PyTorch implementation of the Kullback-Leibler Divergence (KLD) computed over the label
probabilities predicted by a state-of-the-art audio classifier. Our implementation of the KLD
uses the [PaSST classifier](https://github.com/kkoutini/PaSST).

To use the KLD metric with PaSST, you must install the PaSST library as an extra dependency:
```shell
pip install 'git+https://github.com/kkoutini/passt_hear21@0.0.19#egg=hear21passt'
```
Then, similarly, enable the metric with the corresponding flag:

```shell
# the KLD metric could be extended with additional audio classifier models, selectable through the configuration
dora run <...> evaluate.metrics.kld=true metrics.kld.model=passt
```
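
The idea behind the KLD metric can be sketched in a few lines of plain Python: turn classifier logits into label probabilities, then compare the two distributions. This is an illustration only, not AudioCraft's PyTorch code. The helper names are hypothetical, and treating the reference clip as `p` and the generated clip as `q`, as well as the `eps` smoothing, are assumptions of this sketch.

```python
import math


def softmax(logits):
    """Convert classifier logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) over class probabilities.

    p: label probabilities for the reference audio (assumed direction).
    q: label probabilities for the generated audio.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical logits for a 3-class classifier.
reference_probs = softmax([2.0, 0.5, -1.0])
generated_probs = softmax([1.5, 1.0, -0.5])
divergence = kl_divergence(reference_probs, generated_probs)
```

A value of zero means the classifier assigns identical label distributions to both clips; larger values indicate that the generated audio is perceived as belonging to different classes than the reference.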
#### Text consistency

We provide a text-consistency metric, similar to the MuLan Cycle Consistency from
[MusicLM](https://arxiv.org/pdf/2301.11325.pdf) or the CLAP score used in
[Make-An-Audio](https://arxiv.org/pdf/2301.12661v1.pdf).
More specifically, we provide a PyTorch implementation of a text-consistency metric
relying on a pre-trained [Contrastive Language-Audio Pretraining (CLAP)](https://github.com/LAION-AI/CLAP) model.

Please install the CLAP library as an extra dependency prior to using the metric:

```shell
pip install laion_clap
```

Then, similarly, enable the metric with the corresponding flag:

```shell
# the text consistency metric could be extended with additional models, selectable through the configuration
dora run <...> evaluate.metrics.text_consistency=true metrics.text_consistency.model=clap
```

Note that the text consistency metric based on CLAP requires the CLAP checkpoint to be
provided in the configuration.
#### Chroma cosine similarity

Finally, as introduced in MusicGen, we provide a Chroma Cosine Similarity metric in PyTorch.
This metric has no extra requirements. Activate the metric at the
evaluation stage with the appropriate flag:

```shell
dora run <...> evaluate.metrics.chroma_cosine=true
```
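
As a rough illustration of the idea, the sketch below averages the cosine similarity between time-aligned chroma frames of a reference and a generated clip. It is a plain-Python approximation, not AudioCraft's implementation: chroma extraction is left out, the frames are assumed to be 12-bin vectors, and truncating both sequences to their shared length is an assumption of this sketch. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def chroma_cosine_similarity(ref_frames, gen_frames):
    """Mean cosine similarity over frames at the same time step."""
    n = min(len(ref_frames), len(gen_frames))  # compare the shared duration only
    if n == 0:
        raise ValueError("at least one chroma frame is required")
    return sum(
        cosine_similarity(ref_frames[t], gen_frames[t]) for t in range(n)
    ) / n


def one_hot_chroma(pitch_class, bins=12):
    """Toy quantized chroma frame with a single active pitch class."""
    return [1.0 if i == pitch_class else 0.0 for i in range(bins)]
```

For example, two clips whose dominant pitch class matches on half of the frames score 0.5 with these toy one-hot frames, and identical chroma sequences score 1.0.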
#### Comparing against reconstructed audio

For all the above audio generation metrics, we offer the option to compute the metric on the reconstructed audio,
i.e. the ground-truth audio passed through EnCodec, instead of the generated sample, using the flag `<metric>.use_gt=true`.
## Example usage

You will find examples of configuration for the different metrics introduced above in:

* The [MusicGen default solver](../config/solver/musicgen/default.yaml) for all audio generation metrics
* The [compression default solver](../config/solver/compression/default.yaml) for all audio synthesis metrics

Similarly, we provide different examples in our grids:

* [Evaluation with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)
* [Evaluation with FAD and others](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)
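
When scripting evaluations, it can help to assemble the `key=value` overrides shown in the commands above from a dictionary. The helper below is a hypothetical convenience, not part of AudioCraft or Dora; it only formats strings, and whether a given combination of metrics can run together in one job depends on the solver configuration, which this sketch does not check.

```python
def build_overrides(settings):
    """Format a flat {key: value} mapping as `key=value` override strings."""
    parts = []
    for key, value in settings.items():
        if isinstance(value, bool):
            # The commands in this document spell booleans in lowercase.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return parts


# Keys taken from the commands above; the selection itself is illustrative.
generation_metrics = {
    "evaluate.metrics.kld": True,
    "metrics.kld.model": "passt",
    "evaluate.metrics.chroma_cosine": True,
}
override_string = " ".join(build_overrides(generation_metrics))
```

The resulting string can be appended to a `dora run <...>` invocation in place of hand-typed flags.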