reach-vb HF staff

Stereo demo update (#60)

5325fcc about 1 year ago


	# AudioCraft objective metrics

	In addition to training losses, AudioCraft provides a set of objective metrics
	for audio synthesis and audio generation. As these metrics may require
	extra dependencies and can be costly to compute, they are often disabled by default.
	This section provides guidance for setting up and using these metrics in
	the AudioCraft training pipelines.

	## Available metrics

	### Audio synthesis quality metrics

	#### SI-SNR

	We provide an implementation of the Scale-Invariant Signal-to-Noise Ratio in PyTorch.
	This metric has no extra requirements. Activate it at the
	evaluation stage with the appropriate flag:

	```shell
	dora run <...> evaluate.metrics.sisnr=true
	```

	Warning: We report the opposite of the SI-SNR, i.e. the score multiplied by -1. This is because
	the SI-SNR score can also be used as a training loss function, where lower
	values should indicate better reconstruction. Negative values are therefore expected and a good sign! They should be multiplied by `-1` again before publication :)
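To make the sign convention concrete, here is a minimal pure-Python sketch of a textbook SI-SNR computation and its negated "loss" form. It is an illustration only, not AudioCraft's PyTorch implementation: the function names `si_snr` and `si_snr_loss`, the mean removal step, and the `eps` constant are assumptions made for this example.

```python
import math


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR (in dB) between two equal-length 1-D signals."""
    n = len(reference)
    # Remove the mean of each signal (a common convention, assumed here).
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to get the "target" component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_loss(reference, estimate):
    """Negated SI-SNR: lower is better, so it can serve as a loss."""
    return -si_snr(reference, estimate)


# Hypothetical signals: a sine wave and a slightly corrupted copy.
reference = [math.sin(0.1 * i) for i in range(200)]
estimate = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]

reported = si_snr_loss(reference, estimate)  # negative for a good reconstruction
published = -reported                        # flip the sign back before reporting
```

A good reconstruction gives a large positive SI-SNR, so the reported (negated) value is negative, which matches the warning above.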

	#### ViSQOL

	We provide a Python wrapper around the ViSQOL [official implementation](https://github.com/google/visqol)
	to conveniently run ViSQOL within the training pipelines.

	One must specify the path to the ViSQOL installation through the configuration in order
	to enable ViSQOL computations in AudioCraft:

	```shell
	# the first parameter activates the ViSQOL computation, while the second specifies
	# the path to the ViSQOL library used by our Python wrapper
	dora run <...> evaluate.metrics.visqol=true metrics.visqol.bin=<path_to_visqol>
	```

	See an example grid: [Compression with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)

	To learn more about ViSQOL and how to build the ViSQOL binary using Bazel, please refer to the
	instructions available in the [open source repository](https://github.com/google/visqol).

	### Audio generation metrics

	#### Frechet Audio Distance

	Similarly to ViSQOL, we use a Python wrapper around the Frechet Audio Distance
	[official implementation](https://github.com/google-research/google-research/tree/master/frechet_audio_distance)
	in TensorFlow.

	Note that we had to make several changes to the actual code in order to make it work.
	Please refer to the [FrechetAudioDistanceMetric](../audiocraft/metrics/fad.py) class documentation
	for more details. We do not plan to provide further support in obtaining a working setup for the
	Frechet Audio Distance at this stage.

	```shell
	# the first parameter activates the FAD metric computation, while the second specifies
	# the path to the FAD library used by our Python wrapper
	dora run <...> evaluate.metrics.fad=true metrics.fad.bin=<path_to_google_research_repository>
	```

	See an example grid: [Evaluation with FAD](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)
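For intuition about what the metric measures: the Fréchet distance compares two sets of embeddings (VGGish embeddings in the official implementation) by fitting a Gaussian to each set. The sketch below is a deliberately simplified illustration that assumes diagonal covariance matrices so that it runs with the standard library alone. The real metric uses full covariance matrices and a matrix square root, and the function names here are hypothetical.

```python
import math


def mean_and_var(embeddings):
    """Per-dimension mean and unbiased variance of a list of vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    var = [sum((e[j] - mean[j]) ** 2 for e in embeddings) / (n - 1)
           for j in range(dim)]
    return mean, var


def frechet_distance_diag(emb_a, emb_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    Simplification of ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)),
    valid only under the diagonal-covariance assumption made for this sketch.
    """
    mu_a, var_a = mean_and_var(emb_a)
    mu_b, var_b = mean_and_var(emb_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


# Hypothetical 2-D "embeddings" for reference and generated audio.
reference_emb = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
generated_emb = [[x + 1.0, y + 2.0] for x, y in reference_emb]

distance = frechet_distance_diag(reference_emb, generated_emb)
```

Lower values mean the two embedding distributions are closer. Identical sets give a distance of zero.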

	#### Kullback-Leibler Divergence

	We provide a PyTorch implementation of the Kullback-Leibler Divergence computed over the probabilities
	of the labels obtained by a state-of-the-art audio classifier. We provide our implementation of the KLD
	using the [PaSST classifier](https://github.com/kkoutini/PaSST).

	In order to use the KLD metric over PaSST, you must install the PaSST library as an extra dependency:
	```shell
	pip install 'git+https://github.com/kkoutini/passt_hear21@0.0.19#egg=hear21passt'
	```

	Then, similarly, you can enable the metric by activating the corresponding flag:

	```shell
	# one could extend the kld metric with additional audio classifier models that can then be picked through the configuration
	dora run <...> evaluate.metrics.kld=true metrics.kld.model=passt
	```
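As a rough illustration of the quantity involved, the sketch below computes a KL divergence between two label distributions in pure Python. The logits are made-up stand-ins for classifier outputs on a reference clip and a generated clip. The helper names, the `eps` smoothing, and the choice of the reference distribution as the first argument are assumptions for this example, not a description of the AudioCraft code.

```python
import math


def softmax(logits):
    """Turn raw classifier scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) between two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical classifier logits over four labels.
reference_probs = softmax([2.0, 0.5, 0.1, -1.0])
generated_probs = softmax([1.5, 1.0, 0.0, -0.5])

kld = kl_divergence(reference_probs, generated_probs)
```

The value is zero when both clips yield the same label distribution and grows as the distributions diverge, so lower is better.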

	#### Text consistency

	We provide a text-consistency metric, similar to the MuLan Cycle Consistency from
	[MusicLM](https://arxiv.org/pdf/2301.11325.pdf) or the CLAP score used in
	[Make-An-Audio](https://arxiv.org/pdf/2301.12661v1.pdf).
	More specifically, we provide a PyTorch implementation of a text consistency metric
	relying on a pre-trained [Contrastive Language-Audio Pretraining (CLAP)](https://github.com/LAION-AI/CLAP) model.

	Please install the CLAP library as an extra dependency prior to using the metric:
	```shell
	pip install laion_clap
	```

	Then, similarly, you can enable the metric by activating the corresponding flag:

	```shell
	# one could extend the text consistency metric with additional text-audio embedding models that can then be picked through the configuration
	dora run <...> evaluate.metrics.text_consistency=true metrics.text_consistency.model=clap
	```

	Note that the text consistency metric based on CLAP will require the CLAP checkpoint to be
	provided in the configuration.
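Conceptually, a CLAP-style score compares the embedding of each text prompt with the embedding of the corresponding audio using cosine similarity. The sketch below shows that computation on hand-written vectors. No CLAP model is loaded, and the function names and example embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def text_consistency(text_embeddings, audio_embeddings):
    """Average cosine similarity over paired (text, audio) embeddings."""
    scores = [cosine_similarity(t, a)
              for t, a in zip(text_embeddings, audio_embeddings)]
    return sum(scores) / len(scores)


# Hypothetical 3-D embeddings for two (prompt, generated audio) pairs.
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio_emb = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

score = text_consistency(text_emb, audio_emb)
```

Here the first pair is perfectly aligned and the second is orthogonal, so the average lands halfway between 0 and 1. Higher scores indicate audio that matches its text description more closely.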

	#### Chroma cosine similarity

	Finally, as introduced in MusicGen, we provide a Chroma Cosine Similarity metric in PyTorch.
	No specific requirement is needed for this metric. Please activate the metric at the
	evaluation stage with the appropriate flag:

	```shell
	dora run <...> evaluate.metrics.chroma_cosine=true
	```

	#### Comparing against reconstructed audio

	For all the above audio generation metrics, we offer the option to compute the metric on the reference audio
	reconstructed by EnCodec, instead of on the generated sample, by setting the flag `<metric>.use_gt=true`.

	## Example usage

	You will find example configurations for the different metrics introduced above in:
	* The [MusicGen default solver](../config/solver/musicgen/default.yaml) for all audio generation metrics
	* The [compression default solver](../config/solver/compression/default.yaml) for all audio synthesis metrics

	Similarly, we provide different examples in our grids:
	* [Evaluation with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)
	* [Evaluation with FAD and others](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)