Use STT to compare against original text

#7
by Pendrokar - opened

We could use STT / ASR to better test whether what is said matches the prompt text, and add the comprehensibility / WER score as an extra column in the Leaderboard.
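
For reference, a rough sketch of how the WER part could be computed once a transcript exists. It assumes the transcript comes from whichever ASR model ends up being used (that step is not shown), and the `normalize` and `wer` helpers are hypothetical names for illustration, not anything from the Arena codebase. WER here is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the prompt.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of an ASR transcript (hypothesis) against the prompt (reference)."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (single-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    prompt = "The quick brown fox"
    transcript = "the quick fox"  # assumed ASR output for the synthesized audio
    print(f"WER: {wer(prompt, transcript):.2f}")
```

Note that the result depends heavily on the normalization step (numbers, abbreviations, casing), so the leaderboard column would only be comparable across models if every sample goes through the same ASR model and the same normalization.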

Not sure if this should pop up before or after voting.

TTS-AGI stance:
The field of speech synthesis has long lacked an accurate method to measure the quality of different models. Objective metrics like WER (word error rate) are unreliable measures of model quality, and subjective measures such as MOS (mean opinion score) are typically small-scale experiments conducted with few listeners. As a result, these measurements are generally not useful for comparing two models of roughly similar quality.

It would also allow evaluating each output audio sample so that it can then be safely uploaded as part of a synthetic audio dataset. Detoxify could also be run on that text. 🤔
