F5-TTS & E2-TTS: Zero-Shot Voice Cloning (Unofficial Demo)
More advanced and challenging multi-task evaluation
A tiny vision-language model