
inference without voice cloning

#9
by gqd - opened

Hey

  • All the examples show how to produce output with a speaker voice
  • Wondering if it's possible to fine-tune on a speaker voice and then run inference without passing a reference sample, to reduce latency?

Thx

gqd changed discussion title from "usage without voice cloning" to "inference without voice cloning"
Coqui.ai org

Once you have calculated the latents, you can pass the same latents to inference thereafter; that reduces inference time.
Please check the code at https://huggingface.co/spaces/coqui/xtts/blob/main/app.py#L233
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=speaker_wav, gpt_cond_len=30, max_ref_length=60)
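
To make the workflow concrete, here is a minimal sketch of precomputing and reusing the latents with the Xtts class from the TTS package: compute the conditioning latents once, optionally cache them to disk, and pass the same tensors to every subsequent inference call so the reference wav is never reprocessed. The checkpoint paths, speaker wav, and cache filename below are placeholders.

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load a local XTTS-v2 checkpoint (placeholder paths).
config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()

# One-time cost: compute conditioning latents from the reference audio.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="speaker.wav", gpt_cond_len=30, max_ref_length=60
)

# Optional: cache the latents so later runs never touch the reference wav.
torch.save(
    {"gpt_cond_latent": gpt_cond_latent, "speaker_embedding": speaker_embedding},
    "speaker_latents.pt",
)

# Every subsequent request reuses the same latents; only the text changes.
out = model.inference(
    "This sentence is synthesized without reprocessing the reference audio.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
# XTTS-v2 outputs 24 kHz audio.
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)

With this pattern the per-request latency is just the text-to-audio generation itself; the latent computation happens once per speaker, offline.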

Hey @gorkemgoknar

Is it possible to fine-tune and run inference with coqui/XTTS-v2 entirely as a single-speaker model, to remove the additional latency of using the latents?

Or wouldn't that make much of a difference when using precomputed latents, as you suggested?

Thx
