# TTS 赋予数字人真实的语音交互能力 ## Edge-TTS Edge-TTS是一个Python库,它使用微软的Azure Cognitive Services来实现文本到语音转换(TTS)。 该库提供了一个简单的API,可以将文本转换为语音,并且支持多种语言和声音。要使用Edge-TTS库,首先需要安装上Edge-TTS库,安装直接使用pip 进行安装即可。 ```bash pip install -U edge-tts ``` > 如果想更细究使用方式,可参考[https://github.com/rany2/edge-tts](https://github.com/rany2/edge-tts) 根据源代码,我编写了一个 `EdgeTTS` 的类,能够更好的使用,并且增加了保存字幕文件的功能,能增加体验感 ```python class EdgeTTS: def __init__(self, list_voices = False, proxy = None) -> None: voices = list_voices_fn(proxy=proxy) self.SUPPORTED_VOICE = [item['ShortName'] for item in voices] self.SUPPORTED_VOICE.sort(reverse=True) if list_voices: print(", ".join(self.SUPPORTED_VOICE)) def preprocess(self, rate, volume, pitch): if rate >= 0: rate = f'+{rate}%' else: rate = f'{rate}%' if pitch >= 0: pitch = f'+{pitch}Hz' else: pitch = f'{pitch}Hz' volume = 100 - volume volume = f'-{volume}%' return rate, volume, pitch def predict(self,TEXT, VOICE, RATE, VOLUME, PITCH, OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt', words_in_cue = 8): async def amain() -> None: """Main function""" rate, volume, pitch = self.preprocess(rate = RATE, volume = VOLUME, pitch = PITCH) communicate = Communicate(TEXT, VOICE, rate = rate, volume = volume, pitch = pitch) subs: SubMaker = SubMaker() sub_file: Union[TextIOWrapper, TextIO] = ( open(OUTPUT_SUBS, "w", encoding="utf-8") ) async for chunk in communicate.stream(): if chunk["type"] == "audio": # audio_file.write(chunk["data"]) pass elif chunk["type"] == "WordBoundary": # print((chunk["offset"], chunk["duration"]), chunk["text"]) subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"]) sub_file.write(subs.generate_subs(words_in_cue)) await communicate.save(OUTPUT_FILE) # loop = asyncio.get_event_loop_policy().get_event_loop() # try: # loop.run_until_complete(amain()) # finally: # loop.close() asyncio.run(amain()) with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file: vtt_lines = file.readlines() # 去掉每一行文字中的空格 vtt_lines_without_spaces = [line.replace(" ", "") if "-->" not in line else line for line in vtt_lines] # print(vtt_lines_without_spaces) with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file: output_file.writelines(vtt_lines_without_spaces) return OUTPUT_FILE, OUTPUT_SUBS ``` 同时在`src`文件夹下,写了一个简易的`WebUI` ```bash python app.py ``` ![TTS](../docs/TTS.png) ## PaddleTTS 在实际使用过程中,可能会遇到需要离线操作的情况。由于Edge TTS需要在线环境才能生成语音,因此我们选择了同样开源的PaddleSpeech作为文本到语音(TTS)的替代方案。虽然效果可能有所不同,但PaddleSpeech支持离线操作。更多信息可参考PaddleSpeech的GitHub页面:[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)。 ### 声码器说明 PaddleSpeech预置了三种声码器:【PWGan】【WaveRnn】【HifiGan】。这三种声码器在音质和生成速度上有较大差异,用户可根据需求进行选择。我们建议仅使用前两种声码器,因为WaveRNN的生成速度非常慢。 | 声码器 | 音频质量 | 生成速度 | | :-----: | :------: | :----------------: | | PWGan | 中等 | 中等 | | WaveRnn | 高 | 非常慢(耐心等待) | | HifiGan | 低 | 快 | ### TTS数据集 PaddleSpeech中的样例主要按数据集分类,我们主要使用的TTS数据集有: - CSMCS(普通话单发音人) - AISHELL3(普通话多发音人) - LJSpeech(英文单发音人) - VCTK(英文多发音人) ### PaddleSpeech的TTS模型映射 PaddleSpeech的TTS模型与以下模型相对应: - tts0 - Tacotron2 - tts1 - TransformerTTS - tts2 - SpeedySpeech - tts3 - FastSpeech2 - voc0 - WaveFlow - voc1 - Parallel WaveGAN - voc2 - MelGAN - voc3 - MultiBand MelGAN - voc4 - Style MelGAN - voc5 - HiFiGAN - vc0 - Tacotron2 Voice Clone with GE2E - vc1 - FastSpeech2 Voice Clone with GE2E ### 预训练模型列表 以下是PaddleSpeech提供的可通过命令行和Python API使用的预训练模型列表: #### 声学模型 | 模型 | 语言 | | :--------------------------- | :----: | | speedyspeech_csmsc | zh | | fastspeech2_csmsc | zh | | fastspeech2_ljspeech | en | | fastspeech2_aishell3 | zh | | fastspeech2_vctk | en | | fastspeech2_cnndecoder_csmsc | zh | | fastspeech2_mix | mix | | tacotron2_csmsc | zh | | tacotron2_ljspeech | en | | fastspeech2_male | zh | | fastspeech2_male | en | | fastspeech2_male | mix | | fastspeech2_canton | canton | #### 声码器 | 模型 | 语言 | | :----------------- | :--: | | pwgan_csmsc | zh | | pwgan_ljspeech | en | | pwgan_aishell3 | zh | | pwgan_vctk | en | | mb_melgan_csmsc | zh | | style_melgan_csmsc | zh | | hifigan_csmsc | zh | | hifigan_ljspeech | en | | hifigan_aishell3 | zh | | hifigan_vctk | en | | wavernn_csmsc | zh | | pwgan_male | zh | | hifigan_male | zh | 根据PaddleSpeech,我编写了一个 `PaddleTTS` 的类,能够更好的使用和运行结果 ```python class PaddleTTS: def __init__(self) -> None: pass def predict(self, text, am, voc, spk_id = 174, lang = 'zh', male=False, save_path = 'output.wav'): self.tts = TTSExecutor() use_onnx = True voc = voc.lower() am = am.lower() if male: assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'" wav_file = self.tts( text = text, output = save_path, am='fastspeech2_male', voc= voc + '_male', lang=lang, use_onnx=use_onnx ) return wav_file assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'" # 混合中文英文语音合成 if lang == 'mix': # mix只有fastspeech2 am = 'fastspeech2_mix' voc += '_csmsc' # 英文语音合成 elif lang == 'en': am += '_ljspeech' voc += '_ljspeech' # 中文语音合成 elif lang == 'zh': assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], "voc must be 'wavernn' or 'pwgan' or 'hifigan' or 'style_melgan' or 'mb_melgan'" am += '_csmsc' voc += '_csmsc' elif lang == 'canton': am = 'fastspeech2_canton' voc = 'pwgan_aishell3' spk_id = 10 print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id) try: cmd = f'paddlespeech tts --am {am} --voc {voc} --input "{text}" --output {save_path} --lang {lang} --spk_id {spk_id} --use_onnx {use_onnx}' os.system(cmd) wav_file = save_path except: # 语音合成 wav_file = self.tts( text = text, output = save_path, am = am, voc = voc, lang = lang, spk_id = spk_id, use_onnx=use_onnx ) return wav_file ```