Grapheme-to-Phoneme Model
Hi, thanks for this fantastic work! I'm curious to know how you converted the transcriptions in MCV into phonemes. Could you share a bit about the process?
Kindly ping @zinc75
Dear Bofeng,
Thanks for your interest in our work!
After investigating several solutions for the G2P task (including fairly recent neural approaches), we used bootphon/phonemizer with the EspeakBackend on all the text data of CommonVoice (GitHub: bootphon/phonemizer) before generating the vocab.json needed by the wav2vec2 tokenizer.
This can easily be extended to any language or dataset with text transcriptions, as long as the phonemizer backend is accurate enough not to introduce errors in the grapheme-to-phoneme conversion.
Hope this helps,
Best regards,
Éric Bavu
Senior Researcher / Full Professor - Acoustics
Cnam/LMSSC
Thanks for your detailed response, Eric!
I've also experimented with phonemizer and found its results much more precise than other tools such as epitran (https://github.com/dmort27/epitran). However, it takes a bit more time. I'll launch it on my dataset to see if the running time is acceptable.
@bofenghuang: if you want to reduce the time footprint of G2P data preparation, you can use the map function of :huggingface: datasets with the batched=True option, and prepare the data once and for all before training, creating a dataset with a phonetic transcription attribute added alongside the audio and text transcription.
It doesn't take long to go through CommonVoice 13 fr, for example, which corresponds to 2.5k hours of audio (and since this is done once before training, it doesn't slow down the training process).
Best Regards,
Éric Bavu
Senior Researcher / Full Professor - Acoustics
Cnam/LMSSC