A FULL TRAINING PATHWAY?
Will you design a full training pathway, using the Hugging Face Trainer or Datasets?
When we train, are we essentially giving the model the inputs (text) and the outputs (sound)?
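Essentially yes. Here is a toy sketch of that supervision signal: a single weight stands in for the network, and squared error stands in for the real diffusion/spectrogram loss. None of this reflects the actual model's internals; it only shows the text-in, sound-out training idea.

```python
# Toy illustration of supervised training: the model sees a text feature
# and is pushed, by gradient descent, towards the target audio value.
# Everything here is a stand-in: real TTS models use neural nets and
# spectrogram/waveform targets, not a single scalar weight.

def train_step(weight, text_feature, target_audio, lr=0.1):
    """One step of gradient descent on a squared-error loss."""
    prediction = weight * text_feature       # "generate" audio from text
    error = prediction - target_audio        # compare with the audiobook audio
    loss = error ** 2
    grad = 2 * error * text_feature          # d(loss)/d(weight)
    return weight - lr * grad, loss

def train(pairs, weight=0.0, epochs=50):
    """Repeatedly fit (text_feature, target_audio) pairs; loss should fall."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in pairs:
            weight, loss = train_step(weight, x, y)
            total += loss
        history.append(total)
    return weight, history

pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pretend text -> audio pairs
w, history = train(pairs)
print(round(w, 3), history[0] > history[-1])
```

The Hugging Face Trainer automates exactly this loop (batching, gradients, checkpointing) over a real model and a real dataset.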
Can we train the model with audiobooks?
i.e. the text in and the audiobook as output (the model should try to diffuse its way towards this output)... then we can train the model on many voices, male readers and female readers, as some books on Gutenberg and the Internet Archive have multiple speakers. Hence after intensive training we could essentially generate some form of voice, male or female, old or young?
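Assuming the audiobook has already been split into per-sentence clips (the forced-alignment step is the hard part in practice), building the training examples could look like this minimal sketch. The file names and the speaker tag are made up for illustration:

```python
# Hypothetical sketch: turning an audiobook into (text, audio, speaker)
# training examples. The clip paths are assumptions; a real pipeline would
# produce them via forced alignment of the text against the recording.

def build_examples(sentences, clip_paths, speaker):
    """Pair each sentence with its aligned audio clip and tag the speaker."""
    if len(sentences) != len(clip_paths):
        raise ValueError("need one audio clip per sentence (forced alignment)")
    return [
        {"text": s, "audio": p, "speaker": speaker}
        for s, p in zip(sentences, clip_paths)
    ]

examples = build_examples(
    ["Call me Ishmael.", "Some years ago..."],
    ["moby_dick_000.wav", "moby_dick_001.wav"],  # hypothetical clip files
    speaker="male_reader_01",
)
print(examples[0])
```

Tagging each example with a speaker ID is what would later let the model condition its output on a male/female or old/young voice.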
Hence we could train specific phonetics for certain extinct languages, and given the correct phonetic input get the correct output words, i.e. we could hear the actual languages of the ancients with the voices we have trained, as those voices (if not overfit) would also become part of its generative voice ability. At present I would expect the model could be trained on a wide range of sounds and their sub-sounds, as well as phonetic alphabets. Hence by training phonetics as well, it would have the smaller individual components to construct these sounds using whatever shape it needs.
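A hedged sketch of what phonetic input might look like: a tiny made-up lexicon maps words to IPA phoneme tokens, so the model trains on sound components rather than spellings. Real pipelines use a full pronunciation lexicon or a grapheme-to-phoneme tool; this mini-lexicon is purely illustrative.

```python
# Map words to phoneme sequences so training targets sounds, not spellings.
# LEXICON is a made-up two-word example; an extinct language would need a
# reconstructed phonetic lexicon prepared by linguists.

LEXICON = {  # hypothetical word -> IPA phonemes
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

def to_phonemes(text):
    """Convert text into a flat phoneme token sequence with word boundaries."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(LEXICON.get(word, ["<unk>"]))  # flag unknown words
        tokens.append("<sp>")                        # word-boundary marker
    return tokens

print(to_phonemes("Hello world"))
```

Feeding sequences like this as the "text" side of the (input, audio) pairs is how phoneme-level control would enter training.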
Because this is a sound generation model, these use cases will expand to creating PechaKuchas for documents and tables. So given a table (dissected by BLIP), give some representation of the data in sound (colours as frequencies, etc.)... all types of experiments:
The only extension for the next model is to be able to add sound to the input, so the sound can be merged with the output sound, perhaps as a multitrack, or something more intelligent?
Great work by the way ~~ well done!
One ISSUE --- why does the library insist on downloading the model? Why did I have to patch the source code to be able to load a local model??
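On the local-loading issue: the usual Hugging Face convention is that `from_pretrained()` accepts either a hub repo ID or a local directory containing `config.json`. A small helper like the sketch below can route to the local checkpoint and avoid any network call; whether this library follows that convention is an assumption, and the directory name is made up.

```python
# Decide whether a model identifier points at a local checkpoint directory
# (the usual Hugging Face layout includes a config.json) or a hub repo id.
# This mirrors the convention of from_pretrained(); it is not this
# particular library's documented API.

import os

def resolve_model_source(name_or_path):
    """Return ('local', path) if a checkpoint dir exists, else ('hub', id)."""
    if os.path.isdir(name_or_path) and os.path.exists(
        os.path.join(name_or_path, "config.json")
    ):
        return ("local", name_or_path)
    return ("hub", name_or_path)

# e.g. kind, src = resolve_model_source("./my_local_model")  # hypothetical dir
```

If the library exposes a `local_files_only=True` style flag (as `transformers` does), that is the cleaner fix than patching source.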
Now that I have the model local and can load it, I would like a good training pathway! Please:
Voices generated may be robotic and maybe not! But with more training on audiobooks and conversations (with very good subtitles) we can train the model for many types of recognition, even fantasy, i.e. the sound of the TARDIS, the sound of a Cylon, or the Klingon language. As we have the translations in subtitles, we have the expected output, and we can train until the loss is reduced; then we know the model can truly represent the sound, or something similar:
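One simple way to decide when "the loss is reduced" enough is a plateau check on validation loss: stop once it has not improved for a few evaluations in a row. The numbers below are illustrative only, not from any real run.

```python
# Early-stopping sketch: stop training when the last `patience` validation
# losses failed to improve on the earlier best by at least `min_delta`.
# Thresholds are illustrative; real values depend on the model and data.

def should_stop(losses, patience=3, min_delta=0.01):
    """True if the last `patience` losses show no meaningful improvement."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta

print(should_stop([2.0, 1.5, 1.2, 1.2, 1.2, 1.2]))   # plateaued
print(should_stop([2.0, 1.5, 1.2, 1.0, 0.8, 0.6]))   # still improving
```

The Hugging Face Trainer has an equivalent built in (`EarlyStoppingCallback`), which would be the idiomatic way to wire this into a full pathway.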
I'm not sure what your loss values were in training (what was acceptable to know the task was embedded, for each labelled or classified sample created):
How would we train it for a new task type? i.e. question and answer?