Model Inference Results in Distorted and Noisy Audio
Hello, I am experiencing some issues with the generated audio from the model during inference. Specifically:
The generated voice sounds quite distorted and lacks clarity, giving it a raspy quality.
There is a noticeable background noise that resembles electrical interference, which detracts from the overall audio quality.
I trained the model using the default configuration, with the following key parameters:
Number of training steps (n_steps): 100,000
Batch size (batch_size): 16
The dataset used for training is clean and free from any noise, so the distortion and noise appear to be introduced during the model's training or inference process.
Could you please provide any suggestions or insights on how to address these issues?
I'm pretty new to this and getting mixed results as I test things, but have you tried the following?
Adjusting training steps: is it possible you're over-training the model? The default is 20,000, so 5x is a big jump.
Adjusting batch size: in my limited experience this has a huge impact, with cleaner datasets benefiting from a lower batch size.
Remember you can watch how training progresses with TensorBoard by opening it with the output folder as the log directory! :D
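If it helps, this is roughly how I start it. It's just a sketch: "output/my_voice_run" is a placeholder for whatever your output folder is called, and the equivalent CLI is simply tensorboard --logdir <output_folder>:

```python
# Minimal sketch: point TensorBoard at the training output folder.
# "output/my_voice_run" is an example path -- use wherever your run writes
# its event files. Equivalent CLI: tensorboard --logdir output/my_voice_run
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "output/my_voice_run"])
url = tb.launch()  # serves the dashboard, usually at http://localhost:6006
print(f"TensorBoard running at {url}")
```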
Thank you for your reply, @Art3mas.
I've tried different training steps, ranging from 5,000 to 20,000, and also experimented with various batch sizes. However, the background noise after the voice conversion remains quite noticeable. The noise isn't extremely disruptive, but it doesn't sound as clean as the output from RVC. I'm trying to determine whether this background noise is unavoidable or if it's due to my dataset or some other factor.
Thank you for your suggestions and help! I really appreciate it.
Oh for sure. It will certainly be difficult to get the same level of fidelity that RVC has currently; keep in mind these are lightweight models that will run happily on a CPU, rather than requiring a GPU with heavy chunks of CPU usage on top of that.
I'd continue experimenting with steps, batch sizes, and perhaps the amount of training data you use, to find out whether more or less is helpful. Also keep in mind these models currently only use a 16 kHz sample rate for the input, which would probably mean some drop in quality if your dataset is recorded above that.
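If you want to take the sample-rate question out of the equation, something like this is how I'd pre-resample a dataset to 16 kHz before training. Just a sketch assuming a flat folder of WAV files and standard librosa/soundfile; the folder names are placeholders:

```python
# Sketch: resample every clip in dataset/raw to the 16 kHz the model expects.
# Folder names are examples only -- point these at your own dataset layout.
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 16000
src_dir = Path("dataset/raw")
dst_dir = Path("dataset/16k")
dst_dir.mkdir(parents=True, exist_ok=True)

for wav_path in src_dir.glob("*.wav"):
    audio, sr = librosa.load(wav_path, sr=None)  # keep the original rate
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    sf.write(dst_dir / wav_path.name, audio, TARGET_SR)
```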
An update as I've been testing this more: I think the pretrained checkpoint is heavily skewed towards Japanese, leaving the model sounding quite rough in other languages as far as I can tell. I'm curious whether that could be offset with a stronger learning rate and a particularly heavy dataset.
Yes, I agree that replacing the original pretrained checkpoint could yield better results. However, I haven’t explored how to implement that specific process. I appreciate your insights and look forward to hearing more about your findings.
I'm not sure either, to be honest. I would suspect some heavy modifications to main.py would be needed to skip loading a checkpoint and start with a much larger range for gradients, likely with early randomisation. I suspect you would also need an impressive dataset with verified samples, and more computing power than I can push on a 4070. If I can somehow get some sort of proof of concept working, I may see what the cost to compute this on a Colab machine would look like.
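To be clear, this is pure speculation on my part, but conceptually I imagine the change looks something like the following. None of this reflects the project's actual main.py; the function name and arguments are made up for illustration:

```python
# Hypothetical sketch of "skip the pretrained checkpoint and start fresh".
# The real main.py will differ; this only shows the general idea.
import torch
import torch.nn as nn


def init_weights(model: nn.Module, ckpt_path=None) -> nn.Module:
    """Load a pretrained checkpoint if given, otherwise re-randomise the model."""
    if ckpt_path is None:
        # Training from scratch: give weight matrices a fresh random init
        # instead of inheriting the (Japanese-skewed) pretrained weights.
        for p in model.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    else:
        model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    return model
```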
I think I might have noticed something? It should have been obvious to me before. When the program does conversions during training for the 'audio' tab of TensorBoard, it uses the samples included in the assets/test directory.
It seems pretty likely this has no effect and is just there for monitoring purposes, but even if it doesn't affect the end product, it'll be useful for testing.
At no point had I thought of swapping these.
I'm guessing you might have already done this, but either way I am running a test now. It would be interesting if this impacts the quality of the model, but I suspect it'll just be a useful tool to see if new issues are introduced mid-training.
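For reference, swapping them in is as simple as copying my own clips over the bundled ones. Just a sketch; "my_samples" is a placeholder for wherever your own test clips live:

```python
# Sketch: replace the bundled monitoring clips in assets/test with my own,
# so the TensorBoard 'audio' tab tracks the voices I actually care about.
import shutil
from pathlib import Path

my_samples = Path("my_samples")   # placeholder: your own test clips
test_dir = Path("assets/test")    # directory mentioned above

for wav in my_samples.glob("*.wav"):
    shutil.copy(wav, test_dir / wav.name)
```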
I'm also curious to know if it reveals anything new or unexpected. I've been tied up with other projects recently and haven't had much time to dedicate to this one. However, I'm very interested in any progress that's being made and would love to stay updated. Thanks again for running with this idea!
What I've learned so far:
On an NVIDIA RTX 4070, a batch size of 8 works best (averaging 1.5-2.5 it/sec when paired with an Intel Core i9-11900K CPU).
The slightly tinny, buzzy noise on some results can be resolved by reducing the learning rate and increasing the steps (to an extent; it's very dataset dependent).
The model handles Japanese very well, but I have not yet found any way to resolve issues with its handling of English, consonants in particular. Several parts of speech have a certain lisp-like quality in my experience.
The extent to which this impacts quality is very dependent on the converted audio. A studio-quality sample with clear enunciation and a low noise floor will be far less affected than a home recording with background noise.
I'm mainly trying to deal with that part, as it's the most annoying issue in my opinion: even where a decent overall sound is found, the speech gets ruined entirely.
Additional note on the last point:
I was looking through my logs on TensorBoard, the more extreme tests in particular, and I've noticed a trend when it comes to loss of speech fidelity. loss_g/loss_mel seems to align with the issue I'm noticing: on a test with an aggressive learning rate I saw it rise to a sharp peak before coming back down, and that peak lined up, step for step, with the most distorted-sounding files. Running more tests to find out what brings mel down the most may be the path to success.
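In case it's useful, this is roughly how I've been pulling that scalar out of the event files to find the step where it peaks, rather than eyeballing the chart; the log directory path is just an example:

```python
# Sketch: read loss_g/loss_mel from a run's event files and report the step
# where it peaks. "output/my_voice_run" is an example log directory.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("output/my_voice_run")
ea.Reload()  # parse the event files on disk

events = ea.Scalars("loss_g/loss_mel")
worst = max(events, key=lambda e: e.value)
print(f"loss_g/loss_mel peaked at step {worst.step} with value {worst.value:.4f}")
```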
I'm going to run a test with a lower learning rate to smooth out the process (I find this makes for a slightly more even-sounding model, but... who knows), where I've adjusted the gradient weighting in the config. The documentation mentions that the ratio is what matters, so I've interpreted this to mean that if I want to increase mel I should decrease adv and fm. For this test, though, I'm marginally increasing mel, and adv and fm seem to be pretty stable across most training sessions: 1.1 mel, 1.1 adv, and 0.8 fm.
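Written out, the weighting I'm trying is effectively the below; the actual config keys in the project may be named differently, and the values are just this test's:

```python
# Sketch of the generator loss weighting for this test; key names are
# illustrative and may not match the project's config exactly.
loss_weights = {
    "mel": 1.1,  # mel-spectrogram reconstruction loss
    "adv": 1.1,  # adversarial loss
    "fm": 0.8,   # feature-matching loss
}

def generator_loss(loss_mel: float, loss_adv: float, loss_fm: float) -> float:
    # Per the docs mentioned above, only the ratio between the terms matters.
    return (loss_weights["mel"] * loss_mel
            + loss_weights["adv"] * loss_adv
            + loss_weights["fm"] * loss_fm)
```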
I'll keep you posted!