---
language: sw
license: cc-by-sa-4.0
tags:
  - tensorflowtts
  - audio
  - text-to-speech
  - mel-to-wav
inference: false
datasets:
  - bookbot/OpenBible_Swahili
---

# MB-MelGAN HiFi PostNets SW v1

MB-MelGAN HiFi PostNets SW v1 is a mel-to-wav model based on the MB-MelGAN architecture with a HiFi-GAN discriminator. This model was trained from scratch on a synthetic audio dataset. Instead of training on ground-truth waveform spectrograms, it was trained on the PostNet spectrograms generated by LightSpeech MFA SW v1. The list of real speakers includes:

- sw-KE-OpenBible

This model was trained using the TensorFlowTTS framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All necessary scripts used for training can be found in this GitHub fork, together with the training metrics logged via TensorBoard.
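Because the vocoder learns from synthetic targets, the training mels have to be exported first. The following is a minimal, hypothetical sketch of dumping PostNet mel spectrograms from LightSpeech to `.npy` files for vocoder training; the function name, file layout, and dump loop are assumptions, and TensorFlowTTS's own preprocessing scripts differ.

```python
import numpy as np
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor

lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v1")

def dump_postnet_mel(text: str, speaker_name: str, path: str) -> None:
    """Run LightSpeech and save its PostNet mel spectrogram as .npy."""
    input_ids = processor.text_to_sequence(text)
    mel, _, _ = lightspeech.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, tf.int32), 0),
        speaker_ids=tf.convert_to_tensor(
            [processor.speakers_map[speaker_name]], tf.int32
        ),
        speed_ratios=tf.convert_to_tensor([1.0], tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], tf.float32),
    )
    np.save(path, mel[0].numpy())  # (frames, n_mels) array in "npy" format

dump_postnet_mel("Habari dunia.", "sw-KE-OpenBible", "sample-mel.npy")
```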

## Model

| Model                           | Config | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
| ------------------------------- | ------ | ------- | -------------- | -------------------- | ------ |
| `mb-melgan-hifi-postnets-sw-v1` | Link   | 44.1K   | 20-11025       | 2048 / 512 / None    | 1M     |

## Training Procedure

### Feature Extraction Setting

```yaml
sampling_rate: 44100
hop_size: 512 # Hop size.
format: "npy"
```
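For reference, the settings above correspond roughly to the following mel extraction. This is a sketch, assuming librosa: the mel-bin count (`n_mels=80`) and the log compression are assumptions not stated in this card, and the actual TensorFlowTTS preprocessing script includes additional steps such as normalization.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=44100)  # sampling_rate from the config
mel = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_fft=2048,       # FFT size from the model table
    hop_length=512,   # hop_size from the config
    win_length=None,  # defaults to n_fft, matching "Win: None"
    fmin=20,          # lower bound of the mel range (Hz)
    fmax=11025,       # upper bound of the mel range (Hz)
    n_mels=80,        # assumed; not stated in this card
)
log_mel = np.log10(np.maximum(mel, 1e-10)).T  # (frames, n_mels)
np.save("sample-mel.npy", log_mel)            # format: "npy"
```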
### Generator Network Architecture Setting

```yaml
model_type: "multiband_melgan_generator"

multiband_melgan_generator_params:
    out_channels: 4 # Number of output channels (number of subbands).
    kernel_size: 7 # Kernel size of initial and final conv layers.
    filters: 384 # Initial number of channels for conv layers.
    upsample_scales: [8, 4, 4] # List of upsampling scales.
    stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
    stacks: 4 # Number of stacks in a single residual stack module.
    is_weight_norm: false # Use weight norm or not.
```
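One way to read this config: the generator upsamples each mel frame by 8 × 4 × 4 = 128 samples per subband, and PQMF synthesis of the 4 subbands restores the remaining factor of 4, matching the hop size of 512. A quick sanity check:

```python
import numpy as np

upsample_scales = [8, 4, 4]  # per-subband upsampling factors
out_channels = 4             # number of PQMF subbands
hop_size = 512               # from the feature extraction setting

# 128 samples per subband * 4 subbands == 512 samples per mel frame
assert np.prod(upsample_scales) * out_channels == hop_size
```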
### Discriminator Network Architecture Setting

```yaml
multiband_melgan_discriminator_params:
    out_channels: 1 # Number of output channels.
    scales: 3 # Number of multi-scales.
    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
    downsample_pooling_params: # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3] # List of kernel sizes.
    filters: 16 # Number of channels of the initial conv layer.
    max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4] # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params: # Parameters of nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false # Use weight norm or not.

hifigan_discriminator_params:
    out_channels: 1 # Number of output channels.
    period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
    n_layers: 5 # Number of layers of each period discriminator.
    kernel_size: 5 # Kernel size.
    strides: 3 # Strides.
    filters: 8 # Initial conv filters of each period discriminator.
    filter_scales: 4 # Filter scales.
    max_filters: 512 # Maximum filters of the period discriminator's conv layers.
    is_weight_norm: false # Use weight norm or not.
```
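To make the period scales concrete, here is an illustrative sketch (shapes only, not TensorFlowTTS code) of how a HiFi-GAN period discriminator views the waveform: the signal is padded to a multiple of the period `p` and reshaped to 2-D, so the convolution stack sees every p-th sample in one column.

```python
import tensorflow as tf

def to_period_view(audio: tf.Tensor, period: int) -> tf.Tensor:
    """Reshape (B, T, 1) audio into (B, T'/period, period, 1)."""
    t = int(audio.shape[1])
    pad = (period - t % period) % period
    audio = tf.pad(audio, [[0, 0], [0, pad], [0, 0]])
    return tf.reshape(audio, [-1, (t + pad) // period, period, 1])

x = tf.random.normal([1, 8192, 1])   # one batch_max_steps-long crop
for p in [3, 5, 7, 11, 17, 23, 37]:  # period_scales from the config
    print(p, to_period_view(x, p).shape)
```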
### STFT Loss Setting

```yaml
stft_loss_params:
    fft_lengths: [1024, 2048, 512] # List of FFT sizes for STFT-based loss.
    frame_steps: [120, 240, 50] # List of hop sizes for STFT-based loss.
    frame_lengths: [600, 1200, 240] # List of window lengths for STFT-based loss.

subband_stft_loss_params:
    fft_lengths: [384, 683, 171] # List of FFT sizes for STFT-based loss.
    frame_steps: [30, 60, 10] # List of hop sizes for STFT-based loss.
    frame_lengths: [150, 300, 60] # List of window lengths for STFT-based loss.
```
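For intuition, each resolution of the multi-resolution STFT loss is commonly computed as spectral convergence plus log-magnitude distance, then averaged over the resolutions listed above. A minimal sketch, assuming that standard formulation (TensorFlowTTS's implementation may differ in details):

```python
import tensorflow as tf

def single_stft_loss(y, y_hat, frame_length, frame_step, fft_length):
    """Spectral convergence + log-magnitude loss at one STFT resolution."""
    s = tf.abs(tf.signal.stft(y, frame_length, frame_step, fft_length))
    s_hat = tf.abs(tf.signal.stft(y_hat, frame_length, frame_step, fft_length))
    sc = tf.norm(s - s_hat) / (tf.norm(s) + 1e-7)
    log_mag = tf.reduce_mean(tf.abs(tf.math.log(s + 1e-7) - tf.math.log(s_hat + 1e-7)))
    return sc + log_mag

y = tf.random.normal([1, 8192])      # reference waveform (placeholder)
y_hat = tf.random.normal([1, 8192])  # generated waveform (placeholder)
losses = [
    single_stft_loss(y, y_hat, frame_length=fl, frame_step=fs, fft_length=fft)
    for fft, fs, fl in zip([1024, 2048, 512], [120, 240, 50], [600, 1200, 240])
]
multi_resolution_loss = tf.add_n(losses) / len(losses)
```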
### Adversarial Loss Setting

```yaml
lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss.
lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.
```
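These coefficients weight the non-reconstruction terms of the generator objective. As a rough sketch of how they might combine (the exact term structure lives in the TensorFlowTTS MB-MelGAN trainer and may differ):

```python
# Placeholder loss values, for illustration only.
stft_loss_full, stft_loss_subband = 0.8, 0.7
adversarial_loss, feature_match_loss = 0.1, 0.05

lambda_adv, lambda_feat_match = 2.5, 10.0
loss_generator = (
    0.5 * (stft_loss_full + stft_loss_subband)  # multi-resolution STFT terms
    + lambda_adv * adversarial_loss             # fool the discriminators
    + lambda_feat_match * feature_match_loss    # match discriminator features
)
```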
### Data Loader Setting

```yaml
batch_size: 32 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
eval_batch_size: 16
batch_max_steps: 8192 # Length of each audio clip in a training batch. Must be divisible by hop_size.
batch_max_steps_valid: 8192 # Length of each audio clip for validation. Must be divisible by hop_size.
remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
allow_cache: false # Whether to allow caching in the dataset. If true, it requires CPU memory.
is_shuffle: false # Whether to shuffle the dataset after each epoch.
```
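The divisibility requirement exists because each audio crop must align with a whole number of mel frames:

```python
batch_max_steps, hop_size = 8192, 512
assert batch_max_steps % hop_size == 0
print(batch_max_steps // hop_size)  # each 8192-sample crop pairs with 16 mel frames
```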
### Optimizer & Scheduler Setting

```yaml
generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
        values:
            [
                0.0005,
                0.0005,
                0.00025,
                0.000125,
                0.0000625,
                0.00003125,
                0.000015625,
                0.000001,
            ]
    amsgrad: false

discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

gradient_accumulation_steps: 1
```
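The `lr_fn` above maps directly onto the Keras schedule of the same name. A sketch for the generator side (the Adam optimizer is an assumption based on the `amsgrad` flag; the trainer wires this up internally):

```python
import tensorflow as tf

lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[100000, 200000, 300000, 400000, 500000, 600000, 700000],
    values=[5e-4, 5e-4, 2.5e-4, 1.25e-4, 6.25e-5, 3.125e-5, 1.5625e-5, 1e-6],
)
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=lr, amsgrad=False)
```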
### Interval Setting

```yaml
discriminator_train_start_steps: 200000 # Step at which to start training the discriminator.
train_max_steps: 1000000 # Number of training steps.
save_interval_steps: 20000 # Interval steps to save checkpoints.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
```
### Other Setting

```yaml
num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.
```

## How to Use

```python
import soundfile as sf
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor

# Load the text-to-mel model, its processor, and this mel-to-wav model.
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v1")

text, speaker_name = "Hello World.", "sw-KE-OpenBible"
input_ids = processor.text_to_sequence(text)

# Run LightSpeech to generate a mel spectrogram from the input text.
mel, _, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Vocode the mel spectrogram to a waveform and save it as 16-bit PCM.
audio = mb_melgan.inference(mel)[0, :, 0]
sf.write("./audio.wav", audio, 44100, "PCM_16")
```
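The snippet writes a mono, 16-bit, 44.1 kHz WAV file. In a notebook you can audition the result directly (a usage sketch, assuming IPython is available):

```python
from IPython.display import Audio

Audio("./audio.wav")
```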

## Disclaimer

Do consider the biases that come from the training dataset, as they may carry over into the results of this model.

## Authors

MB-MelGAN HiFi PostNets SW v1 was trained and evaluated by David Samuel Setiawan and Wilson Wongso. All computation and development were done on Scaleway.

## Framework versions

- TensorFlowTTS 1.8
- TensorFlow 2.7.0