Glow TTS#

Glow TTS is a normalizing flow model for text-to-speech. It is built on the generic Glow model that was previously used in computer vision and vocoder models. It uses “monotonic alignment search” (MAS) to find the text-to-speech alignment and uses the result to train a separate duration predictor network for faster inference.
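The alignment search mentioned above can be illustrated with a small dynamic program. The following is a plain-Python sketch of the idea, not the library's implementation (which is assumed to operate on batched tensors). The names `monotonic_alignment_search` and `durations_from_path` are hypothetical, and `log_p[i][j]` is assumed to hold the log-likelihood of mel frame `j` under text token `i`.

```python
import math


def monotonic_alignment_search(log_p):
    """Map each mel frame to a text token along the best monotonic path.

    log_p is a [T_text][T_mel] table of log-likelihoods (an assumed layout).
    The path starts at token 0, ends at the last token, and at each frame
    either stays on the same token or advances by exactly one token.
    """
    t_text, t_mel = len(log_p), len(log_p[0])
    if t_mel < t_text:
        raise ValueError("need at least one mel frame per text token")
    neg_inf = -math.inf
    # q[i][j]: best cumulative score of any valid path ending at (token i, frame j).
    q = [[neg_inf] * t_mel for _ in range(t_text)]
    q[0][0] = log_p[0][0]
    for j in range(1, t_mel):
        for i in range(min(j + 1, t_text)):
            stay = q[i][j - 1]
            move = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, move) + log_p[i][j]
    # Backtrack from the last token at the last frame.
    path = [0] * t_mel
    i = t_text - 1
    for j in range(t_mel - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, t_text):
    """Count how many frames each token received (duration predictor targets)."""
    return [path.count(i) for i in range(t_text)]
```

The per-token frame counts returned by `durations_from_path` correspond to the kind of targets a separate duration predictor can be trained on, so that inference no longer needs the search.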

Important resources & papers#

GlowTTS Config#

class TTS.tts.configs.glow_tts_config.GlowTTSConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.001, optimizer='RAdam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='glow_tts', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, num_chars=None, encoder_type='rel_pos_transformer', encoder_params=<factory>, use_encoder_prenet=True, hidden_channels_enc=192, 
hidden_channels_dec=192, hidden_channels_dp=256, dropout_p_dp=0.1, dropout_p_dec=0.05, mean_only=True, out_channels=80, num_flow_blocks_dec=12, inference_noise_scale=0.0, kernel_size_dec=5, dilation_rate=1, num_block_layers=4, num_speakers=0, c_in_channels=0, num_splits=4, num_squeeze=2, sigmoid_scale=False, d_vector_dim=0, data_dep_init_steps=10, style_wav_for_test=None, length_scale=1.0, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_file=False, min_seq_len=3, max_seq_len=500, r=1)[source]#

Defines parameters for GlowTTS model.


>>> from TTS.tts.configs.glow_tts_config import GlowTTSConfig
>>> config = GlowTTSConfig()
  • model (str) – Model name used for selecting the right model at initialization. Defaults to glow_tts.

  • encoder_type (str) – Type of the encoder used by the model. Look at TTS.tts.layers.glow_tts.encoder for more details. Defaults to rel_pos_transformer.

  • encoder_params (dict) – Parameters used to define the encoder network. Look at TTS.tts.layers.glow_tts.encoder for more details. Defaults to {“kernel_size”: 3, “dropout_p”: 0.1, “num_layers”: 6, “num_heads”: 2, “hidden_channels_ffn”: 768}

  • use_encoder_prenet (bool) – enable / disable the use of a prenet for the encoder. Defaults to True.

  • hidden_channels_enc (int) – Number of base hidden channels used by the encoder network. It defines the input and the output channel sizes, and for some encoder types internal hidden channels sizes too. Defaults to 192.

  • hidden_channels_dec (int) – Number of base hidden channels used by the decoder WaveNet network. Defaults to 192 as in the original work.

  • hidden_channels_dp (int) – Number of layer channels of the duration predictor network. Defaults to 256 as in the original work.

  • mean_only (bool) – If true predict only the mean values by the decoder flow. Defaults to True.

  • out_channels (int) – Number of channels of the model output tensor. Defaults to 80.

  • num_flow_blocks_dec (int) – Number of decoder blocks. Defaults to 12.

  • inference_noise_scale (float) – Noise scale used at inference. Defaults to 0.0.

  • kernel_size_dec (int) – Decoder kernel size. Defaults to 5.

  • dilation_rate (int) – Rate to increase dilation by each layer in a decoder block. Defaults to 1.

  • num_block_layers (int) – Number of decoder layers in each decoder block. Defaults to 4.

  • dropout_p_dec (float) – Dropout rate for decoder. Defaults to 0.05.

  • num_speakers (int) – Number of speakers, used to define the size of the speaker embedding layer. Defaults to 0.

  • c_in_channels (int) – Number of speaker embedding channels. It is set to 512 if embeddings are learned. Defaults to 0.

  • num_splits (int) – Number of split levels in the invertible conv1x1 operation. Defaults to 4.

  • num_squeeze (int) – Number of squeeze levels. When squeezing, the number of channels increases and the number of time steps decreases by the factor ‘num_squeeze’. Defaults to 2.

  • sigmoid_scale (bool) – enable/disable sigmoid scaling in decoder. Defaults to False.

  • mean_only – If True, encoder only computes mean value and uses constant variance for each time step. Defaults to true.

  • encoder_type – Encoder module type. Possible values are [“rel_pos_transformer”, “gated_conv”, “residual_conv_bn”, “time_depth_separable”]. Check TTS.tts.layers.glow_tts.encoder for more details. Defaults to rel_pos_transformer as in the original paper.

  • encoder_params – Encoder module parameters. Defaults to None.

  • d_vector_dim (int) – Channels of external speaker embedding vectors. Defaults to 0.

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization, which pre-computes normalization stats at the beginning and uses the same values for the rest of the training. Defaults to 10.

  • style_wav_for_test (str) – Path to the wav file used for changing the style of the speech. Defaults to None.

  • inference_noise_scale – Variance used for sampling the random noise added to the decoder’s input at inference. Defaults to 0.0.

  • length_scale (float) – Multiply the predicted durations with this value to change the speech speed. Defaults to 1.

  • use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – enable / disable using external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to False.

  • noam_schedule (bool) – enable / disable the use of Noam LR scheduler. Defaults to False.

  • warmup_steps (int) – Number of warm-up steps for the Noam scheduler. Defaults to 4000.

  • lr (float) – Initial learning rate. Defaults to 1e-3.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • min_seq_len (int) – Minimum input sequence length to be used at training.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
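As a rough illustration of the num_squeeze parameter listed above, the sketch below folds groups of consecutive frames into the channel axis for a single utterance stored as nested lists. The helper names `squeeze` and `unsqueeze` and the exact channel ordering are assumptions made for this example; the library's own layers work on batched tensors and may order channels differently.

```python
def squeeze(x, num_squeeze=2):
    """Fold frames into channels: [C][T] -> [C * n][T // n].

    Trailing frames that do not fill a complete group are dropped.
    """
    n = num_squeeze
    t = (len(x[0]) // n) * n
    # Channel block k holds the k-th frame of every group, for all channels.
    return [[row[j + k] for j in range(0, t, n)] for k in range(n) for row in x]


def unsqueeze(x, num_squeeze=2):
    """Inverse of squeeze: [C * n][T // n] -> [C][T]."""
    n = num_squeeze
    c = len(x) // n
    t_sq = len(x[0])
    return [
        [x[k * c + ch][j] for j in range(t_sq) for k in range(n)]
        for ch in range(c)
    ]


mel = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]  # 2 channels, 5 frames
squeezed = squeeze(mel, num_squeeze=2)          # 4 channels, 2 frames
restored = unsqueeze(squeezed, num_squeeze=2)   # 2 channels, 4 frames
```

With num_squeeze=2 the channel count doubles and the frame count halves, which is why input lengths are typically trimmed to a multiple of the squeeze factor.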

GlowTTS Model#

class TTS.tts.models.glow_tts.GlowTTS(config, ap=None, tokenizer=None, speaker_manager=None)[source]#

GlowTTS model.


Paper abstract:

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.

Check TTS.tts.configs.glow_tts_config.GlowTTSConfig for class arguments.


Init only model layers.

>>> from TTS.tts.configs.glow_tts_config import GlowTTSConfig
>>> from TTS.tts.models.glow_tts import GlowTTS
>>> config = GlowTTSConfig(num_chars=2)
>>> model = GlowTTS(config)

Fully init a model ready for action. All the class attributes and class members (e.g. Tokenizer, AudioProcessor, etc.) are initialized internally based on config values.

>>> from TTS.tts.configs.glow_tts_config import GlowTTSConfig
>>> from TTS.tts.models.glow_tts import GlowTTS
>>> config = GlowTTSConfig()
>>> model = GlowTTS.init_from_config(config, verbose=False)
static compute_outputs(attn, o_mean, o_log_scale, x_mask)[source]#

Compute and format the model outputs with the given alignment map.
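One way to picture what applying an alignment map means: each decoder frame takes the statistics of the text token it is aligned to. The sketch below shows this for a single utterance with nested lists. The helper names are hypothetical, and the real method is assumed to work on batched tensors and to also handle `o_log_scale` and `x_mask`.

```python
def expand_with_alignment(attn, token_values):
    """Expand per-token values to per-frame values.

    attn: [T_en][T_de] hard alignment with 0/1 entries.
    token_values: [T_en][C] values predicted per text token (e.g. means).
    Returns a [T_de][C] list, one row per decoder frame.
    """
    t_en, t_de = len(attn), len(attn[0])
    channels = len(token_values[0])
    return [
        [sum(attn[i][j] * token_values[i][ch] for i in range(t_en))
         for ch in range(channels)]
        for j in range(t_de)
    ]


def durations_from_alignment(attn):
    """Number of decoder frames assigned to each text token."""
    return [sum(row) for row in attn]


attn = [[1, 1, 0], [0, 0, 1]]          # token 0 -> frames 0-1, token 1 -> frame 2
o_mean = [[0.5, -1.0], [2.0, 3.0]]     # two tokens, two channels
y_mean = expand_with_alignment(attn, o_mean)
durations = durations_from_alignment(attn)
```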

decoder_inference(y, y_lengths=None, aux_input={'d_vectors': None, 'speaker_ids': None})[source]#


  • y: \([B, T, C]\)

  • y_lengths: \(B\)

  • g: \([B, C]\) or \([B]\)

forward(x, x_lengths, y, y_lengths=None, aux_input={'d_vectors': None, 'speaker_ids': None})[source]#
  • x (torch.Tensor) – Input text sequence ids. \([B, T_en]\)

  • x_lengths (torch.Tensor) – Lengths of input text sequences. \([B]\)

  • y (torch.Tensor) – Target mel-spectrogram frames. \([B, T_de, C_mel]\)

  • y_lengths (torch.Tensor) – Lengths of target mel-spectrogram frames. \([B]\)

  • aux_input (Dict) – Auxiliary inputs. d_vectors is speaker embedding vectors for a multi-speaker model. \([B, D_vec]\). speaker_ids is speaker ids for a multi-speaker model using a speaker-embedding layer. \([B]\)


  • z: \([B, T_de, C]\)

  • logdet: \(B\)

  • y_mean: \([B, T_de, C]\)

  • y_log_scale: \([B, T_de, C]\)

  • alignments: \([B, T_en, T_de]\)

  • durations_log: \([B, T_en, 1]\)

  • total_durations_log: \([B, T_en, 1]\)

Return type:


inference_with_MAS(x, x_lengths, y=None, y_lengths=None, aux_input={'d_vectors': None, 'speaker_ids': None})[source]#

It’s similar to the teacher forcing in Tacotron. It was proposed in:


  • x: \([B, T]\)

  • x_lengths: \(B\)

  • y: \([B, T, C]\)

  • y_lengths: \(B\)

  • g: \([B, C]\) or \([B]\)

static init_from_config(config, samples=None, verbose=True)[source]#

Initiate model from config

  • config (GlowTTSConfig) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.

  • verbose (bool) – If True, print init messages. Defaults to True.


Init the speaker embedding layer if use_speaker_embedding is True and set the expected speaker embedding vector dimension to the encoder layer channel size. If the model uses d-vectors, it only sets the speaker embedding vector dimension to the d-vector dimension from the config.


config (Coqpit) – Model configuration.


Lock activation normalization layers.


Decide on every training step whether to enable or disable data-dependent initialization.


Generic test run for tts models used by Trainer.

You can override this for a different behaviour.


Test figures and audios to be projected to Tensorboard.

Return type:

Tuple[Dict, Dict]

train_step(batch, criterion)[source]#

A single training step. Forward pass and loss computation. Runs data-dependent initialization for the first config.data_dep_init_steps steps.

  • batch (dict) – [description]

  • criterion (nn.Module) – [description]
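The data-dependent initialization that runs during the first steps can be sketched with a toy activation normalization layer. This is an illustration under stated assumptions, not the library's code: the class `ActNormSketch` and the helper `should_run_data_dep_init` are hypothetical names, and the input is a single `[C][T]` nested list rather than a batched tensor.

```python
import math


def should_run_data_dep_init(step, data_dep_init_steps):
    """True while training is still inside the initialization window."""
    return step < data_dep_init_steps


class ActNormSketch:
    """Per-channel affine layer whose parameters are set from data once."""

    def __init__(self, channels):
        self.log_scale = [0.0] * channels
        self.bias = [0.0] * channels
        self.initialized = False  # "unlocked" until statistics are computed

    def initialize(self, x):
        # Choose scale and bias so this input has zero mean and unit variance.
        for ch, row in enumerate(x):
            mean = sum(row) / len(row)
            var = sum((v - mean) ** 2 for v in row) / len(row)
            self.log_scale[ch] = -0.5 * math.log(var + 1e-6)
            self.bias[ch] = -mean * math.exp(self.log_scale[ch])
        self.initialized = True  # "locked": later inputs reuse these values

    def __call__(self, x):
        if not self.initialized:
            self.initialize(x)
        return [
            [math.exp(s) * v + b for v in row]
            for row, s, b in zip(x, self.log_scale, self.bias)
        ]


layer = ActNormSketch(channels=1)
first = layer([[1.0, 3.0]])     # statistics are computed from this input
later = layer([[10.0, 10.0]])   # the stored statistics are reused unchanged
```

In this sketch, setting `initialized` back to False plays the role of unlocking the layer, which a training loop could do whenever `should_run_data_dep_init` returns True.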


Unlock activation normalization layers for data-dependent initialization.