Text to Speech Deep Learning Architectures

Small Intro. and Background

Recently, I started at Mozilla Research. I am really excited to be a part of a small but a great team working hard to solve important ML problems. And everything is open-sourced. We license things to make open-sourced. Oxymoron by the first sight, isn’t it. But I like it !!

Before my presence, our team already released the best known open-sourced STT (Speech to Text) implementation based on Tensorflow. The next step is to improve the current Baidu’s Deep Speech architecture and also implement a new TTS (Text to Speech) solution that complements the whole conversational AI agent. So after these two projects, anyone around the world will be able to create his own Alexa without any commercial attachment. Which is the real way to democratize AI, at least I believe it is.

Up until now, I worked on variety of data types and ML problems, except audio. Now it is time learn it. And the first thing to do is a comprehensive literature review (like a boss). Here I like to share the top notch DL architectures dealing with TTS (Text to Speech). I also invite you to our Github repository hosting PyTorch implementation of the first version implementation. (We switched to PyTorch for obvious reasons). It is a work in progress and please feel free to comment and contribute.

Below I like to share my pinpoint summary of the well-known TTS papers which are by no means complete but useful to highlight important aspects of these papers. Let’s start.

Glossary

Prosody:https://en.wikipedia.org/wiki/Prosody_(linguistics)
Phonemes : units of sounds, we pronounce as we speak. Necessary since very similar words in letter might be pronounced very differently (e.g. “Rough” “Though”)
Vocoder: part of the system decoding from features to audio signals. Wave is used in Deep Voice at that stage.
Fundamental Frequency - F0: lowest frequency of a periodic waveform describing the pitch of the sound.
Autoregressive Model:Specifies a model depending linearly on its own outputs and on a parameter set which can be approximated.
Query, Key, Value:Key is used by attention module to compute attention weights. Value is the vector stipulated by the attention weights to compute the module output. Query vector is the hidden state of the decoder.
Grapheme: Cool way to say character.
Error Modes: Sub-optimal status for the attention block where it is not able to escape.
Monotonic Attention: Use only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a certain relation btw the output at time t and the input at time t. However, it is not that reasonable for translation problem since words orders might no be the same. https://arxiv.org/pdf/1704.00784.pdf
MOS: Mean Opinion Score. Crows-source the evaluation process with native speakers. It is not easy to measure, especially for a layman.
Context vector: Output of an attention module which summarizes multiple time-step output of the encoder.
Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
Teacher Forcing:Providing model’s expected output at time t as a input at time t+1. It is controlled ground-truth feedback as a teacher does to a student.
Casual convolution: Convolution which does not foresee the future units given the reference time step T which we like to predict next. In practice, it is implemented by setting right padding orientation to to normal convolution layers.

Deep Voice (25 Feb 2017)

Text to phonemes. Deterministically computed with a dictionary. Or Seq2Seq model to deal with the unseen words.

The same phoneme might hold different durations in different words. We need to predict the duration. It is sequence depended.
Fundamental frequency for the pitch of the each phoneme. It is sequence depended.
Frequency + Phonemes + Duration = Voice synthesis. It is done via Google’s WaveNet.

Models

Segmentation Model

Segment audio signal to phonemes.
CTC loss
Predict phoneme pairs due to probability mass

Inputs:

Audio clip of â€œIt was early springâ€�

Phonemes (label)

[IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]**

Pairs of Phonemes with their start time

[(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
Segmentation model predictions are the labels for these models.

Inputs:

Outputs:

Duration, Probability, F0 for each phoneme; [H, 0.1, 25hz], …
Simplified WaveNet

Inputs:

Duration and F0 for phonemes + audio signals (labels)

Deep Voice 2 (24 May 2017)

Using WaveNet as vocoder (synthesis from features to audio signal)
Experiments with Tacotron using WaveNet

Speaker embedding

Provide speaker embedding as;

initial hidden state for RNNs
concatenate the embedding with input features for every RNN steps.
multiply layer activation with the embedding
Finding phoneme boundaries
Predict phoneme pairs as a seq2seq problem
Deep Voice 1 model + BN + residual connections
Labels are discretized buckets of log-scale duration.
CRF is used to employ the dependence btw phonemes while predicting the sequence duration.
Use Praat software to extract fundamental frequency
Input: upsampled features by the duration predicted by the duration model.
Output: normalized f => F0 computed by speaker specific mean and std.
Inputs: F0 and phonemes computed by Duration and Frequency models.
Output: Synthesis speeches
Speaker embedding
Separation of duration and frequency models

Deep Voice 3 (20 Oct 2017)

Fully convolutional
Text -> Mel-band spectrogram -> Audio
WaveNet > WORLD > Griffin-Lim by the measure of MOS
WaveNet 3x , WORLD 40X real-time in CPU
Results are comparable to Tacotron but DeepVoice 3 faster in training ??
Monotonic Attention
Complex text preprocessing - A tools called Gentle used to find pauses btw words.â€œEither way%you should shoot/very slowly%.â€� - long pause after ‘way’, short pause after ‘short’

Joint representation of Chars and Phonemes

Fixing pronunciation errors by using a Phoneme dict.
Character embedding
Phoneme embedding
Char + Phoneme embedding
Given the embedding compute trainable internal representations.

Input:

Char, Phoneme or Char + Phoneme embedding vector
Encoded internal representation separated into Key, Value vectors.
Monotonic attention
Query, Key, Value*
Fixed window Softmax
Positional Encoding (allows attention on sequence without RNN)

Input:

Attended feature vector + Previous decoder output
next R audio feature frames
binary final frame prediction
Spectogram to Audio
Wavenet, WORLD or Griffin-Lim
Fully convolutional
WORLD vocoder in deployment
No Separate Duration, Frequency, Segmentation models. End-2-End as it can get ğŸ™‚

At a high-level, Easy and End2End but so many nitty-gritty details under the hood.

Weight norm for conv layers hidden in Appendix
Junction of inputs; chars+phonemes
Speaker embedding - trained how and when ?
“Running inference with a TensorFlow graph turns out to be prohibitively expensive, averaging approximately 1 QPS 9.”

Tacotron (6 April 2017)

End2End
Faster than WaveNet
Character sequence => Audio Spectrogram => Synthesized Audio
Example Results : https://google.github.io/tacotron/publications/tacotron/index.html
Decoder predicts r frames one at a time

Faster training

Decoder is autoregressive

Training with every r-th ground truth frame of time t-1.
Inference by providing the r-th prediction at time t-1 for the time step t
CBHG module
input: Embedding vector
output: encoded representation
input: encoded representation, decoder query
output: context vector
input: context vector, previous r-th frame output
output: r spectrogram frames
CBHG module
input: decoded r spectrogram frames
output: general audio frame can be synthesized to waveform
50 ms frame length
12.5 ms frame shift ?
2048 points Fourier Transform
24 kHz sampling for all experiments
Adam optimizer
r=2 and r=5 also works well
LR 0.001 -> 0.0005 -> 0.0003 -> 0.0001 after 500K -> 1M -> 2M steps
L1 loss for decoder and post-processing net with equal weights
32 batch size with max-length padded sequences
Attention module might be changed with Monotonic Attention used in Deep Voice 3
Griffin-Lim can be switched to WaveNet for better precision, WORLD for faster results.
Very straight forward from a high-level to implement
No support for multi-speaker.
CBHG is we tried and it worked kind of thing.
However Tacotron 2 results are much better regarding the prosody and sound.

Tacotron 2 (16 Dec 2017)

Examples: https://google.github.io/tacotron/publications/tacotron2/index.html
WaveNet vocoder
Text -> Mel-spectrogram -> Audio
In lie of linguistic features of WaveNet, Tacotron uses intermediate representation used from characters.
Location sensitive attention

Parameters:

50ms frames, 12,5 ms frame shift, a Hann Window function **
“We transform the STFT magnitude to the mel-scale using an 80 channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by log dynamic range compression”
Normal convolution layers over CBHG modules.
WaveNet vocoder
Location sensitive attention to achieve monotonic change of attention weights.
GRU -> LSTM
Stop token prediction
MOS is not a good measure since test time recordings are not meant to be separate from the training set. The exact same sentence or very similar sequences might appear in the Test set. So the success of the model is mostly based on its memorization capabilities of the large dataset.
No speaker embedding.
All the dataset is a single person speech recorded by a professional.
They try to reduce WaveNet computation cost by learning the internal representation by a Encode - Decoder architecture.
Used character embedding does not say anything about pronunciation. Voice based embedding might be useful.

WaveNet (16 Sep 2016)

Casual convolution
Dilated convolution
Really slow for a real-life application.

(if someone provide me a summary happy to append here. I knew this paper in advance, therefore I skip it here.)

Voice Loop (20 July 2017)

No need for speech text alignment due to the encoder-decoder architecture.
No encoding is performed for the input text sequence. At each time step, only the corresponding embedding vector for the given character (phoneme) is used for the upper computations.
Context computation is done by a attentive memory buffer in lieu of RNN. For each time step a new vector is added to buffer as the earliest is removed. The buffer keeps the long-term information by using the the autoregressive connection to compute the new time step vector.
Using GMM based attention model (Graved 2013) which ensure the monotonic attention.
Speaker embedding is used for voice adaptation. The embedding is done by a multi-class network trained on speaker identification.
All networks in the architecture are fully-connected.
WORLD vocoder is in use for audio synthesis.
Auto-regressive connection is exploited with teacher-forcing which introduces a combination of the real ground-truth and the previous time-step network output. They discuss using the ground-truth forcing network to predict the next time step only.

My two cents:

Memory buffer is a cool idea but I am not sure how to employ at cool-start.
Performing no encoding on text sequence seems out of the way when the input is a list of characters instead of phonemes (they used phonemes).
One important nuance is the way of teacher-forcing but needs validation on my side.

Datasets

Final Words

DeepSpeech models seem really complicated. There are many things to consider (Maybe number of people on the paper is a good indicator). I especially suggest you to read the appendixes of these papers before doing anything. On the other hand, DeepSpeech stands as a more generic solution expanding to different languages and speakers. This is where you need to decide your requirements and then choose wisely.

Tacotron models are much more simpler. This might also stems from the brevity of the papers. So might be deceiving to this end. Also it is hard to compare since they only use internal dataset to show comparative results. Nevertheless, Tacotron is my initial choice to start TTS due to its simplicity.

VoiceLoop is a very simple architecture comprised only FC layers a magical Buffer. It has a good way to integrate speaker embedding. One down side is that it uses phonemes as input. It might lead problems on different languages other than Eng. Adding couple of encoder layers for character input might solve this problem but need to be experimented.

After all, in essence, it is easy to implement a TTS system. Main components are;

Projecting audio into a succinct representation space (e.g. Spectrogram).
Define the encoder architecture with less possible use of RNNs. (for the sake of run-time efficiency)
Define a smart attention module aligning to the incremental nature of TTS problem (e.g. Monotonic Attention)
Define the decoder with less use of RNNs.
Find the right vocoder algorithm. It has a big impact at the final results. I warned you.
Simple architectures like VoiceLoop might also give satisfying results with smart tricks.

It is also breath taking to see the whole improvement in a single year. It maybe stems from the commercial importance of TTS systems with the emergence of Alexa type voice assistants.