
keonlee9420 / Parallel-Tacotron2

License: MIT
PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Projects that are alternatives to, or similar to, Parallel-Tacotron2

VAENAR-TTS
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.
Stars: ✭ 66 (-55.7%)
Mutual labels:  text-to-speech, duration, tts, speech-synthesis, vae, self-attention, neural-tts, non-autoregressive
Daft-Exprt
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
Stars: ✭ 41 (-72.48%)
Mutual labels:  text-to-speech, tts, speech-synthesis, english, neural-tts, non-autoregressive
Cross-Speaker-Emotion-Transfer
PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech
Stars: ✭ 107 (-28.19%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive, parallel-tacotron
WaveGrad2
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Stars: ✭ 55 (-63.09%)
Mutual labels:  text-to-speech, duration, tts, speech-synthesis, neural-tts, non-autoregressive
Comprehensive-Tacotron2
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.
Stars: ✭ 22 (-85.23%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Tensorflowtts
😝 TensorFlowTTS: Real-time state-of-the-art speech synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
Stars: ✭ 2,382 (+1498.66%)
Mutual labels:  text-to-speech, tts, speech-synthesis, fastspeech
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (+8.05%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Expressive-FastSpeech2
PyTorch Implementation of Non-autoregressive Expressive (emotional, conversational) TTS based on FastSpeech2, supporting English, Korean, and your own languages.
Stars: ✭ 139 (-6.71%)
Mutual labels:  text-to-speech, tts, speech-synthesis, non-autoregressive
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (-27.52%)
Mutual labels:  text-to-speech, tts, speech-synthesis, fastspeech
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+464.43%)
Mutual labels:  text-to-speech, tts, speech-synthesis
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+97.99%)
Mutual labels:  text-to-speech, tts, speech-synthesis
react-native-spokestack
Spokestack: give your React Native app a voice interface!
Stars: ✭ 53 (-64.43%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+64.43%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Marytts
MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java
Stars: ✭ 1,699 (+1040.27%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Pytorch Dc Tts
Text to Speech with PyTorch (English and Mongolian)
Stars: ✭ 122 (-18.12%)
Mutual labels:  text-to-speech, tts, speech-synthesis
vits
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Stars: ✭ 1,604 (+976.51%)
Mutual labels:  text-to-speech, tts, speech-synthesis
FCH-TTS
A fast Text-to-Speech (TTS) model that works well for English, Mandarin/Chinese, Japanese, Korean, Russian, and Tibetan (so far).
Stars: ✭ 154 (+3.36%)
Mutual labels:  tts, english, fastspeech
Crystal
Crystal - C++ implementation of a unified framework for multilingual TTS synthesis engine with SSML specification as interface.
Stars: ✭ 108 (-27.52%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (-25.5%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (-77.85%)
Mutual labels:  text-to-speech, tts, speech-synthesis

Parallel Tacotron2

PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Updates

  • 2021.05.25: Only the soft-DTW remains as the last hurdle! Following the author's advice on the implementation, I ran several tests on each module, one by one, under a supervised duration signal with L1 loss (as in FastSpeech2). So far, I can confirm that all modules except the soft-DTW are working well (the referenced figure shows, from top to bottom, the synthesized spectrogram, GT spectrogram, residual alignment, and W from the LearnedUpsampling).

    For details, please check the latest commit log and the updated Implementation Issues section. You can also find the ongoing experiments at https://github.com/keonlee9420/FastSpeech2/commits/ptaco2.

  • 2021.05.15: Implementation done. Sanity checks on training and inference passed, but the model still cannot converge.

    I'm waiting for your contributions! Please let me know if you find any mistakes in my implementation or have any advice for training the model successfully. See the Implementation Issues section.

Training

Requirements

  • You can install the Python dependencies with

    pip3 install -r requirements.txt
  • Install fairseq (official document, github) to utilize LConvBlock. Please check #5 to resolve any installation issues. (A usage sketch of the lightweight convolution follows below.)
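
For reference, here is a minimal sketch of how a lightweight-convolution block can be built on fairseq. LightweightConv and its arguments are fairseq's actual API; the surrounding structure (LConvBlockSketch, its default dimensions, and the residual/LayerNorm arrangement) is illustrative and not necessarily this repository's exact LConvBlock.

import torch.nn as nn
from fairseq.modules import LightweightConv  # requires fairseq to be installed

class LConvBlockSketch(nn.Module):
    # Lightweight convolution followed by a residual connection and LayerNorm.
    def __init__(self, d_model=256, kernel_size=3, num_heads=8, dropout=0.1):
        super().__init__()
        self.conv = LightweightConv(
            d_model,
            kernel_size,
            padding_l=kernel_size // 2,
            num_heads=num_heads,
            weight_dropout=dropout,
            weight_softmax=True,
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (B, T, d_model)
        residual = x
        x = self.conv(x.transpose(0, 1)).transpose(0, 1)  # fairseq expects (T, B, C)
        return self.norm(residual + x)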

Datasets

The supported datasets:

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • (more to be added)

Preprocessing

After downloading the datasets, set the corpus_path in preprocess.yaml and run the preparation script:

python3 prepare_data.py config/LJSpeech/preprocess.yaml

Then, run the preprocessing script:

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The model cannot converge yet. I'm still debugging, but progress would be boosted by your awesome contributions!

Inference

Inference

For a single inference, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The generated utterances will be saved in output/result/.

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost.

Implementation Issues

Overall, normalization and activation functions that are not suggested in the original paper are arranged where needed to prevent NaN values (gradients) in the forward and backward computations. (A NaN indicates that something is wrong in the network.)
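
When hunting down where NaNs first appear, PyTorch's built-in anomaly detection is convenient. This is a generic debugging tip using standard PyTorch APIs, not part of this repository's training script:

import torch

torch.autograd.set_detect_anomaly(True)  # backward() will report the forward op that produced a NaN

def assert_finite_grads(model: torch.nn.Module):
    # Raise if any parameter gradient contains NaN/Inf; call right after loss.backward().
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite gradient in {name}")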

Text Encoder

  1. Use the FFTBlock of FastSpeech2 as the transformer block of the text encoder.
  2. Use a dropout of 0.2 in the ConvBlock of the text encoder.
  3. To stand in for the paper's "proprietary normalization engine":
    • Apply the same text normalization as in FastSpeech2.
    • Implement a grapheme_to_phoneme function (see ./text/__init__.py and the sketch below).
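
Below is a minimal sketch of what such a grapheme_to_phoneme function can look like, assuming the g2p_en package that FastSpeech2-style repositories commonly use; the exact function in ./text/__init__.py may differ in detail, and the lexicon argument here is illustrative.

import re
from string import punctuation

from g2p_en import G2p  # pip3 install g2p_en

_g2p = G2p()

def grapheme_to_phoneme(text, lexicon=None):
    # Convert raw text to phonemes, preferring a lexicon lookup and
    # falling back to g2p_en for out-of-vocabulary words.
    phones = []
    for word in re.split(r"([,;.\-\?\!\s])", text):
        key = word.lower().strip(punctuation)
        if lexicon is not None and key in lexicon:
            phones += lexicon[key]
        else:
            phones += [p for p in _g2p(word) if p != " "]
    return phones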

Residual Encoder

  1. Use an 80-channel mel-spectrogram instead of 128 bins.
  2. A regular sinusoidal positional embedding is used at the frame level instead of the combination of three positional embeddings in Parallel Tacotron (a sketch of the sinusoid table follows below). Since the model depends entirely on unsupervised learning for positions, this choice may be one reason the model fails to converge.
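
For clarity, point 2 refers to the standard Transformer sinusoid table. A minimal sketch (dimensions illustrative; d_model assumed even):

import torch

def sinusoidal_positions(n_position: int, d_model: int) -> torch.Tensor:
    # Standard Transformer positional table of shape (n_position, d_model).
    position = torch.arange(n_position, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    table = torch.zeros(n_position, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table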

Duration Predictor & Learned Upsampling

  1. Use nn.SiLU() for the Swish activation.
  2. When obtaining W and C, a concatenation is applied across S, E, and V after broadcasting V to the frame domain (T domain), as sketched below.
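
The broadcasting in point 2 can be pictured as a shape sketch. The names S, E, and V follow the paper; every size below is illustrative:

import torch

B, T, K, d = 2, 200, 40, 256                 # batch, frames, tokens, hidden size
S = torch.randn(B, T, K, 1)                  # per-(frame, token) start-boundary grid
E = torch.randn(B, T, K, 1)                  # per-(frame, token) end-boundary grid
V = torch.randn(B, K, d)                     # token-level hidden states

V_b = V.unsqueeze(1).expand(B, T, K, d)      # broadcast V over the frame (T) domain
features = torch.cat([S, E, V_b], dim=-1)    # (B, T, K, 2 + d), fed to the W and C MLPs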

Decoder

  1. Use LConvBlock and a regular sinusoidal positional embedding.
  2. The iterative mel-spectrogram is projected by a linear layer.
  3. Apply nn.Tanh() to each LConvBlock output, following the activation pattern of the decoder in FastSpeech2 (see the sketch below).
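
Point 3 amounts to wrapping each block's output in tanh. A minimal sketch with the block internals elided (any LConv-style nn.Module can serve as the block argument):

import torch.nn as nn

class TanhLConvLayer(nn.Module):
    # Applies nn.Tanh() to the output of an LConv-style block,
    # following the decoder activation pattern of FastSpeech2.
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.act = nn.Tanh()

    def forward(self, x):
        return self.act(self.block(x))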

Loss

  1. Use the optimizer and scheduler of FastSpeech2 (which follow Attention Is All You Need, as described in the original paper).
  2. The soft-DTW is based on pytorch-softdtw-cuda (post); a usage sketch follows this list.
    1. A customized soft-DTW is implemented in model/soft_dtw_cuda.py, reflecting the recursion suggested in the original paper.
    2. The original soft-DTW does not assume a final loss and therefore computes only E. When employed as a loss function, a Jacobian product is added to return the derivative of R w.r.t. the input X.
    3. Currently, the maximum batch size is 8 on a 24 GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
      • In the original paper, a custom differentiable diagonal band operation was implemented to address the O(T^2) complexity, but that part has not been explored in this implementation yet.
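
For orientation, the upstream pytorch-softdtw-cuda is used roughly as below (constructor arguments are from that project; the batch and shape values are illustrative). The customized version in model/soft_dtw_cuda.py extends it with the recursion and Jacobian product described above.

import torch
from soft_dtw_cuda import SoftDTW  # module name as in the upstream project

sdtw = SoftDTW(use_cuda=torch.cuda.is_available(), gamma=0.1)

pred = torch.randn(8, 400, 80, requires_grad=True)  # (batch, frames, mel bins)
target = torch.randn(8, 420, 80)                    # sequence lengths may differ
loss = sdtw(pred, target).mean()                    # per-item soft-DTW alignment cost
loss.backward()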

Citation

@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}
