keonlee9420 / Comprehensive-Tacotron2

License: MIT
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single- and multi-speaker TTS, along with several techniques that improve the robustness and efficiency of the model.

Programming Languages

Python

Projects that are alternatives of or similar to Comprehensive-Tacotron2

Tacotron2-PyTorch
Yet another PyTorch implementation of Tacotron 2 with reduction factor and faster training speed.
Stars: ✭ 118 (+436.36%)
Mutual labels:  text-to-speech, tts, tacotron, tacotron2, reduction-factor
Cross-Speaker-Emotion-Transfer
PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech
Stars: ✭ 107 (+386.36%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
VAENAR-TTS
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.
Stars: ✭ 66 (+200%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
TensorVox
Desktop application for neural speech synthesis written in C++
Stars: ✭ 140 (+536.36%)
Mutual labels:  text-to-speech, tts, speech-synthesis, tacotron2
Tts
🤖 💬 Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Stars: ✭ 5,427 (+24568.18%)
Mutual labels:  text-to-speech, tts, tacotron, tacotron2
WaveGrad2
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Stars: ✭ 55 (+150%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Daft-Exprt
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
Stars: ✭ 41 (+86.36%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Wavernn
WaveRNN Vocoder + TTS
Stars: ✭ 1,636 (+7336.36%)
Mutual labels:  text-to-speech, tts, speech-synthesis, tacotron
Tensorflowtts
😝 TensorFlowTTS: Real-time state-of-the-art speech synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
Stars: ✭ 2,382 (+10727.27%)
Mutual labels:  text-to-speech, tts, speech-synthesis, tacotron2
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (+631.82%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Parallel-Tacotron2
PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
Stars: ✭ 149 (+577.27%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
tacotron2
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow
Stars: ✭ 102 (+363.64%)
Mutual labels:  tts, tacotron, tacotron2
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (+390.91%)
Mutual labels:  text-to-speech, tts, speech-synthesis
LVCNet
LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation
Stars: ✭ 67 (+204.55%)
Mutual labels:  text-to-speech, tts, speech-synthesis
ttslearn
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)
Stars: ✭ 158 (+618.18%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (+50%)
Mutual labels:  text-to-speech, tts, speech-synthesis
talkie
Text-to-speech browser extension button. Select text on any web page, and have the computer read it out loud for you by simply clicking the Talkie button.
Stars: ✭ 43 (+95.45%)
Mutual labels:  text-to-speech, tts, speech-synthesis
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+3722.73%)
Mutual labels:  text-to-speech, tts, speech-synthesis
esp32-flite
Speech synthesis running on ESP32 based on Flite engine.
Stars: ✭ 28 (+27.27%)
Mutual labels:  text-to-speech, tts, speech-synthesis
TTS tf
WIP Tensorflow implementation of https://github.com/mozilla/TTS
Stars: ✭ 14 (-36.36%)
Mutual labels:  text-to-speech, tts, tacotron

Comprehensive Tacotron2 - PyTorch Implementation

PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Unlike many previous implementations, this is a comprehensive Tacotron2: the model supports both single- and multi-speaker TTS, along with several techniques, such as the reduction factor, that make the decoder alignment more robust. The model learns the alignment within only 5k steps.

Validation logs of the synthesized mels and alignments up to 70K steps are shown below (LJSpeech_val_LJ038-0050 and VCTK_val_p323_008, from top to bottom).

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/LJSpeech/ or output/ckpt/VCTK/.

For single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset LJSpeech

For multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset VCTK

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step RESTORE_STEP --mode batch --dataset LJSpeech

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt. You can replace LJSpeech with VCTK. Note that currently only a batch size of 1 is supported, due to the autoregressive model architecture.

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English TTS dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads about 400 sentences selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.
  • Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following the LJSpeech and VCTK setups, respectively.

Preprocessing
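
Assuming the repository's usual two-step layout (the script names prepare_align.py and preprocess.py are taken from the repository structure and may differ across versions), first prepare the alignments and then run the preprocessing script:

python3 prepare_align.py --dataset DATASET

python3 preprocess.py --dataset DATASET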

Training

Train your model with

python3 train.py --dataset DATASET
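
To resume training from a saved checkpoint, train.py should accept the same --restore_step flag that synthesize.py uses (an assumption based on the shared CLI across the scripts above):

python3 train.py --dataset DATASET --restore_step RESTORE_STEP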

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost.

Implementation Issues

  • Supports n_frames_per_step > 1 mode (which is not supported by NVIDIA's tacotron2). This is the key factor in making the decoder alignment robust, as described in the paper, and it also reduces training and inference time roughly by that factor (see the first sketch below).
  • The current implementation provides a pre-trained model with n_frames_per_step==2, but it should also work for any value greater than 2.
  • Adds espnet's implementation of diagonal guided attention loss to force diagonal alignment in the decoder attention module. You can toggle it in the config (see the second sketch below).
  • Two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch, or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle it in the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on the VCTK dataset shows clear identification among speakers; the t-SNE plot of the extracted speaker embeddings confirms the separation.
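
The reduction factor is easiest to see as a regrouping of the mel targets so that each decoder step predicts r frames at once. Below is a minimal PyTorch sketch of that regrouping; the function name and tensor shapes are illustrative, not the repository's actual code:

import torch
import torch.nn.functional as F

def group_mel_targets(mel, n_frames_per_step=2):
    # mel: (B, T, n_mels) ground-truth mel spectrogram.
    # Pad T up to a multiple of r, then fold r consecutive frames
    # into a single decoder target of size r * n_mels. The decoder
    # then runs T // r steps instead of T, which is what cuts the
    # training and inference time roughly by the reduction factor.
    B, T, n_mels = mel.shape
    r = n_frames_per_step
    pad = (-T) % r
    if pad:
        mel = F.pad(mel, (0, 0, 0, pad))  # zero-pad along the time axis
    return mel.reshape(B, (T + pad) // r, r * n_mels)

At synthesis time, the predicted (B, T', r * n_mels) outputs are reshaped back into (B, T' * r, n_mels) frames before vocoding.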

  • For the vocoder, the current implementation supports HiFi-GAN and MelGAN, which are much better than WaveNet.
  • Currently, fp16_run mode is not supported.
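
For reference, the diagonal guided attention loss penalizes attention weights that fall far from the diagonal using the soft mask W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)) from Tachibana et al. Below is a minimal sketch of that formulation; the function name and tensor shapes are illustrative rather than the repository's (or espnet's) actual code:

import torch

def guided_attention_loss(attn, text_lens, mel_lens, g=0.2):
    # attn: (B, T_mel, T_text) decoder attention weights.
    # Attention mass far from the diagonal n/N == t/T is weighted
    # heavily, so minimizing the masked mean pulls the alignment
    # toward a diagonal path.
    losses = []
    for b in range(attn.size(0)):
        t_len, n_len = int(mel_lens[b]), int(text_lens[b])
        t = torch.arange(t_len, device=attn.device).float().unsqueeze(1) / t_len
        n = torch.arange(n_len, device=attn.device).float().unsqueeze(0) / n_len
        w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g ** 2))
        losses.append((attn[b, :t_len, :n_len] * w).mean())
    return torch.stack(losses).mean()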

Citation

@misc{lee2021comprehensive-tacotron2,
  author = {Lee, Keon},
  title = {Comprehensive-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Comprehensive-Tacotron2}}
}
