
Kyubyong / Tacotron

Licence: apache-2.0
A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Tacotron

Tts
πŸ€– πŸ’¬ Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Stars: ✭ 5,427 (+209.05%)
Mutual labels:  speech, tts
Tacotron
Audio samples accompanying publications related to Tacotron, an end-to-end speech synthesis model.
Stars: ✭ 493 (-71.92%)
Mutual labels:  speech, tts
Voice Builder
An opensource text-to-speech (TTS) voice building tool
Stars: ✭ 362 (-79.38%)
Mutual labels:  speech, tts
Fre-GAN-pytorch
Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
Stars: ✭ 73 (-95.84%)
Mutual labels:  speech, tts
Tts
Tools to convert text to speech πŸ“šπŸ’¬
Stars: ✭ 84 (-95.22%)
Mutual labels:  speech, tts
editts
Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech
Stars: ✭ 74 (-95.79%)
Mutual labels:  speech, tts
Cboard
AAC communication system with text-to-speech for the browser
Stars: ✭ 437 (-75.11%)
Mutual labels:  speech, tts
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (-98.12%)
Mutual labels:  speech, tts
Dc tts
A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
Stars: ✭ 1,017 (-42.08%)
Mutual labels:  speech, tts
Wsay
Windows "say"
Stars: ✭ 36 (-97.95%)
Mutual labels:  speech, tts
spokestack-android
Extensible Android mobile voice framework: wakeword, ASR, NLU, and TTS. Easily add voice to any Android app!
Stars: ✭ 52 (-97.04%)
Mutual labels:  speech, tts
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (-93.68%)
Mutual labels:  speech, tts
ttslearn
ttslearn: Library for "Speech Synthesis Learned with Python" (Text-to-speech with Python)
Stars: ✭ 158 (-91%)
Mutual labels:  speech, tts
Android Speech
Android speech recognition and text to speech made easy
Stars: ✭ 310 (-82.35%)
Mutual labels:  speech, tts
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (-93.85%)
Mutual labels:  speech, tts
Tts
πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Stars: ✭ 305 (-82.63%)
Mutual labels:  speech, tts
simple-obs-stt
Speech-to-text and keyboard input captions for OBS.
Stars: ✭ 89 (-94.93%)
Mutual labels:  speech, tts
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (-90.83%)
Mutual labels:  speech, tts
Lightspeech
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
Stars: ✭ 31 (-98.23%)
Mutual labels:  speech, tts
Gtts
Python library and CLI tool to interface with Google Translate's text-to-speech API
Stars: ✭ 1,303 (-25.8%)
Mutual labels:  speech, tts

A (Heavily Documented) TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

We train the model on three different speech datasets.

  1. LJ Speech Dataset
  2. Nick Offerman's Audiobooks
  3. The World English Bible

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonably good-quality samples. Nick's audiobooks, about 18 hours in total, are additionally used to see whether the model can learn even from a smaller amount of more variable speech. The World English Bible is a public-domain update of the American Standard Version of 1901 into modern English; its original audio recordings are freely available here. Kyubyong manually split each chapter by verse and aligned the segmented audio clips to the text, for a total of 72 hours. You can download all three at Kaggle Datasets.

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyperparameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)
  • STEP 2. Run python train.py. (If you set prepro to True, run python prepro.py first; a sketch of the kind of features prepro.py computes follows this list.)
  • STEP 3. Run python eval.py regularly during training.
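
Preprocessing here means precomputing the linear and mel spectrograms from the wav files so that train.py does not have to extract them on the fly. Below is a minimal, illustrative sketch of that kind of feature extraction with librosa; the parameter values (sample rate, FFT size, hop length, number of mel bins) are placeholders, the real settings live in hyperparams.py, and the function name is ours, not the repository's. In practice an amplitude-to-dB conversion and normalization step usually follows.

    import numpy as np
    import librosa

    # Illustrative values only; the real settings are defined in hyperparams.py.
    SR, N_FFT, HOP_LENGTH, WIN_LENGTH, N_MELS = 22050, 2048, 256, 1024, 80

    def extract_spectrograms(wav_path):
        """Return (mel, mag): mel spectrogram and linear magnitude spectrogram."""
        y, _ = librosa.load(wav_path, sr=SR)

        # Short-time Fourier transform -> linear magnitude spectrogram, shape (1 + N_FFT/2, T)
        mag = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH, win_length=WIN_LENGTH))

        # Project onto a mel filterbank -> shape (N_MELS, T)
        mel = np.dot(librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS), mag)

        # Transpose to (T, dims): one row per frame, the layout the decoder consumes.
        return mel.T.astype(np.float32), mag.T.astype(np.float32)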

Sample Synthesis

We generate speech samples based on the Harvard Sentences, as the original paper does. They are already included in the repo.

  • Run python synthesize.py and check the files in samples.

Training Curve

Attention Plot

Generated Samples

Pretrained Files

  • Keep in mind 200k steps may not be enough for the best performance.
  • LJ 200k
  • WEB 200k

Notes

  • It's important to monitor the attention plots during training. If the attention plots look good (the alignment looks linear) and then degrade (resembling what they looked like at the beginning of training), training has gone awry and will most likely need to be restarted from a checkpoint where the attention still looked good; in our experience the loss is unlikely to recover on its own. This deterioration of attention corresponds with a spike in the loss.
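
    As an illustration of what monitoring the attention plots means: the alignment is a matrix of attention weights with encoder steps on one axis and decoder steps on the other, and a healthy run shows a roughly diagonal (linear) band. A minimal matplotlib sketch for dumping such a plot during training (names are illustrative, not the repository's):

        import matplotlib
        matplotlib.use("Agg")  # write image files instead of opening a window
        import matplotlib.pyplot as plt

        def save_alignment_plot(alignment, global_step, out_path):
            """alignment: 2-D array of attention weights, shape (encoder_steps, decoder_steps)."""
            fig, ax = plt.subplots()
            im = ax.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
            fig.colorbar(im, ax=ax)
            ax.set_xlabel("Decoder timestep")
            ax.set_ylabel("Encoder timestep")
            ax.set_title("Attention alignment at step {}".format(global_step))
            fig.savefig(out_path)
            plt.close(fig)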

  • In the original paper, the authors said, "An important trick we discovered was predicting multiple, non-overlapping output frames at each decoder step", where the number of multiple frames is the reduction factor, r. We originally interpreted this as predicting non-sequential frames during each decoding step t, and were therefore using the following scheme (with r=5) during decoding.

    t    frame numbers
    -----------------------
    0    [ 0  5 10 15 20]
    1    [ 1  6 11 16 21]
    2    [ 2  7 12 17 22]
    ...
    

    After much experimentation, we were unable to get our model to learn anything useful. We then switched to predicting r sequential frames during each decoding step.

    t    frame numbers
    -----------------------
    0    [ 0  1  2  3  4]
    1    [ 5  6  7  8  9]
    2    [10 11 12 13 14]
    ...
    

    With this setup we noticed improvements in the attention and have since kept it. (A small reshaping sketch follows below.)
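
    In practical terms, the sequential scheme is just a reshape of the target spectrogram: r consecutive frames are stacked along the feature axis so that each decoder step emits one stacked group. A minimal numpy sketch (illustrative, not the repository's exact code):

        import numpy as np

        def group_frames(mel, r=5):
            """Stack r sequential frames per decoder step.

            mel: (T, n_mels) -> (ceil(T/r), n_mels*r), zero-padding the tail.
            Decoder step t then covers frames t*r, t*r+1, ..., t*r+r-1.
            """
            T, n_mels = mel.shape
            pad = (r - T % r) % r
            mel = np.pad(mel, ((0, pad), (0, 0)), mode="constant")
            return mel.reshape(-1, n_mels * r)

        # Example: 14 frames with r=5 become 3 decoder steps (frames 0-4, 5-9, 10-13 plus padding).
        print(group_frames(np.zeros((14, 80))).shape)  # (3, 400)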

  • Perhaps the most important hyperparameter is the learning rate. With an initial learning rate of 0.002 we were never able to learn a clean attention; the loss would frequently explode. With an initial learning rate of 0.001 we were able to learn a clean attention and train for much longer, yielding discernible words during synthesis.

  • Check out other TTS models such as DC-TTS or Deep Voice 3.

Differences from the original paper

  • We use Noam style warmup and decay (a learning-rate sketch follows this list).
  • We implement gradient clipping.
  • Our training batches are bucketed.
  • After the last convolutional layer of the post-processing net, we apply an affine transformation to bring the dimensionality up to 128 from 80, because the highway net requires an input dimensionality of 128. In the original highway networks paper, the authors mention that the input dimensionality can also be increased with zero-padding, but they used the affine transformation in all their experiments. We do not know which choice the Tacotron authors made. (A projection sketch also follows this list.)
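
For concreteness, "Noam style warmup and decay" is the schedule popularized by the Transformer paper: the learning rate grows for a fixed number of warmup steps and then decays with the inverse square root of the step. A minimal sketch of the formula (the warmup length is an assumption, not necessarily the value used in this repo):

    def noam_learning_rate(init_lr, step, warmup_steps=4000.0):
        """Noam schedule: warmup, then inverse-square-root decay; peaks at init_lr when step == warmup_steps."""
        step = max(float(step), 1.0)
        return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)

    # With init_lr=0.001: tiny at step 1, 0.001 at step 4000, decaying slowly afterwards.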
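
The affine transformation mentioned in the last item is simply a dense layer with no activation that widens the 80-dimensional post-net features to the 128 units the highway layers expect (highway layers require matching input and output widths). A minimal TensorFlow 1.x-style sketch (the function name is ours, not the repository's):

    import tensorflow as tf

    def project_for_highway(post_conv_out, highway_units=128):
        # post_conv_out: [batch, time, 80], the output of the last post-processing conv layer.
        # Widen 80 -> 128 with a plain affine projection so the highway net's
        # input width matches its output width.
        return tf.layers.dense(post_conv_out, units=highway_units, activation=None)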

Papers that referenced this repo

Jan. 2018, Kyubyong Park & Tommy Mulc
