
syang1993 / Gst Tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Programming Languages

python
139335 projects - #7 most used programming language

Labels

Projects that are alternatives of or similar to Gst Tacotron

Tacotron asr
Speech Recognition Using Tacotron
Stars: ✭ 165 (-47.28%)
Mutual labels:  tacotron
tacotron2
Pytorch implementation of "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", ICASSP, 2018.
Stars: ✭ 17 (-94.57%)
Mutual labels:  tacotron
mimic2
Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.
Stars: ✭ 537 (+71.57%)
Mutual labels:  tacotron
Tacotron Pytorch
Pytorch implementation of Tacotron
Stars: ✭ 189 (-39.62%)
Mutual labels:  tacotron
Tacotron
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
Stars: ✭ 2,581 (+724.6%)
Mutual labels:  tacotron
FCH-TTS
A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far).
Stars: ✭ 154 (-50.8%)
Mutual labels:  tacotron
Xva Synth
Machine learning based speech synthesis Electron app, with voices from specific characters from video games
Stars: ✭ 136 (-56.55%)
Mutual labels:  tacotron
Tacotron pytorch
Tacotron implementation of pytorch
Stars: ✭ 12 (-96.17%)
Mutual labels:  tacotron
Tacotron pytorch
PyTorch implementation of Tacotron speech synthesis model.
Stars: ✭ 242 (-22.68%)
Mutual labels:  tacotron
Tacotron2-PyTorch
Yet another PyTorch implementation of Tacotron 2 with reduction factor and faster training speed.
Stars: ✭ 118 (-62.3%)
Mutual labels:  tacotron
Multi Tacotron Voice Cloning
Phoneme multilingual(Russian-English) voice cloning based on
Stars: ✭ 192 (-38.66%)
Mutual labels:  tacotron
Mimic Recording Studio
Mimic Recording Studio is a Docker-based application you can install to record voice samples, which can then be trained into a TTS voice with Mimic2
Stars: ✭ 202 (-35.46%)
Mutual labels:  tacotron
tacotron2
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow
Stars: ✭ 102 (-67.41%)
Mutual labels:  tacotron
Gst Tacotron
A PyTorch implementation of Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Stars: ✭ 175 (-44.09%)
Mutual labels:  tacotron
Text-to-Speech-Landscape
No description or website provided.
Stars: ✭ 31 (-90.1%)
Mutual labels:  tacotron
Tacotron 2
DeepMind's Tacotron-2 Tensorflow implementation
Stars: ✭ 1,968 (+528.75%)
Mutual labels:  tacotron
vietTTS
Vietnamese Text to Speech library
Stars: ✭ 78 (-75.08%)
Mutual labels:  tacotron
Comprehensive-Tacotron2
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.
Stars: ✭ 22 (-92.97%)
Mutual labels:  tacotron
ExpressiveTacotron
This repository provides a multi-mode and multi-speaker expressive speech synthesis framework, including multi-attentive Tacotron, DurIAN, Non-attentive Tacotron, GST, VAE, GMVAE, and X-vectors for building prosody encoder.
Stars: ✭ 51 (-83.71%)
Mutual labels:  tacotron
TTS tf
WIP Tensorflow implementation of https://github.com/mozilla/TTS
Stars: ✭ 14 (-95.53%)
Mutual labels:  tacotron

GST Tacotron (expressive end-to-end speech synthesis using global style tokens)

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis" and "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron".

Audio Samples

  • Audio samples from models trained using this repo with the default hyperparameters.
    • This set was trained on the Blizzard 2013 dataset, with and without global style tokens (GSTs).
      • I found that the synthesized audio learns the prosody of the reference audio.
      • The audio quality isn't as good as in the paper. More data, more training steps, a better attention mechanism, and a WaveNet vocoder may improve it.

Quick Start:

Installing dependencies

  1. Install Python 3.

  2. Install TensorFlow for your platform. For better performance, install the GPU version if a GPU is available. This code works with TensorFlow 1.4.

  3. Install requirements:

    pip install -r requirements.txt
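
    Since this code targets an older TensorFlow release, a quick sanity check after installing can avoid confusion later. This optional snippet only restates the TensorFlow 1.4 note above:

    import tensorflow as tf

    # This repo was written against TensorFlow 1.4; newer 1.x releases may still work,
    # but TensorFlow 2.x removed 1.x-era APIs (such as tf.contrib) that code of this
    # vintage commonly relies on.
    print("TensorFlow version:", tf.__version__)
    assert tf.__version__.startswith("1."), "expected a TensorFlow 1.x install"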
    

Training

  1. Download a dataset:

    The Blizzard 2013 dataset is supported out of the box. We use it to test this repo (Google's paper used 147 hours of data read by the 2013 Blizzard Challenge speaker). That year's Challenge provides about 200 hours of unsegmented speech plus 9,741 segmented waveforms; I ran all the experiments on the 9,741 segmented waveforms, since it's hard for me to split the unsegmented data.

    You can use other datasets if you convert them to the right format; see keithito's TRAINING_DATA.md for details on data preprocessing. A rough sketch of one possible conversion step is shown below.
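
    As a minimal sketch, assuming an LJ Speech-style layout where each line of metadata.csv is "<wav_id>|<transcript>" and the audio lives in a wavs/ folder (the authoritative format is the one described in TRAINING_DATA.md), a conversion script might look like the following. All paths and the tab-separated transcript file are hypothetical:

    import os

    WAV_DIR = "my_dataset/wavs"            # assumption: one .wav file per utterance
    TRANSCRIPTS = "my_dataset/text.txt"    # assumption: "<wav_id><TAB><transcript>" per line

    with open(TRANSCRIPTS, encoding="utf-8") as src, \
         open("my_dataset/metadata.csv", "w", encoding="utf-8") as out:
        for line in src:
            wav_id, text = line.rstrip("\n").split("\t", 1)
            # keep only utterances whose audio file actually exists
            if os.path.exists(os.path.join(WAV_DIR, wav_id + ".wav")):
                out.write(f"{wav_id}|{text}\n")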

  2. Preprocess the data

    python3 preprocess.py --dataset blizzard2013
    
  3. Train a model

    python3 train.py
    

    The above command uses the default hyperparameters, which train a model on CMUDict-based phoneme sequences with 4-head multi-head style attention for the global style tokens. If you set use_gst=False in the hparams, it will instead train a model like the one in Google's other paper, Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron.

    Tunable hyperparameters are found in hparams.py. You can adjust these at the command line using the --hparams flag, for example --hparams="batch_size=16,outputs_per_step=2". Hyperparameters should generally be set to the same values at both training and eval time.
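
    For example, to train the prosody-transfer variant described above (no global style tokens), the use_gst switch can be combined with other overrides like this; check hparams.py for the full list of hyperparameter names:

    python3 train.py --hparams="use_gst=False,batch_size=16"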

  4. Synthesize from a checkpoint

    python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000 --text "hello text" --reference_audio /path/to/ref_audio
    

    Replace "185000" with the checkpoint number that you want to use. Then this command line will synthesize a waveform with the content "hello text" and the style of the reference audio. If you don't use the --reference_audio, it will generate audio with random style weights, which may generate unintelligible audio sometimes.

    If you set the --hparams flag when training, set the same value here.

Notes:

Since the paper doesn't give details of the style-attention layer, I'm a little unsure about the global style tokens. For the token embedding (GST) size, the paper says it is set to 256/h, where h is the number of attention heads. I'm not sure whether the same GSTs or different GSTs should be used as the attention memory for all heads; the sketch below shows one possible reading.
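
To make the ambiguity concrete, here is a rough NumPy sketch of one possible reading: the reference-encoder output is the query, a single bank of token embeddings of size 256/h is shared across all heads (the alternative would be a separate bank per head), and each head attends over the tokens. The token count, dimensions, and scaled dot-product attention form are illustrative assumptions, not this repo's actual implementation:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    h, num_tokens, ref_dim = 4, 10, 128    # example values only
    token_dim = 256 // h                   # paper: token embedding size is 256 / h

    tokens = np.random.randn(num_tokens, token_dim)   # one shared token bank (option A);
                                                      # option B: a separate bank per head
    ref_embedding = np.random.randn(ref_dim)          # reference-encoder output (the query)

    heads = []
    for _ in range(h):
        Wq = np.random.randn(ref_dim, token_dim)      # per-head query projection (learned in practice)
        q = ref_embedding @ Wq                                   # (token_dim,)
        scores = softmax(q @ tokens.T / np.sqrt(token_dim))      # attention weights over the tokens
        heads.append(scores @ tokens)                            # weighted sum of tokens
    style_embedding = np.concatenate(heads)           # concatenated heads -> size 256
    print(style_embedding.shape)                       # (256,)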

Reference

  • Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
  • Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
  • keithito's Tacotron implementation: https://github.com/keithito/tacotron