
syang1993 / Gst Tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Programming Languages

python
139335 projects - #7 most used programming language

Labels

Projects that are alternatives of or similar to Gst Tacotron

Tacotron asr
Speech Recognition Using Tacotron
Stars: ✭ 165 (-47.28%)
Mutual labels:  tacotron
tacotron2
Pytorch implementation of "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", ICASSP, 2018.
Stars: ✭ 17 (-94.57%)
Mutual labels:  tacotron
mimic2
Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.
Stars: ✭ 537 (+71.57%)
Mutual labels:  tacotron
Tacotron Pytorch
Pytorch implementation of Tacotron
Stars: ✭ 189 (-39.62%)
Mutual labels:  tacotron
Tacotron
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
Stars: ✭ 2,581 (+724.6%)
Mutual labels:  tacotron
FCH-TTS
A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far).
Stars: ✭ 154 (-50.8%)
Mutual labels:  tacotron
Xva Synth
Machine learning based speech synthesis Electron app, with voices from specific characters from video games
Stars: ✭ 136 (-56.55%)
Mutual labels:  tacotron
Tacotron pytorch
Tacotron implementation of pytorch
Stars: ✭ 12 (-96.17%)
Mutual labels:  tacotron
Tacotron pytorch
PyTorch implementation of Tacotron speech synthesis model.
Stars: ✭ 242 (-22.68%)
Mutual labels:  tacotron
Tacotron2-PyTorch
Yet another PyTorch implementation of Tacotron 2 with reduction factor and faster training speed.
Stars: ✭ 118 (-62.3%)
Mutual labels:  tacotron
Multi Tacotron Voice Cloning
Phoneme multilingual(Russian-English) voice cloning based on
Stars: ✭ 192 (-38.66%)
Mutual labels:  tacotron
Mimic Recording Studio
Mimic Recording Studio is a Docker-based application you can install to record voice samples, which can then be trained into a TTS voice with Mimic2
Stars: ✭ 202 (-35.46%)
Mutual labels:  tacotron
tacotron2
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow
Stars: ✭ 102 (-67.41%)
Mutual labels:  tacotron
Gst Tacotron
A PyTorch implementation of Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Stars: ✭ 175 (-44.09%)
Mutual labels:  tacotron
Text-to-Speech-Landscape
No description or website provided.
Stars: ✭ 31 (-90.1%)
Mutual labels:  tacotron
Tacotron 2
DeepMind's Tacotron-2 Tensorflow implementation
Stars: ✭ 1,968 (+528.75%)
Mutual labels:  tacotron
vietTTS
Vietnamese Text to Speech library
Stars: ✭ 78 (-75.08%)
Mutual labels:  tacotron
Comprehensive-Tacotron2
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.
Stars: ✭ 22 (-92.97%)
Mutual labels:  tacotron
ExpressiveTacotron
This repository provides a multi-mode and multi-speaker expressive speech synthesis framework, including multi-attentive Tacotron, DurIAN, Non-attentive Tacotron, GST, VAE, GMVAE, and X-vectors for building prosody encoder.
Stars: ✭ 51 (-83.71%)
Mutual labels:  tacotron
TTS tf
WIP Tensorflow implementation of https://github.com/mozilla/TTS
Stars: ✭ 14 (-95.53%)
Mutual labels:  tacotron

GST Tacotron (expressive end-to-end speech synthesis using global style tokens)

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis" and "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron".

Audio Samples

  • Audio samples from models trained using this repo with the default hyperparameters.
    • This set was trained on the Blizzard 2013 dataset, with and without global style tokens (GSTs).
      • I found that the synthesized audio learns the prosody of the reference audio.
      • The audio quality isn't as good as in the paper. More data, more training steps, a better attention mechanism, and a WaveNet vocoder may improve it.

Quick Start:

Installing dependencies

  1. Install Python 3.

  2. Install TensorFlow for your platform. For better performance, install the GPU version if a GPU is available. This code works with TensorFlow 1.4.

  3. Install requirements:

    pip install -r requirements.txt
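
    Since this code targets an older TensorFlow release, a quick sanity check after installing can avoid confusion later. This optional snippet only restates the TensorFlow 1.4 note above:

    import tensorflow as tf

    # This repo was written against TensorFlow 1.4; newer 1.x releases may still work,
    # but TensorFlow 2.x removed 1.x-era APIs (such as tf.contrib) that code of this
    # vintage commonly relies on.
    print("TensorFlow version:", tf.__version__)
    assert tf.__version__.startswith("1."), "expected a TensorFlow 1.x install"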
    

Training

  1. Download a dataset:

    The Blizzard 2013 dataset is supported out of the box. We use it to test this repo (Google's paper used 147 hours of data read by the 2013 Blizzard Challenge speaker). That year's Challenge provides about 200 hours of unsegmented speech plus 9,741 segmented waveforms; I ran all the experiments on the 9,741 segmented waveforms, since it's hard for me to split the unsegmented data.

    You can use other datasets if you convert them to the right format; see keithito's TRAINING_DATA.md for details on data preprocessing. A rough sketch of one possible conversion step is shown below.
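
    As a minimal sketch, assuming an LJ Speech-style layout where each line of metadata.csv is "<wav_id>|<transcript>" and the audio lives in a wavs/ folder (the authoritative format is the one described in TRAINING_DATA.md), a conversion script might look like the following. All paths and the tab-separated transcript file are hypothetical:

    import os

    WAV_DIR = "my_dataset/wavs"            # assumption: one .wav file per utterance
    TRANSCRIPTS = "my_dataset/text.txt"    # assumption: "<wav_id><TAB><transcript>" per line

    with open(TRANSCRIPTS, encoding="utf-8") as src, \
         open("my_dataset/metadata.csv", "w", encoding="utf-8") as out:
        for line in src:
            wav_id, text = line.rstrip("\n").split("\t", 1)
            # keep only utterances whose audio file actually exists
            if os.path.exists(os.path.join(WAV_DIR, wav_id + ".wav")):
                out.write(f"{wav_id}|{text}\n")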

  2. Preprocess the data

    python3 preprocess.py --dataset blizzard2013
    
  3. Train a model

    python3 train.py
    

    The above command uses the default hyperparameters, which train a model on CMUDict-based phoneme sequences with 4-head multi-head style attention for the global style tokens. If you set use_gst=False in the hparams, it will instead train a model like the one in Google's other paper, Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron.

    Tunable hyperparameters are found in hparams.py. You can adjust these at the command line using the --hparams flag, for example --hparams="batch_size=16,outputs_per_step=2". Hyperparameters should generally be set to the same values at both training and eval time.
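
    For example, to train the prosody-transfer variant described above (no global style tokens), the use_gst switch can be combined with other overrides like this; check hparams.py for the full list of hyperparameter names:

    python3 train.py --hparams="use_gst=False,batch_size=16"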

  4. Synthesize from a checkpoint

    python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000 --text "hello text" --reference_audio /path/to/ref_audio
    

    Replace "185000" with the checkpoint number that you want to use. Then this command line will synthesize a waveform with the content "hello text" and the style of the reference audio. If you don't use the --reference_audio, it will generate audio with random style weights, which may generate unintelligible audio sometimes.

    If you set the --hparams flag when training, set the same value here.

Notes:

Since the paper doesn't give details of the style-attention layer, I'm a little unsure about the global style tokens. For the token embedding (GST) size, the paper says it is set to 256/h, where h is the number of attention heads. I'm not sure whether the same GSTs or different GSTs should be used as the attention memory for all heads; the sketch below shows one possible reading.
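
To make the ambiguity concrete, here is a rough NumPy sketch of one possible reading: the reference-encoder output is the query, a single bank of token embeddings of size 256/h is shared across all heads (the alternative would be a separate bank per head), and each head attends over the tokens. The token count, dimensions, and scaled dot-product attention form are illustrative assumptions, not this repo's actual implementation:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    h, num_tokens, ref_dim = 4, 10, 128    # example values only
    token_dim = 256 // h                   # paper: token embedding size is 256 / h

    tokens = np.random.randn(num_tokens, token_dim)   # one shared token bank (option A);
                                                      # option B: a separate bank per head
    ref_embedding = np.random.randn(ref_dim)          # reference-encoder output (the query)

    heads = []
    for _ in range(h):
        Wq = np.random.randn(ref_dim, token_dim)      # per-head query projection (learned in practice)
        q = ref_embedding @ Wq                                   # (token_dim,)
        scores = softmax(q @ tokens.T / np.sqrt(token_dim))      # attention weights over the tokens
        heads.append(scores @ tokens)                            # weighted sum of tokens
    style_embedding = np.concatenate(heads)           # concatenated heads -> size 256
    print(style_embedding.shape)                       # (256,)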

Reference

  • Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
  • Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
  • keithito's Tacotron implementation: https://github.com/keithito/tacotron