Kyubyong / Expressive_tacotron

TensorFlow Implementation of Expressive Tacotron


Projects that are alternatives to or similar to Expressive_tacotron

Java Speech Api
The J.A.R.V.I.S. Speech API is designed to be simple and efficient, using speech engines created by Google to provide parts of the API's functionality. Essentially, it is an API written in Java that includes a recognizer, a synthesizer, and a microphone-capture utility. Because it relies on Google services for synthesis and recognition, it requires an Internet connection, but in return it provides a complete, modern, and fully functional speech API in Java.
Stars: ✭ 490 (+155.21%)
Mutual labels:  speech-to-text, speech-synthesis
Tacotron asr
Speech Recognition Using Tacotron
Stars: ✭ 165 (-14.06%)
Mutual labels:  speech-to-text, tacotron
Artyom.js
A voice control, voice commands, speech recognition, and speech synthesis JavaScript library. Create your own Siri, Google Now, or Cortana within your website using Google Chrome.
Stars: ✭ 1,011 (+426.56%)
Mutual labels:  speech-to-text, speech-synthesis
Tacotron 2
TensorFlow implementation of Google's Tacotron-2
Stars: ✭ 1,968 (+925%)
Mutual labels:  speech-synthesis, tacotron
Kalliope
Kalliope is a framework that will help you to create your own personal assistant.
Stars: ✭ 1,509 (+685.94%)
Mutual labels:  speech-to-text, speech-synthesis
Tacotron pytorch
Tacotron implementation in PyTorch
Stars: ✭ 12 (-93.75%)
Mutual labels:  speech-synthesis, tacotron
Openseq2seq
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
Stars: ✭ 1,378 (+617.71%)
Mutual labels:  speech-to-text, speech-synthesis
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+338.02%)
Mutual labels:  speech-synthesis, speech-to-text
Wavernn
WaveRNN Vocoder + TTS
Stars: ✭ 1,636 (+752.08%)
Mutual labels:  speech-synthesis, tacotron
Tacotron Pytorch
A PyTorch implementation of Tacotron, an end-to-end text-to-speech deep learning model
Stars: ✭ 104 (-45.83%)
Mutual labels:  speech-synthesis, tacotron
leon
🧠 Leon is your open-source personal assistant.
Stars: ✭ 8,560 (+4358.33%)
Mutual labels:  speech-synthesis, speech-to-text
Xva Synth
Machine learning based speech synthesis Electron app, with voices from specific characters from video games
Stars: ✭ 136 (-29.17%)
Mutual labels:  speech-synthesis, tacotron
mimic2
Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.
Stars: ✭ 537 (+179.69%)
Mutual labels:  speech-synthesis, tacotron
Comprehensive-Tacotron2
PyTorch implementation of Google's "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". This implementation supports both single- and multi-speaker TTS, along with several techniques to improve the robustness and efficiency of the model.
Stars: ✭ 22 (-88.54%)
Mutual labels:  speech-synthesis, tacotron
spokestack-ios
Spokestack: give your iOS app a voice interface!
Stars: ✭ 27 (-85.94%)
Mutual labels:  speech-synthesis, speech-to-text
Tacotron2
A PyTorch implementation of Tacotron2, an end-to-end text-to-speech (TTS) system described in "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions".
Stars: ✭ 43 (-77.6%)
Mutual labels:  speech-synthesis, tacotron
speechrec
A simple speech recognition app using the Web Speech API interfaces
Stars: ✭ 18 (-90.62%)
Mutual labels:  speech-synthesis, speech-to-text
AmazonSpeechTranslator
End-to-end Solution for Speech Recognition, Text Translation, and Text-to-Speech for iOS using Amazon Translate and Amazon Polly as AWS Machine Learning managed services.
Stars: ✭ 50 (-73.96%)
Mutual labels:  speech-synthesis, speech-to-text
Spokestack Python
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.
Stars: ✭ 103 (-46.35%)
Mutual labels:  speech-to-text, speech-synthesis
Awesome Ai Services
An overview of the AI-as-a-service landscape
Stars: ✭ 133 (-30.73%)
Mutual labels:  speech-to-text, speech-synthesis

A TensorFlow Implementation of Expressive Tacotron

This project implements the paper Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron in order to verify its concept. Most of the baseline code is based on my previous Tacotron implementation.

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3
  • librosa
  • tqdm
  • matplotlib
  • scipy
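
As a quick sanity check, the snippet below (my addition, not part of the repo) verifies the two version floors above and confirms that the remaining packages import:

```python
# Check the version floors from the requirements list; the other
# dependencies only need to be importable.
import numpy, tensorflow, librosa, tqdm, matplotlib, scipy

assert tuple(map(int, numpy.__version__.split(".")[:2])) >= (1, 11)
assert tuple(map(int, tensorflow.__version__.split(".")[:2])) >= (1, 3)
print("numpy", numpy.__version__, "| tensorflow", tensorflow.__version__)
```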

Data

Because the paper used internal data, I train the model on the LJ Speech Dataset.

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonable-quality samples.
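
For orientation, LJ Speech ships as a wavs/ folder plus a metadata.csv whose pipe-separated rows hold a file id, the raw text, and the normalized text. A minimal loading sketch with librosa (the folder layout follows the public LJSpeech-1.1 release, not anything specific to this repo):

```python
import codecs
import librosa

# metadata.csv rows look like: LJ001-0001|raw text|normalized text
with codecs.open("LJSpeech-1.1/metadata.csv", "r", "utf-8") as f:
    fname, _, text = f.readline().strip().split("|")

# Load the matching clip; LJ Speech audio is 22,050 Hz mono.
wav, sr = librosa.load("LJSpeech-1.1/wavs/{}.wav".format(fname), sr=22050)
print(fname, len(wav) / sr, "seconds:", text)
```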

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyperparameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)
  • STEP 2. Run python train.py. (If you set prepro to True, run python prepro.py first; see the sketch after this list.)
  • STEP 3. Run python eval.py regularly during training.
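
The sketch below shows how these steps fit together. The Hyperparams class mirrors my earlier Tacotron code, but the attribute names here are illustrative assumptions, not copied from this repo:

```python
# hyperparams.py (illustrative excerpt; attribute names are assumptions)
class Hyperparams:
    prepro = True            # STEP 1: True -> train from pre-extracted features
    data = "LJSpeech-1.1"    # path to the training data
    batch_size = 32

# STEP 2, with prepro = True:
#   $ python prepro.py   # extract and save spectrograms to disk first
#   $ python train.py
# STEP 3, in another shell, regularly during training:
#   $ python eval.py
```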

Sample Synthesis

I generate speech samples based on the same script as the one used for the original web demo; you can find it in test_sents.txt.

  • Run python synthesize.py and check the files in samples.

Samples

Sixteen sample sentences from the first chapter of the original web demo are collected for sample synthesis. Two audio clips per sentence are used for prosody embedding: a reference voice and a base voice. In most cases, the two differ in gender or region. Each sample set is organized as follows (see the sketch after this list):

  • 1a: the first reference audio
  • 1b: sample embedded with 1a's prosody
  • 1c: the second reference audio (base)
  • 1d: sample embedded with 1c's prosody
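
Both reference clips enter the model as mel spectrograms. A minimal sketch of that front end, assuming common Tacotron-style settings (22,050 Hz audio, 80 mel bands, 12.5 ms frame shift) rather than this repo's exact hyperparameters:

```python
import librosa
import numpy as np

def reference_mel(path, sr=22050, n_fft=2048, hop_length=275, n_mels=80):
    """Log-mel spectrogram of a reference clip, shaped (time, n_mels)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-5).T  # log-compress; transpose to time-major

ref = reference_mel("1a.wav")    # hypothetical path to the first reference audio
```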

Check out the samples at each training step.

Analysis

  • Listening to the results at 130k steps, it's not clear whether the model has learned the prosody.
  • It is clear, however, that different reference audios produce different samples.
  • Some samples are worthy of note. For example, listen to the four audio clips of no. 15: the stress on the word "right" is clearly transferred.
  • Check out no. 9, whose reference audios are sung. They are fun.

Notes

  • Because this repo focuses on investigating the concept of the paper, I did not follow some of its details.
  • The paper used phoneme inputs, whereas I stuck to graphemes.
  • The paper used GMM attention, whereas I kept the Bahdanau attention.
  • The original audio samples were generated with a WaveNet vocoder.
  • I'm still not convinced that what the paper claims to be a prosody embedding can truly be isolated from the speaker.
  • For the prosody embedding, the authors employed conv2d. Why not conv1d? (A sketch of the reference encoder follows this list.)
  • When the reference audio's text or sentence structure is totally different from the inference script, what happens?
  • If I have time, I'd like to implement their 2nd paper: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
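
On the conv2d question above: here is a minimal sketch of the paper's reference encoder (six strided 3x3 conv2d layers, a 128-unit GRU, and a tanh bottleneck), written against the TensorFlow 1.x API this repo targets. It is illustrative, not this repo's code:

```python
import tensorflow as tf

def reference_encoder(ref_mels, is_training, embed_size=128):
    """ref_mels: (batch, time, n_mels) log-mel spectrogram; n_mels must be static."""
    x = tf.expand_dims(ref_mels, -1)                # -> (batch, time, n_mels, 1)
    for filters in (32, 32, 64, 64, 128, 128):      # six strided conv2d layers
        x = tf.layers.conv2d(x, filters, kernel_size=3, strides=2, padding="same")
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
    # Fold the downsampled mel axis into channels; the time axis stays dynamic.
    freq = x.shape[2].value * x.shape[3].value
    dyn = tf.shape(x)
    x = tf.reshape(x, [dyn[0], dyn[1], freq])
    # A GRU summarizes the clip over time; its final state is the clip summary.
    _, state = tf.nn.dynamic_rnn(tf.nn.rnn_cell.GRUCell(128), x, dtype=tf.float32)
    # Squash to a fixed-size prosody embedding in (-1, 1).
    return tf.layers.dense(state, embed_size, activation=tf.nn.tanh)
```

One plausible reason for conv2d is that it shares weights along the frequency axis as well as time, instead of treating each mel band as a separate input channel as a conv1d stack would.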

April 2018, Kyubyong Park
