
ivanvovk / Durian

License: BSD-3-Clause
Implementation of the paper "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf).

Programming Languages

python

Projects that are alternatives of or similar to Durian

Lightspeech
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
Stars: ✭ 31 (-72.07%)
Mutual labels:  speech, speech-synthesis, text-to-speech, tts
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (+45.05%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+120.72%)
Mutual labels:  speech, speech-synthesis, text-to-speech, tts
Wsay
Windows "say"
Stars: ✭ 36 (-67.57%)
Mutual labels:  speech, speech-synthesis, text-to-speech, tts
Voice Builder
An opensource text-to-speech (TTS) voice building tool
Stars: ✭ 362 (+226.13%)
Mutual labels:  speech, speech-synthesis, text-to-speech, tts
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (-70.27%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+165.77%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
Fre-GAN-pytorch
Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
Stars: ✭ 73 (-34.23%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
ttslearn
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)
Stars: ✭ 158 (+42.34%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (-2.7%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
spokestack-android
Extensible Android mobile voice framework: wakeword, ASR, NLU, and TTS. Easily add voice to any Android app!
Stars: ✭ 52 (-53.15%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
editts
Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech
Stars: ✭ 74 (-33.33%)
Mutual labels:  text-to-speech, speech, tts, speech-synthesis
Glow Tts
A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Stars: ✭ 284 (+155.86%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Spokestack Python
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.
Stars: ✭ 103 (-7.21%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Parakeet
PAddle PARAllel text-to-speech toolKIT (supporting WaveFlow, WaveNet, Transformer TTS and Tacotron2)
Stars: ✭ 279 (+151.35%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Cognitive Speech Tts
Microsoft Text-to-Speech API sample code in several languages, part of Cognitive Services.
Stars: ✭ 312 (+181.08%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Gtts
Python library and CLI tool to interface with Google Translate's text-to-speech API
Stars: ✭ 1,303 (+1073.87%)
Mutual labels:  speech, text-to-speech, tts
Multilingual text to speech
An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
Stars: ✭ 324 (+191.89%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Jsut Lab
HTS-style full-context labels for JSUT v1.1
Stars: ✭ 28 (-74.77%)
Mutual labels:  speech-synthesis, text-to-speech, tts
Comprehensive-Tacotron2
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.
Stars: ✭ 22 (-80.18%)
Mutual labels:  text-to-speech, tts, speech-synthesis

DurIAN

Implementation of the paper "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf).

Status: released

1 Info

DurIAN is an encoder-decoder architecture for the text-to-speech synthesis task. Unlike prior architectures such as Tacotron 2, it does not learn an attention mechanism; instead it uses phoneme duration information. Consequently, to train this model you need a phonemized, duration-aligned dataset. Alternatively, you may try the duration model pretrained on the LJSpeech dataset (using the CMU dictionary); links are provided below.
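For intuition, here is a minimal sketch (tensor names and shapes are illustrative assumptions, not this repository's code) of how encoder states can be expanded according to phoneme durations instead of relying on a learned attention alignment:

    import torch

    def expand_by_durations(encoder_outputs, durations):
        """Repeat each phoneme's encoder state `duration` times along the time axis.

        encoder_outputs: (num_phonemes, hidden_dim) tensor
        durations:       (num_phonemes,) integer tensor of per-phoneme frame counts
        Returns a (num_frames, hidden_dim) tensor that the decoder consumes frame by frame.
        """
        return torch.repeat_interleave(encoder_outputs, durations, dim=0)

    # Toy example: 3 phonemes lasting 2, 1, and 3 spectrogram frames respectively.
    states = torch.randn(3, 256)
    durations = torch.tensor([2, 1, 3])
    print(expand_by_durations(states, durations).shape)  # torch.Size([6, 256])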

2 Architecture details

The DurIAN model consists of two modules: a backbone synthesizer and a duration predictor. The most notable differences from the DurIAN described in the paper are:

  • Prosodic boundary markers aren't used (they weren't labeled in the dataset), so there is no 'skip states' exclusion of the prosodic boundaries' hidden states
  • Style codes aren't used either (same reason)
  • The Prenet before the CBHG encoder was removed (it didn't improve quality in experiments)
  • The decoder's recurrent cell outputs a single spectrogram frame at a time

The backbone synthesizer and the duration model are trained simultaneously. To simplify the implementation, the duration model predicts an alignment over a fixed maximum number of frames. This output can be trained as a BCE problem, as an MSE problem by summing over the frame axis, or with both losses combined (the last option hasn't been tested); set the choice in config.json. Experiments showed that BCE-only optimization is unstable on longer text sequences, so prefer MSE+BCE or MSE-only (don't worry if the alignments look bad in TensorBoard).
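As a rough illustration of these loss options (a minimal sketch with made-up tensor names and shapes, not the repository's actual code), the predicted alignment can be supervised directly with BCE against a 0/1 phoneme-to-frame alignment matrix, reduced to per-phoneme durations by summing over the frame axis and supervised with MSE, or both:

    import torch
    import torch.nn.functional as F

    # Hypothetical shapes: (batch, num_phonemes, max_frames)
    pred_alignment = torch.sigmoid(torch.randn(4, 20, 400))   # model output in [0, 1]
    true_alignment = (torch.rand(4, 20, 400) > 0.95).float()  # ground-truth 0/1 alignment

    # Option 1: binary cross-entropy directly on the alignment matrix.
    bce_loss = F.binary_cross_entropy(pred_alignment, true_alignment)

    # Option 2: sum over the frame axis to get per-phoneme durations, then MSE.
    pred_durations = pred_alignment.sum(dim=-1)
    true_durations = true_alignment.sum(dim=-1)
    mse_loss = F.mse_loss(pred_durations, true_durations)

    # Option 3 (untested according to the author): combine both.
    total_loss = bce_loss + mse_loss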

3 Reproducibility

You can listen to a synthesis demo wav file (obtained well before convergence) in the demo folder (a WaveGlow vocoder was used).

  1. First of all, make sure you have installed all required packages with pip install --upgrade -r requirements.txt. The code has been tested with pytorch==1.5.0.

  2. Clone the repository: git clone https://github.com/ivanvovk/DurIAN

  3. To start training the paper-based DurIAN version, run python train.py -c configs/default.json. To train the baseline model instead, run python train.py -c configs/baseline.json --baseline

To make sure everything works fine in your local environment, you may run the unit tests from the tests folder with python <test_you_want_to_run.py>.

4 Pretrained models

This implementation was trained on the phonemized, duration-aligned LJSpeech dataset with BCE duration-loss minimization. You can find the checkpoint via this link.

5 Dataset alignment problem

The main drawback of this model is that it requires a duration-aligned dataset. You can find the parsed LJSpeech filelists used to train the current implementation in the filelists folder. To use your own data, organize your filelists in the same way as the provided LJSpeech ones. However, to save time (and brain cells), you may try to train the model on your dataset without duration alignment by using the LJSpeech-pretrained duration model from my checkpoint (untested). If you want to align your own dataset instead, carefully follow the next section.

6 How to align your own data

In my experiments I aligned LJSpeech with the Montreal Forced Aligner (MFA) tool. If anything below is unclear, please follow the instructions in the toolkit's docs. The alignment procedure consists of several steps:

  1. Organize your dataset properly. MFA requires a single folder containing {utterance_id.lab, utterance_id.wav} pairs. Make sure all your transcripts are in .lab format.

  2. Download an MFA release and follow the installation instructions via this link.

  3. Once MFA is installed, you need a dictionary mapping your dataset's words to phoneme transcriptions. Here you have several options:

    1. (Try this first) Download a ready-made dictionary from the MFA pretrained models list (at the bottom of the page). In the current implementation I used the English ARPAbet dictionary. One caveat: if your dataset contains words that are missing from the dictionary, MFA may fail to parse them and skip such dataset files. You can drop those files, preprocess your dataset to match the dictionary, or add the missing words by hand (if there are not too many of them).
    2. You may generate the dictionary with a pretrained G2P model from the MFA pretrained models list using the command bin/mfa_generate_dictionary /path/to/model_g2p.zip /path/to/data dict.txt. Note that the default MFA installation already provides an English pretrained model, which you may use.
    3. Otherwise, you'll need to train your own G2P model on your data; to do so, follow the instructions via this link.
  4. Once you have your prepared data, dictionary, and G2P model, you are ready to align. Run the command bin/mfa_align /path/to/data dict.txt path/to/model_g2p.zip outdir and wait until it finishes. The outdir folder will contain a list of out-of-vocabulary words and a folder of .TextGrid files storing the wav alignments.

  5. Now we want to process these TextGrid files to build the final filelist. The Python package TextGrid is useful here; install it with pip install TextGrid. Here is an example of how to use it:

    import textgrid
    tg = textgrid.TextGrid.fromFile('./outdir/data/text0.TextGrid')
    

    Now tg is a collection of two tiers: the first contains the aligned words, the second contains the aligned phonemes. You need the second one. Extract durations (in frames! tg stores intervals in seconds, so convert them) for the whole dataset by iterating over the obtained .TextGrid files, and prepare a filelist in the same format as the ones provided in the filelists folder.
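    As an example, a minimal sketch of that conversion (the hop size, sampling rate, and tier index below are assumptions; adapt them to your own feature-extraction settings) might look like this:

    import textgrid

    SAMPLE_RATE = 22050   # assumed audio sampling rate
    HOP_LENGTH = 256      # assumed STFT hop size used for the mel spectrograms

    def textgrid_to_durations(path):
        """Read the phoneme tier of a .TextGrid file and return (phoneme, frames) pairs."""
        tg = textgrid.TextGrid.fromFile(path)
        phone_tier = tg[1]  # second tier: aligned phonemes
        durations = []
        for interval in phone_tier:
            seconds = interval.maxTime - interval.minTime
            frames = int(round(seconds * SAMPLE_RATE / HOP_LENGTH))
            durations.append((interval.mark, frames))
        return durations

    print(textgrid_to_durations('./outdir/data/text0.TextGrid'))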

I found an overview of several aligners that may be helpful. However, I recommend using MFA, as it is one of the most accurate aligners to the best of my knowledge.
