
coqui-ai / Tts Papers

🐸 collection of TTS papers

Projects that are alternatives to or similar to Tts Papers

Aeneas
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
Stars: ✭ 1,942 (+1113.75%)
Mutual labels:  speech, tts
Tacotron
A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Stars: ✭ 1,756 (+997.5%)
Mutual labels:  speech, tts
Tacotron
Audio samples accompanying publications related to Tacotron, an end-to-end speech synthesis model.
Stars: ✭ 493 (+208.13%)
Mutual labels:  speech, tts
Voice Builder
An opensource text-to-speech (TTS) voice building tool
Stars: ✭ 362 (+126.25%)
Mutual labels:  speech, tts
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+1203.13%)
Mutual labels:  papers, tts
Tts
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Stars: ✭ 305 (+90.63%)
Mutual labels:  speech, tts
Wsay
Windows "say"
Stars: ✭ 36 (-77.5%)
Mutual labels:  speech, tts
Fre-GAN-pytorch
Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
Stars: ✭ 73 (-54.37%)
Mutual labels:  speech, tts
Gtts
Python library and CLI tool to interface with Google Translate's text-to-speech API
Stars: ✭ 1,303 (+714.38%)
Mutual labels:  speech, tts
Tts
Tools to convert text to speech 📚💬
Stars: ✭ 84 (-47.5%)
Mutual labels:  speech, tts
Tts
🤖 💬 Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Stars: ✭ 5,427 (+3291.88%)
Mutual labels:  speech, tts
Tts
Text-to-Speech for Arduino
Stars: ✭ 118 (-26.25%)
Mutual labels:  speech, tts
Android Speech
Android speech recognition and text to speech made easy
Stars: ✭ 310 (+93.75%)
Mutual labels:  speech, tts
Cboard
AAC communication system with text-to-speech for the browser
Stars: ✭ 437 (+173.13%)
Mutual labels:  speech, tts
editts
Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech
Stars: ✭ 74 (-53.75%)
Mutual labels:  speech, tts
Lightspeech
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
Stars: ✭ 31 (-80.62%)
Mutual labels:  speech, tts
ttslearn
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)
Stars: ✭ 158 (-1.25%)
Mutual labels:  speech, tts
spokestack-android
Extensible Android mobile voice framework: wakeword, ASR, NLU, and TTS. Easily add voice to any Android app!
Stars: ✭ 52 (-67.5%)
Mutual labels:  speech, tts
Dc tts
A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
Stars: ✭ 1,017 (+535.63%)
Mutual labels:  speech, tts
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (-30.62%)
Mutual labels:  speech, tts

(Feel free to suggest changes)

Papers

End-to-End Adversarial Text-to-Speech: http://arxiv.org/abs/2006.03575
  • End-to-end feed-forward TTS learning.
  • Character alignment is done with a separate aligner module.
  • The aligner predicts the length of each character.
  • The center location of a character is found w.r.t. the total length of the previous characters.
  • Character positions are interpolated with a Gaussian window w.r.t. the real audio length (a sketch of this upsampling follows these notes).
    • The audio output is computed in the mu-law domain. (I don't see the reasoning for this.)
    • Only 2-second audio windows are used for training.
    • The GAN-TTS generator is used to produce the audio signal.
    • RWD (a Random Window Discriminator) is used as an audio-level discriminator.
    • MelD: They use the BigGAN-deep architecture as a spectrogram-level discriminator, regarding the problem as image reconstruction.
    • Spectrogram loss
      • Using only adversarial feedback is not enough to learn the character alignments, so they use a spectrogram loss between predicted and ground-truth spectrograms.
      • Note that the model predicts audio signals; the spectrograms above are computed from the generated audio.
      • Dynamic Time Warping is used to compute a minimal-cost alignment between the generated and ground-truth spectrograms, via a dynamic programming approach.
    • An aligner length loss is used to penalize the aligner for predicting a length different from the real audio length.
    • They train the model on a multi-speaker dataset but report results on the best-performing speaker.
    • Ablation study, importance of each component: (LengthLoss and SpectrogramLoss) > RWD > MelD > Phonemes > MultiSpeakerDataset.
    • My 2 cents: It is a feed-forward model that provides end-to-end speech synthesis with no need to train a separate vocoder model. However, it is a very complicated model with many hyperparameters and implementation details, and the final result is not close to the state of the art. I think we need specific algorithms for learning character alignments, which would reduce the need to tune a combination of different algorithms.
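A minimal sketch of the aligner's Gaussian upsampling described above, under my own assumptions about shapes and the temperature; the function and variable names are illustrative, not taken from the paper or an official implementation.

```python
import numpy as np

def gaussian_upsample(char_feats, char_lengths, n_out_frames, sigma=10.0):
    """char_feats: (N, D) character features; char_lengths: (N,) predicted lengths in frames."""
    ends = np.cumsum(char_lengths)                     # end position of each character
    centers = ends - 0.5 * char_lengths                # center position of each character
    t = np.arange(n_out_frames)[:, None]               # (T, 1) output frame positions
    logits = -((t - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over characters per frame
    return weights @ char_feats                        # (T, D) frame-rate features

# Example: 5 characters with 64-dim features, total predicted length of 200 frames.
feats = np.random.randn(5, 64)
lengths = np.array([30.0, 50.0, 40.0, 45.0, 35.0])
print(gaussian_upsample(feats, lengths, n_out_frames=200).shape)  # (200, 64)
```
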
FastSpeech 2: http://arxiv.org/abs/2006.04558
  • Phoneme durations generated by MFA (Montreal Forced Aligner) are used as labels to train a length regulator (see the sketch below).
  • They use frame-level F0 and L2 spectrogram norms (variance information) as additional features.
  • A variance predictor module predicts the variance information at inference time.
  • Ablation study, result improvements: model < model + L2_norm < model + L2_norm + F0
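A hedged sketch of a FastSpeech 2-style length regulator, the piece that consumes the MFA durations above; shapes, names, and the NumPy framing are my assumptions, not the paper's code.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Repeat each phoneme hidden state durations[i] times to reach spectrogram-frame rate."""
    return np.repeat(phoneme_hidden, durations, axis=0)   # (sum(durations), D)

# At training time durations come from MFA alignments; F0 and the L2 spectrogram norm
# (the variance information) are frame-level targets. At inference the variance
# predictors produce them from the hidden states instead.
hidden = np.random.randn(4, 8)        # 4 phonemes, 8-dim encoder outputs
durations = np.array([3, 5, 2, 4])    # frame counts per phoneme (from MFA at training time)
print(length_regulate(hidden, durations).shape)  # (14, 8) -> decoder input at frame rate
```
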
Glow-TTS: https://arxiv.org/pdf/2005.11129.pdf
  • Monotonic Alignment Search (MAS) is used to learn the alignment between text and spectrogram (a sketch follows these notes).
  • This alignment is used to train a duration predictor to be used at inference.
  • The encoder maps each character to a Gaussian distribution.
  • The decoder maps each spectrogram frame to a latent vector using normalizing flows (Glow layers).
  • Encoder and Decoder outputs are aligned with MAS.
  • At each iteration, the most probable alignment is first found by MAS, and this alignment is then used to update the model parameters.
  • A duration predictor is trained to predict the number of spectrogram frames for each character.
  • At inference, only the duration predictor is used instead of MAS.
  • The encoder has the architecture of the TTS Transformer with two updates:
  • Instead of absolute positional encoding, they use relative positional encoding.
  • They also use a residual connection for the Encoder Prenet.
  • Decoder has the same architecture as the Glow model.
  • They train both single and multi-speaker model.
  • It is shown experimentally that Glow-TTS is more robust on long sentences than the original Tacotron 2.
  • 15x faster than Tacotron 2 at inference.
  • My 2 cents: Their samples do not sound as natural as Tacotron's. I believe regular attention models still generate more natural speech, since the attention learns to map characters to model outputs directly. However, Glow-TTS might be a good alternative for hard datasets.
  • Samples: https://github.com/jaywalnut310/glow-tts
  • Repository: https://github.com/jaywalnut310/glow-tts
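A rough sketch of the Monotonic Alignment Search idea referenced above: dynamic programming over a characters-by-frames log-likelihood matrix, keeping the alignment monotonic. The variable names and the simplified backtracking are mine; the official implementation is in the repository linked above.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p[i, j]: log-likelihood of spectrogram frame j under the Gaussian of character i."""
    n_chars, n_frames = log_p.shape
    Q = np.full((n_chars, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(n_chars):
            stay = Q[i, j - 1]                                 # frame j stays on the same character
            move = Q[i - 1, j - 1] if i > 0 else -np.inf       # or advances to the next character
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack: every frame gets exactly one character, monotonically left to right.
    alignment = np.zeros(n_frames, dtype=int)
    i = n_chars - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] > Q[i, j - 1]:
            i -= 1
    return alignment  # per-character durations: np.bincount(alignment, minlength=n_chars)

print(monotonic_alignment_search(np.random.randn(5, 20)))
```
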
Non-Autoregressive Neural Text-to-Speech: http://arxiv.org/abs/1905.08459
  • A derivation of the Deep Voice 3 model using non-causal convolutional layers.
  • A teacher-student paradigm is used to train a non-autoregressive student with multiple attention blocks from an autoregressive teacher model.
  • The teacher is used to generate text-to-spectrogram alignments to be used by the student model.
  • The model is trained with two loss functions for attention alignment and spectrogram generation.
  • Multiple attention blocks refine the attention alignment layer by layer.
  • The student uses dot-product attention with query, key and value vectors. The query is only the positional encoding vectors; the key and the value are the encoder outputs (see the sketch below).
  • The proposed model is heavily tied to the positional encoding, which also relies on different constant values.
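A sketch of the attention described above, where the query is purely positional; the sinusoidal encoding, the position_rate knob, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def positional_encoding(length, dim, position_rate=1.0):
    """Standard sinusoidal positional encoding, optionally scaled by a position rate."""
    pos = np.arange(length)[:, None] * position_rate
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def positional_attention(enc_keys, enc_values, n_out_frames):
    """enc_keys/enc_values: (N, D) encoder outputs; the query carries no content, only position."""
    d = enc_keys.shape[1]
    query = positional_encoding(n_out_frames, d)               # (T, D)
    scores = query @ enc_keys.T / np.sqrt(d)                   # scaled dot-product
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)              # softmax over encoder steps
    return weights @ enc_values, weights                       # (T, D) context, (T, N) alignment

keys = values = np.random.randn(12, 64)
context, align = positional_attention(keys, values, n_out_frames=80)
print(context.shape, align.shape)  # (80, 64) (80, 12)
```
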
Double Decoder Consistency: https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency
  • The model uses a Tacotron-like architecture but with two decoders and a postnet.
  • DDC uses two synchronous decoders with different reduction rates.
  • Because the decoders use different reduction rates, they compute outputs at different granularities and learn different aspects of the input data.
  • The model uses the consistency between these two decoders to increase the robustness of the learned text-to-spectrogram alignment (a sketch follows these notes).
  • The model also refines the final decoder output by applying the postnet iteratively multiple times.
  • DDC uses Batch Normalization in the prenet module and drops Dropout layers.
  • DDC uses gradual training to reduce the total training time.
  • We use a Multi-Band MelGAN generator as a vocoder, trained with multiple Random Window Discriminators, differently from the original work.
  • We are able to train a DDC model in only 2 days with a single GPU, and the final model generates speech faster than real time on a CPU.

Demo page: https://erogol.github.io/ddc-samples/
Code: https://github.com/mozilla/TTS
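A minimal sketch of a DDC-style consistency term, assuming it is computed between the two decoders' attention alignments after the coarse alignment is upsampled to the fine time resolution; the names, the L1 choice, and the exact matching are my assumptions rather than the blog post's precise formulation.

```python
import numpy as np

def ddc_consistency_loss(attn_fine, attn_coarse, r_fine=2, r_coarse=6):
    """attn_fine: (T_dec_fine, N) alignment; attn_coarse: (T_dec_coarse, N) alignment."""
    repeat = r_coarse // r_fine
    attn_coarse_up = np.repeat(attn_coarse, repeat, axis=0)    # match the fine decoder's time axis
    T = min(attn_fine.shape[0], attn_coarse_up.shape[0])
    return np.abs(attn_fine[:T] - attn_coarse_up[:T]).mean()   # L1 consistency penalty

attn_fine = np.random.rand(120, 40)    # fine decoder: more decoder steps over 40 input characters
attn_coarse = np.random.rand(40, 40)   # coarse decoder: fewer steps for the same utterance
print(ddc_consistency_loss(attn_fine, attn_coarse))
```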

Multi-Speaker Papers

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation: http://arxiv.org/abs/2005.08024
  • Train a multi-speaker TTS model with only an hour of paired data (text-to-voice alignment) and more unpaired (voice-only) data.
  • It learns a codebook in which each codeword corresponds to a single phoneme.
  • The codebook is aligned to phonemes using the paired data and the CTC algorithm.
  • This codebook functions as a proxy to implicitly estimate the phoneme sequence of the unpaired data (see the sketch below).
  • They stack a Tacotron 2 model on top to perform TTS using the codeword embeddings generated by the initial part of the model.
  • They beat the benchmark methods in the 1-hour paired data setting.
  • They don't report full paired data results.
  • They don't provide a thorough ablation study, which would be interesting for seeing how different parts of the model contribute to the performance.
  • They use Griffin-Lim as the vocoder, so there is room for improvement.

Demo page: https://ttaoretw.github.io/multispkr-semi-tts/demo.html
Code: https://github.com/ttaoREtw/semi-tts
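A sketch of the discrete-representation idea above: frame-level features are snapped to their nearest codeword, and the codewords act as proxy phonemes for the unpaired audio. This nearest-neighbour lookup is my illustration of the mechanism, not the paper's code.

```python
import numpy as np

def quantize(frame_feats, codebook):
    """frame_feats: (T, D) frame features; codebook: (K, D), roughly one codeword per phoneme."""
    # squared Euclidean distance between every frame and every codeword
    dists = ((frame_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (T, K)
    codes = dists.argmin(axis=1)             # proxy phoneme index per frame
    return codes, codebook[codes]            # indices and the quantized embeddings fed to the TTS model

feats = np.random.randn(100, 32)             # 100 frames of features from unpaired audio
codebook = np.random.randn(40, 32)           # ~40 phoneme-like codewords
codes, embeddings = quantize(feats, codebook)
print(codes.shape, embeddings.shape)         # (100,) (100, 32)
```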

Attentron: Few-shot Text-to-Speech Exploiting Attention-based Variable Length Embedding: https://arxiv.org/abs/2005.08484
  • Two encoders are used to learn speaker-dependent features.
  • The coarse encoder learns a global speaker embedding vector from the provided reference spectrograms.
  • The fine encoder learns a variable-length embedding that keeps the temporal dimension, in cooperation with an attention module (see the sketch below).
  • The attention selects important reference spectrogram frames to synthesize the target speech.
  • The model is first pre-trained with a single-speaker dataset (LJSpeech, 30k iterations).
  • It is then fine-tuned with a multi-speaker dataset (VCTK, 70k iterations).
  • It achieves slightly better metrics compared to using x-vectors from a speaker classification model or a VAE-based reference audio encoder.

Demo page: https://hyperconnect.github.io/Attentron/
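A sketch of the fine encoder's attention over reference frames as described above; the dot-product form, the shapes, and the names are assumptions for illustration rather than the paper's architecture details.

```python
import numpy as np

def fine_encoder_attention(decoder_states, ref_frame_encodings):
    """decoder_states: (T, D) synthesizer states; ref_frame_encodings: (R, D) encoded reference frames."""
    d = decoder_states.shape[1]
    scores = decoder_states @ ref_frame_encodings.T / np.sqrt(d)    # (T, R)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                   # attend over reference frames
    return weights @ ref_frame_encodings                            # (T, D) variable-length speaker features

dec = np.random.randn(50, 128)       # states for the target utterance being synthesized
ref = np.random.randn(300, 128)      # frames encoded from a few reference clips of the target speaker
print(fine_encoder_attention(dec, ref).shape)   # (50, 128)
```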

Attention

Vocoders

WaveGrad: https://arxiv.org/pdf/2009.00713.pdf
  • It is based on probability diffusion and Langevin dynamics.
  • The basic idea is to learn a function that iteratively maps a known distribution to the target data distribution (see the sketch below).
  • They report a 0.2 real-time factor on a GPU, but CPU performance is not shared.
  • In the linked implementation below, the author reports that the model converges after 2 days of training on a single GPU.
  • The MOS scores in the paper are not comprehensive enough but show performance comparable to known models like WaveRNN and WaveNet.

Code: https://github.com/ivanvovk/WaveGrad
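A very rough sketch of the iterative refinement loop behind this family of models: start from noise and repeatedly apply a learned denoising step conditioned on the mel spectrogram. The model interface, the noise schedule, and the simplified DDPM-style update are my assumptions, not the paper's or the linked repository's exact sampler.

```python
import numpy as np

def sample(model, mel, noise_schedule, audio_len):
    """model(y_noisy, mel, noise_level) -> predicted noise; noise_schedule: (n_steps,) betas."""
    y = np.random.randn(audio_len)                       # start from Gaussian noise
    alphas = 1.0 - noise_schedule
    alpha_bars = np.cumprod(alphas)
    for t in reversed(range(len(noise_schedule))):
        eps = model(y, mel, np.sqrt(alpha_bars[t]))      # predict the injected noise at this level
        # remove the predicted noise (simplified ancestral-sampling update)
        y = (y - noise_schedule[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            y += np.sqrt(noise_schedule[t]) * np.random.randn(audio_len)
    return y                                             # waveform estimate after the refinement loop

# toy stand-in for the trained network, just to make the sketch runnable
dummy_model = lambda y, mel, noise_level: np.zeros_like(y)
betas = np.linspace(1e-4, 0.05, 50)
print(sample(dummy_model, mel=None, noise_schedule=betas, audio_len=16000).shape)  # (16000,)
```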

From the Internet (Blogs, Videos, etc.)

Videos

Paper Discussion

Talks

General

Blogs
