All Projects → keonlee9420 → Daft-Exprt

keonlee9420 / Daft-Exprt

Licence: MIT license
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Programming Languages

python
139335 projects - #7 most used programming language
Dockerfile
14818 projects

Projects that are alternatives of or similar to Daft-Exprt

Parallel-Tacotron2
PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
Stars: ✭ 149 (+263.41%)
Mutual labels:  text-to-speech, tts, speech-synthesis, english, neural-tts, non-autoregressive
WaveGrad2
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Stars: ✭ 55 (+34.15%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive
Cross-Speaker-Emotion-Transfer
PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech
Stars: ✭ 107 (+160.98%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive
VAENAR-TTS
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.
Stars: ✭ 66 (+60.98%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive
Expressive-FastSpeech2
PyTorch Implementation of Non-autoregressive Expressive (emotional, conversational) TTS based on FastSpeech2, supporting English, Korean, and your own languages.
Stars: ✭ 139 (+239.02%)
Mutual labels:  text-to-speech, tts, speech-synthesis, non-autoregressive
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (+292.68%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Comprehensive-Tacotron2
PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.
Stars: ✭ 22 (-46.34%)
Mutual labels:  text-to-speech, tts, speech-synthesis, neural-tts
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (-19.51%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Crystal
Crystal - C++ implementation of a unified framework for multilingual TTS synthesis engine with SSML specification as interface.
Stars: ✭ 108 (+163.41%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (+170.73%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Pytorch Dc Tts
Text to Speech with PyTorch (English and Mongolian)
Stars: ✭ 122 (+197.56%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Marytts
MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java
Stars: ✭ 1,699 (+4043.9%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Tensorflowtts
😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
Stars: ✭ 2,382 (+5709.76%)
Mutual labels:  text-to-speech, tts, speech-synthesis
TensorVox
Desktop application for neural speech synthesis written in C++
Stars: ✭ 140 (+241.46%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Wavernn
WaveRNN Vocoder + TTS
Stars: ✭ 1,636 (+3890.24%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Spokestack Python
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.
Stars: ✭ 103 (+151.22%)
Mutual labels:  text-to-speech, tts, speech-synthesis
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+619.51%)
Mutual labels:  text-to-speech, tts, speech-synthesis
react-native-spokestack
Spokestack: give your React Native app a voice interface!
Stars: ✭ 53 (+29.27%)
Mutual labels:  text-to-speech, tts, speech-synthesis
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+1951.22%)
Mutual labels:  text-to-speech, tts, speech-synthesis
Wsay
Windows "say"
Stars: ✭ 36 (-12.2%)
Mutual labels:  text-to-speech, tts, speech-synthesis

Daft-Exprt - PyTorch Implementation

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

The validation logs up to 70K of synthesized mel and alignment are shown below (VCTK_val_p237-088).

Quickstart

DATASET refers to the names of datasets such as VCTK in the following documents.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, Dockerfile is provided for Docker users.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO

to synthesize speech with the style of input audio at REF_AUDIO. The dictionary of learned speakers can be found at preprocessed_data/VCTK/speakers.json, and the generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances consuming themselves as a reference audio in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO --duration_control 0.8 --energy_control 0.8

Training

Datasets

The supported datasets are

  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
  • Any of multi-speaker TTS dataset (e.g., LibriTTS) can be added following VCTK.

Preprocessing

  • For a multi-speaker TTS with external speaker embedder, download ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and locate it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternately, you can run the aligner by yourself.

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
    

Training

Train your model with

python3 train.py --dataset DATASET

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Implementation Issues

  • RangeParameterPredictor is built with BiLSTM rather than a single linear layer with softplus() activation (it is however implemented and named as 'range_param_predictor_paper' in GaussianUpsampling).
  • Use 16 batch size instead of 48 due to memory issues.
  • Use log duration instead of normal duration.
  • Follow FastSpeech2 for the preprocess of pitch and energy.
  • Two options for embedding for the multi-speaker TTS setting: training speaker embedder from scratch or using a pre-trained philipperemy's DeepSpeaker model (as STYLER did). You can toggle it by setting the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding.

  • For vocoder, HiFi-GAN and MelGAN are supported.

Citation

@misc{lee2021daft_exprt,
  author = {Lee, Keon},
  title = {Daft-Exprt},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Daft-Exprt}}
}

References

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].