keonlee9420 / Daft-Exprt

License: MIT

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Labels

text-to-speech style pytorch tts speech-synthesis english speaker prosody neural-tts non-autoregressive prosody-transfer gaussian-upsampling

Projects that are alternatives of or similar to Daft-Exprt

Parallel-Tacotron2

PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Stars: ✭ 149 (+263.41%)

Mutual labels: text-to-speech, tts, speech-synthesis, english, neural-tts, non-autoregressive

PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Stars: ✭ 55 (+34.15%)

Mutual labels: text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive

Cross-Speaker-Emotion-Transfer

PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech

Stars: ✭ 107 (+160.98%)

Mutual labels: text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Stars: ✭ 66 (+60.98%)

Mutual labels: text-to-speech, tts, speech-synthesis, neural-tts, non-autoregressive

Expressive-FastSpeech2

PyTorch Implementation of Non-autoregressive Expressive (emotional, conversational) TTS based on FastSpeech2, supporting English, Korean, and your own languages.

Stars: ✭ 139 (+239.02%)

Mutual labels: text-to-speech, tts, speech-synthesis, non-autoregressive

Official implementation of Meta-StyleSpeech and StyleSpeech

Stars: ✭ 161 (+292.68%)

Mutual labels: text-to-speech, tts, speech-synthesis, neural-tts

Comprehensive-Tacotron2

PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single-, multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model.

Stars: ✭ 22 (-46.34%)

Mutual labels: text-to-speech, tts, speech-synthesis, neural-tts

Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Stars: ✭ 33 (-19.51%)

Mutual labels: text-to-speech, tts, speech-synthesis

Crystal - C++ implementation of a unified framework for multilingual TTS synthesis engine with SSML specification as interface.

Stars: ✭ 108 (+163.41%)

Mutual labels: text-to-speech, tts, speech-synthesis

Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.

Stars: ✭ 111 (+170.73%)

Mutual labels: text-to-speech, tts, speech-synthesis

Text to Speech with PyTorch (English and Mongolian)

Stars: ✭ 122 (+197.56%)

Mutual labels: text-to-speech, tts, speech-synthesis

MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java

Stars: ✭ 1,699 (+4043.9%)

Mutual labels: text-to-speech, tts, speech-synthesis

😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)

Stars: ✭ 2,382 (+5709.76%)

Mutual labels: text-to-speech, tts, speech-synthesis

Desktop application for neural speech synthesis written in C++

Stars: ✭ 140 (+241.46%)

Mutual labels: text-to-speech, tts, speech-synthesis

WaveRNN Vocoder + TTS

Stars: ✭ 1,636 (+3890.24%)

Mutual labels: text-to-speech, tts, speech-synthesis

Spokestack Python

Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.

Stars: ✭ 103 (+151.22%)

Mutual labels: text-to-speech, tts, speech-synthesis

Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.

Stars: ✭ 295 (+619.51%)

Mutual labels: text-to-speech, tts, speech-synthesis

react-native-spokestack

Spokestack: give your React Native app a voice interface!

Stars: ✭ 53 (+29.27%)

Mutual labels: text-to-speech, tts, speech-synthesis

open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

Stars: ✭ 841 (+1951.22%)

Mutual labels: text-to-speech, tts, speech-synthesis

Windows "say"

Stars: ✭ 36 (-12.2%)

Mutual labels: text-to-speech, tts, speech-synthesis

Daft-Exprt - PyTorch Implementation

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Validation logs of the synthesized mel-spectrogram and alignment up to 70K steps are shown below (VCTK_val_p237-088).

Quickstart

In the commands below, DATASET refers to the name of a dataset, such as VCTK.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO

to synthesize speech in the style of the reference audio at REF_AUDIO. The dictionary of learned speakers can be found at preprocessed_data/VCTK/speakers.json, and the generated utterances will be saved in output/result/.
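
For scripted use, the speaker lookup and the command above can be wrapped in a few lines of Python. This is a minimal sketch and not part of the repository: it assumes that speakers.json maps speaker names to integer IDs (an assumption about the file layout), and the helper names `load_speakers` and `build_synthesize_command` are hypothetical. Only the flags shown in the command above are used.

```python
import json
import shlex


def load_speakers(path):
    """Load the speaker dictionary (assumed layout: {"speaker_name": id, ...})."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_synthesize_command(text, speaker_id, restore_step, dataset, ref_audio):
    """Build the single-utterance synthesize.py command as an argv list."""
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--speaker_id", str(speaker_id),
        "--restore_step", str(restore_step),
        "--mode", "single",
        "--dataset", dataset,
        "--ref_audio", ref_audio,
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_synthesize_command("Hello world.", 0, 70000, "VCTK", "ref.wav")
    # shlex.join quotes the text argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the argv list to `subprocess.run(cmd, check=True)` from the repository root would run the same command without going through a shell.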

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt, using each utterance's own audio as its reference.

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET --ref_audio REF_AUDIO --duration_control 0.8 --energy_control 0.8
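
To make the effect of the ratios concrete, here is a minimal sketch. It assumes that each control value is a multiplicative factor applied to the model's predicted per-phoneme values, as is common in FastSpeech2-style implementations. This is an assumption, not a description of this repository's code, and `apply_control` and `speaking_rate_change` are hypothetical helpers.

```python
def apply_control(values, ratio):
    """Scale predicted per-phoneme values (e.g. durations or energies) by a ratio."""
    return [v * ratio for v in values]


def speaking_rate_change(duration_ratio):
    """Relative change in speaking rate when every duration is scaled by duration_ratio."""
    return 1.0 / duration_ratio - 1.0


if __name__ == "__main__":
    durations = [5.0, 10.0, 8.0]  # hypothetical predicted durations, in frames
    scaled = apply_control(durations, 0.8)
    # Fraction of the original utterance length that remains after scaling.
    print(sum(scaled) / sum(durations))
    # Relative speed-up implied by the shorter durations.
    print(speaking_rate_change(0.8))
```

Under this assumption, a duration ratio of 0.8 shortens the utterance to 80% of its original length, which works out to a speaking rate roughly 25% higher, so the 20% figure is best read as a 20% reduction in duration.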

Training

Datasets

The supported datasets are

VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
Any other multi-speaker TTS dataset (e.g., LibriTTS) can be added by following the VCTK setup.

Preprocessing

For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.
Run
```
python3 prepare_align.py --dataset DATASET
```
for some preparations.

For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. Unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

After that, run the preprocessing script with
```
python3 preprocess.py --dataset DATASET
```
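
For intuition, the forced alignments give a start and end time for each phoneme, and preprocessing in FastSpeech2-style pipelines typically turns these into a number of mel-spectrogram frames per phoneme. The sketch below shows one common way to do that conversion. The sampling rate, hop length, and function name are assumptions chosen for illustration, not values read from this repository's config.

```python
def intervals_to_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert (start_sec, end_sec) phoneme intervals into per-phoneme frame counts.

    Start and end times are rounded to frame indices separately, so for
    contiguous intervals the durations sum to the index of the final frame
    boundary instead of accumulating per-phoneme rounding errors.
    """
    durations = []
    for start, end in intervals:
        start_frame = int(round(start * sampling_rate / hop_length))
        end_frame = int(round(end * sampling_rate / hop_length))
        durations.append(end_frame - start_frame)
    return durations


if __name__ == "__main__":
    # Hypothetical alignment for three phonemes, in seconds.
    print(intervals_to_durations([(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)]))
```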

Training

Train your model with

python3 train.py --dataset DATASET

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Implementation Issues

RangeParameterPredictor is built with a BiLSTM rather than a single linear layer with softplus() activation (the latter is nevertheless implemented, under the name 'range_param_predictor_paper', in GaussianUpsampling).
Use a batch size of 16 instead of 48 due to memory issues.
Use log duration instead of raw duration.
Follow FastSpeech2 for the preprocessing of pitch and energy.
Two options are available for the speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker').
DeepSpeaker on the VCTK dataset clearly distinguishes between speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

For the vocoder, HiFi-GAN and MelGAN are supported.
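
Several of the points above (the softplus-activated range parameter, log durations, and Gaussian upsampling) fit together as sketched below. This is a plain-Python illustration of Gaussian upsampling as it is commonly formulated, not the repository's PyTorch code. The log(d + 1) duration convention, the frame-midpoint positions, and all function names are assumptions.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); maps any real to a positive range."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def decode_log_duration(log_d):
    """Invert a log(d + 1) duration target (one common convention, assumed here)."""
    return max(math.exp(log_d) - 1.0, 0.0)


def gaussian_upsampling_weights(durations, sigmas):
    """Return a T x N weight matrix, where T = round(sum(durations)).

    Token i is centered at c_i = sum(durations[:i + 1]) - durations[i] / 2, and
    row t holds the normalized Gaussian densities N(t; c_i, sigma_i^2) over tokens.
    """
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)
    weights = []
    for t in range(int(round(end))):
        pos = t + 0.5  # evaluate each frame at its midpoint (an assumption)
        # Log-density up to a constant that cancels in the normalization.
        log_scores = [
            -0.5 * ((pos - c) / s) ** 2 - math.log(s)
            for c, s in zip(centers, sigmas)
        ]
        peak = max(log_scores)  # subtract the max to avoid underflow
        scores = [math.exp(v - peak) for v in log_scores]
        total = sum(scores)
        weights.append([score / total for score in scores])
    return weights


if __name__ == "__main__":
    durations = [decode_log_duration(math.log(3.0)), 3.0]  # about 2 and 3 frames
    sigmas = [softplus(0.0), softplus(0.0)]
    for row in gaussian_upsampling_weights(durations, sigmas):
        print([round(w, 3) for w in row])
```

Each output frame is then a weighted sum of the token representations using one row of this matrix. Because the weights are smooth functions of the durations and ranges, the upsampling is differentiable.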

Citation

@misc{lee2021daft_exprt,
  author = {Lee, Keon},
  title = {Daft-Exprt},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Daft-Exprt}}
}

References

keonlee9420's WaveGrad2 for GaussianUpsampling and RangeParameterPredictor
keonlee9420's STYLER for the (domain) adversarial training of SpeakerClassifier
keonlee9420's StyleSpeech for the reference audio interface
FiLM: Visual Reasoning with a General Conditioning Layer
TADAM: Task dependent adaptive metric for improved few-shot learning

Stars: ✭ 41