Cross-Speaker-Emotion-Transfer - PyTorch Implementation

PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.
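Speaker Condition Layer Normalization (SCLN), the core idea of the paper, replaces the fixed gain and bias of layer normalization with values predicted from a speaker embedding, so that emotion learned from one speaker can be rendered in another speaker's voice. The following is a minimal PyTorch sketch of the idea; the module name, dimensions, and initialization are illustrative and do not mirror this repository's exact code.

import torch
import torch.nn as nn

class SpeakerConditionLayerNorm(nn.Module):
    """Layer norm whose per-channel gain/bias are predicted from a speaker embedding (sketch)."""
    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        # Linear projections from the speaker embedding to per-channel gain and bias.
        self.to_gain = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)
        # Initialize so the module starts out as a plain layer norm (gain 1, bias 0).
        nn.init.zeros_(self.to_gain.weight)
        nn.init.ones_(self.to_gain.bias)
        nn.init.zeros_(self.to_bias.weight)
        nn.init.zeros_(self.to_bias.bias)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True, unbiased=False)
        normalized = (x - mean) / (std + 1e-8)
        gain = self.to_gain(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return gain * normalized + bias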

Audio Samples

Audio samples are available at /demo.

Quickstart

In the following documents, DATASET refers to the name of a supported dataset, such as RAVDESS.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, install fairseq (official document, github) to use LConvBlock. Please check here to resolve any issues with installing it. Note that a Dockerfile is provided for Docker users, but fairseq still has to be installed manually.
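As a rough illustration of how fairseq comes into play, the decoder's LConv blocks can be built on top of fairseq's LightweightConv module. The wrapper below is a hedged sketch: the class name, hyperparameters, and residual/normalization layout are assumptions for illustration, not this repository's exact block.

import torch
import torch.nn as nn
from fairseq.modules import LightweightConv  # requires the manual fairseq install noted above

class LConvBlockSketch(nn.Module):
    """Illustrative lightweight-convolution block (not the repository's exact implementation)."""
    def __init__(self, hidden_dim: int = 256, kernel_size: int = 15, num_heads: int = 8):
        super().__init__()
        self.conv = LightweightConv(
            hidden_dim,
            kernel_size,
            padding_l=kernel_size // 2,
            num_heads=num_heads,
            weight_softmax=True,
        )
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LightweightConv expects (time, batch, hidden_dim), so transpose in and out.
        residual = x
        x = self.conv(x.transpose(0, 1)).transpose(0, 1)
        return self.norm(x + residual)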

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

To extract soft emotion tokens from a reference audio, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET

Or, to use hard emotion tokens from an emotion id, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
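If you are unsure which SPEAKER_ID to pass, the speaker dictionary can be inspected with a few lines of Python. This assumes speakers.json is a simple name-to-index mapping, which is how the preprocessing script typically stores it; the dataset name below is just an example.

import json

# Replace RAVDESS with the dataset you preprocessed.
with open("preprocessed_data/RAVDESS/speakers.json") as f:
    speakers = json.load(f)

# Print each speaker name with its numeric ID.
for name, idx in speakers.items():
    print(idx, name)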

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.

Training

Datasets

The supported datasets are

  • RAVDESS: The speech portion of RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. RAVDESS features 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. The filename convention is summarized in the sketch below.
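Each RAVDESS filename encodes its metadata as seven dash-separated two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The helper below decodes the emotion field; it is a convenience sketch and not part of this repository.

# Decode the emotion from a RAVDESS filename such as "03-01-06-01-02-01-12.wav".
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_emotion(filename: str) -> str:
    fields = filename.split(".")[0].split("-")
    return RAVDESS_EMOTIONS[fields[2]]  # the third field is the emotion code

print(ravdess_emotion("03-01-06-01-02-01-12.wav"))  # -> fearful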

Your own language and dataset can be adapted following here.

Preprocessing

  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model from philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself (an example command is shown after the preprocessing step below).

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
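If you run the aligner yourself instead of using the pre-extracted alignments, a typical Montreal Forced Aligner 2.x invocation looks roughly like the following; the corpus path, dictionary, and acoustic-model name are placeholders, and the exact command may differ across MFA versions.

mfa align path/to/DATASET_corpus path/to/lexicon.dict english_us_arpa preprocessed_data/DATASET/TextGrid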
    

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • To use Automatic Mixed Precision, append the --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
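For example, assuming GPUs 0 and 1 are the ones you want to use, the two options can be combined as

CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET --use_amp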

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Notes

  • The current implementation is not trained in a semi-supervised way due to the small dataset size, but semi-supervised training can be enabled by specifying target speakers and passing no emotion ID (with no emotion classifier loss).
  • In the decoder, a 15 × 1 LConv block is used instead of 17 × 1 due to memory issues.
  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using a pre-trained model from philipperemy's DeepSpeaker (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker trained on the RAVDESS dataset shows clear separation among speakers in a t-SNE plot of the extracted speaker embeddings (a sketch of how to produce such a plot is given after this list).

  • For vocoder, HiFi-GAN and MelGAN are supported.
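A plot like the one mentioned above can be produced from saved embeddings with scikit-learn's t-SNE; the snippet below is a sketch with hypothetical file names (speaker_embeddings.npy, speaker_labels.npy), not an artifact of this repository.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one embedding per utterance plus its speaker label.
embeddings = np.load("speaker_embeddings.npy")  # shape: (num_utterances, embed_dim)
speakers = np.load("speaker_labels.npy")        # shape: (num_utterances,)

# Project the embeddings to 2-D and color the points by speaker.
points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
for spk in np.unique(speakers):
    mask = speakers == spk
    plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(spk))
plt.legend(fontsize=6)
plt.title("t-SNE of speaker embeddings")
plt.savefig("speaker_embedding_tsne.png", dpi=200)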

Citation

Please cite this repository via the "Cite this repository" button in the About section (top right of the main page).
