
License: MIT
Official repository of STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech, INTERSPEECH 2021




STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Keon Lee, Kyumin Park, Daeyoung Kim

In our paper, we propose STYLER, a non-autoregressive TTS framework with style factor modeling that achieves rapidity, robustness, expressivity, and controllability at the same time.

Abstract: Previous works on neural text-to-speech (TTS) have been addressed on limited speed in training and inference time, robustness for difficult synthesis conditions, expressiveness, and controllability. Although several approaches resolve some limitations, there has been no attempt to solve all weaknesses at once. In this paper, we propose STYLER, an expressive and controllable TTS framework with high-speed and robust synthesis. Our novel audio-text aligning method called Mel Calibrator and excluding autoregressive decoding enable rapid training and inference and robust synthesis on unseen data. Also, disentangled style factor modeling under supervision enlarges the controllability in synthesizing process leading to expressive TTS. On top of it, a novel noise modeling pipeline using domain adversarial training and Residual Decoding empowers noise-robust style transfer, decomposing the noise without any additional label. Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading style non-autoregressive TTS. Synthesis samples and experiment results are provided via our demo page, and code is available publicly.

Pretrained Models

You can download pretrained models.

Dependencies

Please install the python dependencies given in requirements.txt.

pip3 install -r requirements.txt

Training

Preparation

Clean Data

  1. Download the VCTK dataset and resample the audio to a 22,050 Hz sampling rate.
  2. We provide a bash script for the resampling. Refer to data/resample.sh for details.
  3. Put audio files and the corresponding text (transcript) files in the same directory. Each audio file and its text file must share the same name, excluding the extension.
  4. You may need to trim the audio for stable model convergence. Refer to Yeongtae's preprocess_audio.py for helpful preprocessing, including trimming.
  5. Modify hp.data_dir in hparams.py.
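As an alternative to data/resample.sh, the resampling step can be sketched in Python. This is a minimal illustration using scipy's polyphase resampler; it assumes 48 kHz mono source audio (as in VCTK), and the repo's own script remains authoritative.

```python
# Minimal resampling sketch (illustration only; data/resample.sh is authoritative).
# Assumes 48 kHz mono input, as in VCTK recordings.
from math import gcd

import numpy as np
from scipy.signal import resample_poly

SRC_SR = 48000  # VCTK source sampling rate (assumption)
DST_SR = 22050  # target sampling rate from the README

def resample_to_22050(wav: np.ndarray, src_sr: int = SRC_SR) -> np.ndarray:
    """Resample a mono waveform to 22,050 Hz via a rational-ratio polyphase filter."""
    g = gcd(DST_SR, src_sr)
    return resample_poly(wav, DST_SR // g, src_sr // g)

# One second of a 440 Hz tone at 48 kHz -> exactly 22,050 samples after resampling.
t = np.linspace(0, 1, SRC_SR, endpoint=False)
out = resample_to_22050(np.sin(2 * np.pi * 440 * t))
```

In practice you would read each file with a library such as soundfile and write the resampled waveform back out.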

Noisy Data

  1. Download the WHAM! dataset and resample the audio to a 22,050 Hz sampling rate.
  2. Modify the hp.noise_dir in hparams.py.

Vocoder

  1. Unzip hifigan/generator_universal.pth.tar.zip in the same directory.

Preprocess

First, download the ResCNN Softmax+Triplet pretrained model from philipperemy's DeepSpeaker for the speaker embedding, as described in our paper, and place it in hp.speaker_embedder_dir.

Second, download the Montreal Forced Aligner (MFA) package and the pretrained (LibriSpeech) lexicon file with the following commands. MFA is used to obtain the alignments between utterances and phoneme sequences, as in FastSpeech2.

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then, process all the necessary features. You will get a stat.txt file in hp.preprocessed_path/. You have to modify the f0 and energy parameters in hparams.py according to the contents of stat.txt.

python3 preprocess.py
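Transferring the statistics into hparams.py can be scripted. The sketch below is hypothetical: it assumes stat.txt holds one "name min max" triple per line, which may not match the file your run produces, so inspect the actual file first.

```python
# Hypothetical stat.txt parser -- the real file's layout may differ; inspect it first.
# Assumes one "name min max" triple per line, e.g. "f0 71.0 795.8" (illustrative values).

def parse_stats(text: str) -> dict:
    """Map each statistic name to its (min, max) pair."""
    stats = {}
    for line in text.strip().splitlines():
        name, lo, hi = line.split()
        stats[name] = (float(lo), float(hi))
    return stats

example = "f0 71.0 795.8\nenergy 0.0 315.9"  # made-up numbers for illustration
stats = parse_stats(example)
```

With the parsed ranges in hand, you would copy the f0 and energy min/max into the corresponding fields of hparams.py.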

Finally, generate the noisy data separately from the clean data by mixing each utterance with a randomly selected piece of background noise from the WHAM! dataset.

python3 preprocess_noisy.py
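Conceptually, the clean/noise mixing can be sketched as follows. preprocess_noisy.py is the authoritative implementation; the SNR-based scaling and the example SNR value here are assumptions for illustration.

```python
# Conceptual sketch of mixing an utterance with background noise at a target SNR.
# preprocess_noisy.py is authoritative; the SNR scheme here is an assumption.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    # Tile or crop the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 22050, endpoint=False))
noise = rng.standard_normal(8000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)  # 10 dB chosen arbitrarily
```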

Train

Now you have all the prerequisites! Train the model using the following command:

python3 train.py

Inference

Prepare Texts

Create sentences.py in data/ containing a Python list named sentences with the texts to be synthesized. Note that sentences can contain more than one text.

# In 'data/sentences.py',
sentences = [
    "Nothing is lost, everything is recycled."
]

Prepare Reference Audios

Reference audio preparation follows a process similar to training data preparation. There are two kinds of references: clean and noisy.

First, put clean audios with their corresponding texts in a single directory, modify hp.ref_audio_dir in hparams.py, and process all the necessary features. Refer to the Clean Data section of the training preparation above.

python3 preprocess_refs.py

Then, get the noisy references.

python3 preprocess_noisy.py --refs

Synthesize

The following command will synthesize all combinations of texts in data/sentences.py and audios in hp.ref_audio_dir.

python3 synthesize.py --ckpt CHECKPOINT_PATH
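The all-combinations behavior can be pictured as a simple Cartesian product. This is an illustration of what the script does, not its actual code, and the reference filenames are hypothetical.

```python
# Illustration of synthesize.py's all-combinations behavior (not its actual code).
from itertools import product

sentences = ["Nothing is lost, everything is recycled."]
ref_audios = ["p225_001.wav", "p226_002.wav"]  # hypothetical files in hp.ref_audio_dir

# Each (text, reference) pair would yield one synthesized sample.
jobs = list(product(sentences, ref_audios))
```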

Alternatively, you can specify a single reference audio in hp.ref_audio_dir as follows.

python3 synthesize.py --ckpt CHECKPOINT_PATH --ref_name AUDIO_FILENAME

Also, there are several useful options.

  1. --speaker_id specifies the speaker. The specified speaker's embedding should be in hp.preprocessed_path/spker_embed. The default value is None, in which case the speaker embedding is computed at runtime from each input audio.

  2. --inspection gives additional outputs that show the effect of each encoder of STYLER. The samples match the Style Factor Modeling section of our demo page.

  3. --cont generates samples as in the Style Factor Control section of our demo page.

    python3 synthesize.py --ckpt CHECKPOINT_PATH --cont --r1 AUDIO_FILENAME_1 --r2 AUDIO_FILENAME_2

    Note that the --cont option works only on preprocessed data. In detail, the audio names should follow the same format as the VCTK dataset (e.g., p323_229), and the preprocessed data must already exist in hp.preprocessed_path.

TensorBoard

The TensorBoard logs are stored in the log directory. Use

tensorboard --logdir log

to serve TensorBoard on your localhost. Here are some logging views from training the model on VCTK for 560k steps.

Notes

  1. For many noisy samples, f0 extraction through pyworld failed, unlike for the clean data. To resolve this, pysptk is applied to extract log f0 for the noisy data's fundamental frequency. The --noisy_input option automates this process during synthesis.

  2. If MFA-related problems occur while running preprocess.py, try running MFA manually with the following command.

    # Replace $YOUR_data_dir and $YOUR_PREPROCESSED_PATH with, for example,
    # ./VCTK-Corpus-92/wav48_silence_trimmed and ./preprocessed/VCTK/TextGrid
    ./montreal-forced-aligner/bin/mfa_align $YOUR_data_dir montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8
  3. DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings in our experiments.

  4. Currently, preprocess.py divides the dataset into two subsets: a training set and a validation set. If you need other sets, such as a test set, you only need to modify the text files (train.txt or val.txt) in hp.preprocessed_path/.
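Carving out a test split can be sketched as below. This is a hypothetical helper: it assumes train.txt/val.txt are plain one-utterance-per-line metadata files and that a random split is acceptable, so verify both against your actual files.

```python
# Hypothetical helper for adding a test split to the metadata files.
# Assumes one utterance entry per line (verify against your train.txt/val.txt).
import random

def split_lines(lines, val_frac=0.05, test_frac=0.05, seed=1234):
    """Shuffle metadata lines and cut them into train/val/test subsets."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed for a reproducible split
    n_val = int(len(lines) * val_frac)
    n_test = int(len(lines) * test_frac)
    return (lines[n_val + n_test:],        # train
            lines[:n_val],                 # val
            lines[n_val:n_val + n_test])   # test

train, val, test = split_lines([f"utt_{i}" for i in range(100)])
```

The three lists would then be written back as train.txt, val.txt, and a new test.txt in hp.preprocessed_path/.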

Citation

If you would like to use or refer to this implementation, please cite our paper along with this repository.

@inproceedings{lee21h_interspeech,
  author={Keon Lee and Kyumin Park and Daeyoung Kim},
  title={{STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4643--4647},
  doi={10.21437/Interspeech.2021-838}
}
