Cotatron — Official PyTorch Implementation

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
Seung-won Park, Doo-young Kim, Myun-chul Joe @ SNU, MINDsLab Inc.

Paper: https://arxiv.org/abs/2005.03295 (To appear in INTERSPEECH 2020)
Audio Samples: https://mindslab-ai.github.io/cotatron

Update: Enjoy our pre-trained model with the Google Colab notebook!

Abstract: We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

Requirements

This repository was tested with the following environment:

The package requirements are listed in requirements.txt.

Datasets

Preparing Data

  • To reproduce the results from our paper, you need to download:
  • Unzip each file.
  • Resample them to 22.05 kHz using datasets/resample_delete.sh (a rough Python sketch of this step follows below).
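
As a rough, hypothetical illustration of that resampling step (datasets/resample_delete.sh is the script actually used; the librosa/soundfile calls and the glob pattern below are assumptions, not the repository's code):

# Hypothetical Python equivalent of resampling to 22.05 kHz; not the repository's script.
import glob
import librosa
import soundfile as sf

TARGET_SR = 22050

for path in glob.glob("datasets/wavs/**/*.wav", recursive=True):  # placeholder location
    audio, _ = librosa.load(path, sr=TARGET_SR)  # librosa resamples while loading
    sf.write(path, audio, TARGET_SR)             # overwrite with the 22.05 kHz version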

Note: The mel-spectrogram calculated from each audio file will be saved as a *.pt file on the first run, and then loaded from disk afterwards.
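
A minimal sketch of that caching pattern (illustrative only; mel_from_audio and the cache naming are placeholders, not the repository's exact code):

import os
import torch

def load_mel(wav_path, mel_from_audio):
    """Compute the mel-spectrogram once, cache it as a .pt file, and reuse it later."""
    cache_path = wav_path.replace(".wav", ".pt")
    if os.path.exists(cache_path):
        return torch.load(cache_path)   # reuse the cached mel-spectrogram
    mel = mel_from_audio(wav_path)      # placeholder: audio file -> mel tensor
    torch.save(mel, cache_path)         # cache for subsequent epochs
    return mel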

Preparing Metadata

Following the format from NVIDIA/tacotron2, the metadata should be formatted as:

path_to_wav|transcription|speaker_id
path_to_wav|transcription|speaker_id
...

Metadata for the LibriTTS train-clean-100 split and the VCTK corpus are already prepared at datasets/metadata. If you wish to use custom data, you need to prepare the metadata as shown above.
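
For reference, here is a minimal sketch of reading that pipe-separated metadata; the function name and the example path are illustrative, not part of the repository:

def read_metadata(path):
    # Parse "path_to_wav|transcription|speaker_id" lines into tuples.
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            wav_path, transcription, speaker_id = line.split("|")
            entries.append((wav_path, transcription, speaker_id))
    return entries

# e.g. entries = read_metadata("datasets/metadata/your_metadata.txt")  # illustrative path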

Training

Training our VC system consists of two steps: (1) training Cotatron, and (2) training the VC decoder on top of Cotatron.

git clone https://github.com/mindslab-ai/cotatron
cd cotatron

There are three yaml files in the config folder, which are configuration templates for each model. They must be edited to match your training requirements (dataset, metadata, etc.).

cp config/global/default.yaml config/global/config.yaml
cp config/cota/default.yaml config/cota/config.yaml
cp config/vc/default.yaml config/vc/config.yaml

Here, all files with names other than default.yaml will be ignored by git (see .gitignore).

  • config/global: Global configs used for training both Cotatron and the VC decoder.
    • Fill in the blanks for: speakers, train_dir, train_meta, val_dir, val_meta (a sanity-check sketch follows this list).
    • An example speaker id list is shown in datasets/metadata/libritts_vctk_speaker_list.txt.
    • When replicating the two-stage training process from our paper (training with LibriTTS, then with LibriTTS+VCTK), put the speaker ids from both LibriTTS and VCTK in the global config.
  • config/cota: Configs for training Cotatron.
    • You may want to change batch_size for GPUs other than a 32GB V100, or change chkpt_dir to save checkpoints to a different disk.
  • config/vc: Configs for training the VC decoder.
    • Fill in the blank for: cotatron_path.
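
As a quick sanity check after filling in the global config, the sketch below loads it with PyYAML and verifies that the fields listed above are no longer blank. This is only an illustration: it assumes the keys sit at the top level of the yaml file, which may differ from the actual layout, and PyYAML is an assumption rather than the loader used by the trainers.

import yaml  # PyYAML, assumed to be available

with open("config/global/config.yaml") as f:
    cfg = yaml.safe_load(f)

# Keys referenced above; warn early if any of them was left blank.
for key in ["speakers", "train_dir", "train_meta", "val_dir", "val_meta"]:
    if not cfg.get(key):
        print(f"'{key}' still needs to be filled in config/global/config.yaml")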

1. Training Cotatron

To train Cotatron, run this command:

python cotatron_trainer.py -c <path_to_global_config_yaml> <path_to_cotatron_config_yaml> \
                           -g <gpus> -n <run_name>

Here are some example commands that might help you understand the arguments:

# train from scratch with name "my_runname"
python cotatron_trainer.py -c config/global/config.yaml config/cota/config.yaml \
                           -g 0 -n my_runname

Optionally, you can resume training from a previously saved checkpoint by adding the -p <checkpoint_path> argument.

2. Training VC decoder

After Cotatron is sufficiently trained (i.e., producing stable alignments and a converged loss), the VC decoder can be trained on top of it.

python synthesizer_trainer.py -c <path_to_global_config_yaml> <path_to_vc_config_yaml> \
                              -g <gpus> -n <run_name>

The optional checkpoint argument is also available for VC decoder.

Monitoring via Tensorboard

Training progress, including loss values and validation outputs, can be monitored with TensorBoard. By default, the logs will be stored at logs/cota or logs/vc; this can be changed by editing the log.log_dir parameter in the config yaml file.

tensorboard --logdir logs/cota --bind_all # Cotatron - Scalars, Images, Hparams, Projector will be shown.
tensorboard --logdir logs/vc --bind_all   # VC decoder - Scalars, Images, Hparams will be shown.

Inference

We provide a Jupyter Notebook with the inference code and some visualizations of the resulting audio.

Results

According to the user study conducted on MTurk, our Cotatron-based VC system performs significantly better than the previous method in terms of both naturalness (MOS) and speaker similarity (DMOS). The objective results on speaker similarity (SCA) contradict the subjective results. See section 4.1 of our paper for details.

Approach                              | MOS         | DMOS        | SCA
Source as Target                      | 4.28 ± 0.11 | 1.71 ± 0.22 | 0.9%
Target as Target                      | 4.28 ± 0.11 | 4.78 ± 0.08 | 99.4%
Blow (previous method)                | 2.41 ± 0.14 | 1.95 ± 0.16 | 86.8%
Cotatron w/o residual encoder (Ours)  | 3.18 ± 0.14 | 4.06 ± 0.17 | 73.3%
Cotatron w/ residual encoder (Ours)   | 3.41 ± 0.14 | 3.89 ± 0.18 | 78.5%

Implementation details

Here are some noteworthy implementation details which could not be included in our paper due to lack of space:

  • Masked padding

Since the lengths of training samples are not identical, they are zero-padded to the batch's max length for GPU efficiency. Therefore, the output of every convolutional layer is masked so that results from the padded areas do not affect the non-padded areas. Please refer to appendix A.1 of Binkowski et al. (GAN-TTS, 2020).
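
A minimal PyTorch sketch of that masking step (illustrative only; the function name is hypothetical, and the repository's modules apply this inside their own forward passes):

import torch

def mask_conv_output(x, lengths):
    # x: (batch, channels, time) output of a convolutional layer.
    # lengths: (batch,) number of valid (non-padded) frames per sample.
    time = x.size(2)
    mask = torch.arange(time, device=x.device)[None, :] < lengths[:, None]  # (batch, time)
    return x * mask.unsqueeze(1).to(x.dtype)  # zero out the padded time steps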

  • Padded Instance Normalization

We exclude the padded area from the statistics calculation for the instance normalization layer of the residual encoder. See modules/padded_instancenorm.py.
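
A simplified sketch of the idea (modules/padded_instancenorm.py is the actual implementation; this version is an assumption that omits affine parameters):

import torch

def padded_instance_norm(x, lengths, eps=1e-5):
    # x: (batch, channels, time); lengths: (batch,) valid frames per sample.
    time = x.size(2)
    mask = (torch.arange(time, device=x.device)[None, :] < lengths[:, None]).to(x.dtype)
    mask = mask.unsqueeze(1)                           # (batch, 1, time)
    n = mask.sum(dim=2, keepdim=True)                  # number of valid frames
    mean = (x * mask).sum(dim=2, keepdim=True) / n
    var = (((x - mean) ** 2) * mask).sum(dim=2, keepdim=True) / n
    return (x - mean) / torch.sqrt(var + eps) * mask   # padded area stays zero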

License

BSD 3-Clause License.

Citation & Contact

@article{park2020cotatron,
  title   = {Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data},
  author  = {Park, Seung-won and Kim, Doo-young and Joe, Myun-chul},
  journal = {arXiv preprint arXiv:2005.03295},
  year    = {2020},
}

If you have a question or any other inquiries, please contact Seung-won Park at [email protected]

Repository structure

│  .gitignore
│  cotatron.py
│  cotatron_trainer.py      # Trainer file for Cotatron
|  LICENSE
│  README.md
│  requirements.txt
│  synthesizer.py
│  synthesizer_trainer.py   # Trainer file for VC decoder (named as "synthesizer")
│
├─config
│  ├─cota
│  │      default.yaml      # configuration template for Cotatron
│  │
│  ├─global
│  │      default.yaml      # configuration template for both Cotatron and VC decoder
│  │
│  └─vc
│         default.yaml      # configuration template for VC decoder
│
├─datasets                  # TextMelDataset and text preprocessor
│  │  cmudict-0.7b_fix.txt  # Modified version of CMUDict, for representation mixing (https://arxiv.org/abs/1811.07240)
|  |  resample_delete.sh    # Shellscript for audio resampling
│  │  text_mel_dataset.py
│  │  __init__.py
│  │
│  ├─metadata
│  │       (omitted)        # Refer to README.md within the folder.
│  │
│  └─text
│          cleaners.py
│          cmudict.py
|          LICENSE
│          numbers.py
│          symbols.py
│          __init__.py
├─docs                      # Audio samples and code for webpage https://mindslab-ai.github.io/cotatron
|      (omitted)
│
├─melgan                    # MelGAN vocoder w/o training code (https://arxiv.org/abs/1910.06711)
│      generator.py
|      LICENSE
│      res_stack.py
│
├─modules                   # All modules that compose model, including mel.py
│      attention.py         # Implementation of DCA (https://arxiv.org/abs/1910.10288)
│      classifier.py
│      cond_bn.py
│      encoder.py
│      mel.py               # Code for calculating mel-spectrogram from raw audio
│      padded_instancenorm.py
│      residual.py
│      tts_decoder.py
│      vc_decoder.py
│      zoneout.py
│      __init__.py
│
└─utils                     # Misc. code snippets, usually for logging
        loggers.py
        plotting.py
        utils.py

References

This implementation uses code from the following repositories:

This README and the webpage for the audio samples are inspired by:

The audio samples on our webpage are partially derived from:

  • LibriTTS: Dataset for multispeaker TTS, derived from LibriSpeech.
  • VCTK: 46 hours of English speech from 108 speakers.
  • KSS: Korean Single Speaker Speech Dataset.