WaveGlow vocoder with VQVAE

TensorFlow implementation of WaveGlow: A Flow-based Generative Network for Speech Synthesis and Neural Discrete Representation Learning.

This implementation includes multi-GPU and mixed-precision (still unstable) support. It is heavily based on existing GitHub repositories, notably waveglow. The data used here are the LJ Speech dataset and the VCTK Corpus.

You can choose the local condition from mel-spectrograms or vector-quantized representations, and also choose whether to use speaker identity as a global condition. As additional options, Polyak averaging, FiLM, and weight normalization are implemented.
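These options map to keys in config.yml. Below is a minimal sketch, assuming only the key names that appear in the command-line examples later in this README; the values and layout are illustrative, not the repository defaults:

local_condition: mel   # combined with use_vq: false for the mel-spectrogram condition
use_vq: false          # true enables the vector-quantized local condition
use_film: true         # FiLM layers on/off
ftype: float16         # float32 for fp32 training
loss_scale: 1          # paired with ftype: float32 in the fp32 example below
hidden_size: 512       # values from the hparam-override example below
num_heads: 8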

Audio Samples

LJ dataset

Mel spectrogram condition (original WaveGlow): https://drive.google.com/open?id=1HuV51fnhEZG_6vGubXVrer6lAtZK7py9

VQVAE condition: https://drive.google.com/open?id=1xcGSelMycn2g-72noZH4vPiPpG0d7pZq

VCTK Corpus (Voice conversion)

It does not work well at the moment :(

Source (360): https://drive.google.com/open?id=1CfEvnQS_dVYRhsvj8NDqogOJlzK7npTd

Target (303): https://drive.google.com/open?id=1-kcSglimKgJrRjLDfPbD7s5KxZuFRY-i

My Humble Contribution

I slightly modify the original VQ-VAE optimization technique to improve robustness to hyperparameter choices and the diversity of latent code usage, without index collapse. That is,

  • the original technique consists of 1) finding the nearest latent codes for the encoded vectors and 2) updating the latent codes toward their matching encoded vectors.
  • I modify these steps into 1) computing a distribution over latent codes given the encoded vectors and 2) updating the latent codes to increase their likelihood under the distribution of matching encoded vectors.
  • By replacing EMA with gradient descent, the latent codes receive additional gradient signals that reduce the reconstruction loss (which is impossible in the EMA setting).

It closely resembles the soft-EM method; the difference from soft-EM is that the closed-form maximization step is replaced with a gradient descent step. For more information, please see em_toy.ipynb or contact me ([email protected]).
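As a rough illustration, here is a minimal NumPy sketch of one soft-EM-style codebook update, assuming a unit-variance Gaussian mixture over the codes. The function and variable names (soft_em_codebook_step, z_e, codebook, lr) are mine, not the repository's; em_toy.ipynb is the authoritative version.

import numpy as np

def soft_em_codebook_step(z_e, codebook, lr=1e-3):
    # E-step: soft responsibilities q(k | z_n) from squared distances,
    # instead of a hard nearest-neighbor assignment.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -0.5 * d2                       # log N(z_n; e_k, I) up to a constant
    q = np.exp(logits - logits.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)        # (N, K)

    # M-step: one gradient step on the negative log-likelihood instead of a
    # closed-form mean update (or EMA); d(-loglik)/d e_k = sum_n q_nk (e_k - z_n).
    grad = q.sum(axis=0)[:, None] * codebook - q.T @ z_e          # (K, D)
    codebook -= lr * grad
    return q

# toy usage
rng = np.random.default_rng(0)
z_e = rng.normal(size=(256, 16))             # encoded vectors
codebook = rng.normal(size=(32, 16))         # latent codes
q = soft_em_codebook_step(z_e, codebook)

Because the codes are updated by gradient descent rather than EMA, the same variables can additionally receive gradients from the reconstruction loss, as noted above.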

As I haven't investigated this method thoroughly, I cannot claim it is better than previous methods in every case. However, I found that it works well in all of my experimental settings (no index collapse).

Pre-requisites

  1. TensorFlow 1.12 (1.13 works, with some deprecation warnings)
  2. (If fp16 training is needed) Volta GPUs
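A quick way to check which version is installed before training:

python -c "import tensorflow as tf; print(tf.__version__)"  # expect 1.12.x (or 1.13.x)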

Setup

# 1. Create dataset folder
mkdir datasets
cd datasets

# 2. Download and extract datasets
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -jxvf LJSpeech-1.1.tar.bz2

# Additionally, download VCTK Corpus
wget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
tar -zxvf VCTK-Corpus.tar.gz
cd ../filelists
python resample_vctk.py # Change sample rate

# 3. Create TFRecords
python generate_data.py

# Additionally, create VCTK TFRecords
python generate_data.py -c tfr_dir=datasets/vctk tfr_prefix=vctk train_files=filelists/vctk_sid_audio_text_train_filelist.txt eval_files=filelists/vctk_sid_audio_text_eval_filelist.txt

Training

# 1. Create log directory
mkdir ~/your-log-dir

# 2. (Optional) Copy configs
cp ./config.yml ~/your-log-dir

# 3. Run training
python train.py -m ~/your-log-dir

If you want to change hyperparameters, you can do so in one of two ways:

  • modify config.yml
  • add arguments as below:
    python train.py -m ~/your-log-dir --c hidden_size=512 num_heads=8
    

Example configs:

  • fp32 training: python train.py -m ~/your-log-dir --c ftype=float32 loss_scale=1
  • mel condition: python train.py -m ~/your-log-dir --c local_condition=mel use_vq=false
  • remove FiLM layers: python train.py -m ~/your-log-dir --c use_film=false

Pre-trained models

Compressed model directories with pretrained weights: WILL BE UPLOADED SOON!

You can generate samples with those models in inference.ipynb.

You may have to change tfr_dir and model_dir to match your setup.
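For instance, something like the following near the top of the notebook; the exact cell contents and the TFRecord path are assumptions, only the two variable names come from this README:

import os

tfr_dir = 'datasets/ljspeech'                     # assumed TFRecord directory from generate_data.py
model_dir = os.path.expanduser('~/your-log-dir')  # the directory passed to train.py with -m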

Disclaimer

  • With fp16 settings, training 1M steps takes about one week on 4 V100 GPUs.
  • I haven't tried fp32 training, so there may be issues with training high-quality models.
  • As fp16 training is not yet robust, I usually train a FiLM-enabled model and a FiLM-disabled model one after the other and choose the one that survives.
  • For a single-speaker dataset (LJ Speech), the trained model's vocoding quality is good enough compared to the mel-spectrogram-conditioned one.
  • For a multi-speaker dataset (VCTK Corpus), disentangling speaker identity from the local condition does not work well yet; I am investigating the reasons.
  • The next step would be training a text-to-latent-codes model (e.g., a Transformer) so that fully end-to-end TTS is possible.
  • If you're interested in this project, please help improve the models with me!