
DongyaoZhu / Vq Vae Wavenet

TensorFlow implementation of VQ-VAE with WaveNet decoder, based on https://arxiv.org/abs/1711.00937 and https://arxiv.org/abs/1901.08810

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Vq Vae Wavenet

wavenet-like-vocoder
Basic wavenet and fftnet vocoder model.
Stars: ✭ 20 (-50%)
Mutual labels:  wavenet
ttslearn
ttslearn: library accompanying the book "Pythonで学ぶ音声合成" (Text-to-Speech with Python)
Stars: ✭ 158 (+295%)
Mutual labels:  wavenet
Flowavenet
A Pytorch implementation of "FloWaveNet: A Generative Flow for Raw Audio"
Stars: ✭ 471 (+1077.5%)
Mutual labels:  wavenet
wavenet
Audio source separation (mixture to vocal) using the Wavenet
Stars: ✭ 20 (-50%)
Mutual labels:  wavenet
chainer-ClariNet
A Chainer implementation of ClariNet.
Stars: ✭ 45 (+12.5%)
Mutual labels:  wavenet
Pytorchwavenetvocoder
WaveNet-Vocoder implementation with pytorch.
Stars: ✭ 269 (+572.5%)
Mutual labels:  wavenet
Seriesnet
Time series prediction using dilated causal convolutional neural nets (temporal CNN)
Stars: ✭ 185 (+362.5%)
Mutual labels:  wavenet
Wavenet Stt
An end-to-end speech recognition system with Wavenet. Built using C++ and python.
Stars: ✭ 18 (-55%)
Mutual labels:  wavenet
chainer-Fast-WaveNet
A Chainer implementation of Fast WaveNet(mel-spectrogram vocoder).
Stars: ✭ 33 (-17.5%)
Mutual labels:  wavenet
Pycadl
Python package with source code from the course "Creative Applications of Deep Learning w/ TensorFlow"
Stars: ✭ 356 (+790%)
Mutual labels:  wavenet
Music-Style-Transfer
Source code for "Transferring the Style of Homophonic Music Using Recurrent Neural Networks and Autoregressive Model"
Stars: ✭ 16 (-60%)
Mutual labels:  wavenet
constant-memory-waveglow
PyTorch implementation of NVIDIA WaveGlow with constant memory cost.
Stars: ✭ 36 (-10%)
Mutual labels:  wavenet
Clarinet
A Pytorch Implementation of ClariNet
Stars: ✭ 273 (+582.5%)
Mutual labels:  wavenet
birdsong-generation-project
Generating birdsong with WaveNet
Stars: ✭ 26 (-35%)
Mutual labels:  wavenet
Speech Denoising Wavenet
A neural network for end-to-end speech denoising
Stars: ✭ 516 (+1190%)
Mutual labels:  wavenet
Vq Vae Speech
PyTorch implementation of VQ-VAE + WaveNet by [Chorowski et al., 2019] and VQ-VAE on speech signals by [van den Oord et al., 2017]
Stars: ✭ 187 (+367.5%)
Mutual labels:  wavenet
hifigan-denoiser
HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
Stars: ✭ 88 (+120%)
Mutual labels:  wavenet
Pytorch Uniwavenet
Stars: ✭ 30 (-25%)
Mutual labels:  wavenet
Parallelwavegan
Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN) with Pytorch
Stars: ✭ 682 (+1605%)
Mutual labels:  wavenet
Time Series Prediction
A collection of time series prediction methods: rnn, seq2seq, cnn, wavenet, transformer, unet, n-beats, gan, kalman-filter
Stars: ✭ 351 (+777.5%)
Mutual labels:  wavenet

VQ-VAE-WaveNet

This is a TensorFlow implementation of VQ-VAE with a WaveNet decoder, based on https://arxiv.org/abs/1711.00937 and https://arxiv.org/abs/1901.08810.

Dependencies:

TensorFlow r1.12 / r1.14, numpy, librosa, scipy, tqdm

Results

The folder results contains some reconstructed audio. Speaker conversion works well, but the encoder (local condition) needs more tuning.

Model

Encoder

There are 3 encoders implemented:

  • 6 strided convolution layers with 64 channels, as described in the original paper (default)
  • the Magenta encoder from nsynth-magenta (WaveNet-like)
  • the 2019 encoder described in https://arxiv.org/abs/1901.08810

Parameters can be found in Encoder/encoder.py and model_parameters.json.
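
For orientation, here is a minimal TF r1.x sketch of the default strided-convolution encoder; the layer count and channel width follow the description above, while latent_dim and the kernel sizes are placeholder assumptions (the real values live in Encoder/encoder.py and model_parameters.json):

import tensorflow as tf

def encoder(x, channels=64, num_layers=6, latent_dim=128):
    # x: [batch, time, 1] raw waveform
    h = x
    for i in range(num_layers):
        # each strided convolution halves the time resolution
        h = tf.layers.conv1d(h, filters=channels, kernel_size=4,
                             strides=2, padding='same',
                             activation=tf.nn.relu, name='enc_%d' % i)
    # 1x1 convolution projects to the pre-quantisation space z_e
    return tf.layers.conv1d(h, filters=latent_dim, kernel_size=1, name='z_e')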

VQ

There are 2 ways to train the embedding:

  • train $z_e$ and $e_k$ separately with tf.stop_gradient, as described in the original paper (default)
  • train them jointly, without tf.stop_gradient

Initialising the embedding:

  • uniform scaling (default)
  • random normal initialisation

Quantisation can also be turned off entirely, in which case a plain autoencoder (AE) is trained.

Parameters can be found in model_parameters.json.
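
As a reference point, a minimal sketch of the default quantisation path, including the tf.stop_gradient trick and the codebook/commitment losses from the original paper (K, D and beta below are placeholders; the real values come from model_parameters.json):

import tensorflow as tf

def vector_quantise(z_e, K=512, D=128, beta=0.25):
    # z_e: [batch, time, D] encoder output; e: [K, D] codebook
    e = tf.get_variable('embedding', [K, D],
                        initializer=tf.uniform_unit_scaling_initializer())
    # squared distance from every frame to every codebook vector
    d = (tf.reduce_sum(z_e ** 2, axis=-1, keepdims=True)
         - 2 * tf.einsum('btd,kd->btk', z_e, e)
         + tf.reduce_sum(e ** 2, axis=-1))
    z_q = tf.gather(e, tf.argmin(d, axis=-1))    # nearest neighbours

    # codebook loss pulls e_k towards z_e; commitment loss does the reverse
    vq_loss = tf.reduce_mean((tf.stop_gradient(z_e) - z_q) ** 2)
    commit_loss = beta * tf.reduce_mean((z_e - tf.stop_gradient(z_q)) ** 2)

    # straight-through estimator: decoder gradients flow from z_q to z_e
    z_q = z_e + tf.stop_gradient(z_q - z_e)
    return z_q, vq_loss + commit_loss

Dropping the tf.stop_gradient calls (and folding the two losses into one) corresponds to the 'train them jointly' option above.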

Decoder

The decoder is a WaveNet, conditioned locally on the encoding and globally on speaker identity.

Parameters can be found in wavenet_parameters.json.
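
The core building block of such a decoder is a gated, dilated causal convolution with residual and skip connections; a hedged sketch follows (channel sizes are placeholders, the real hyperparameters are in wavenet_parameters.json):

import tensorflow as tf

def wavenet_block(x, cond, dilation, channels=128):
    # x: [batch, time, channels] residual input
    # cond: [batch, time, c] local + global conditioning, aligned in time
    conv = tf.keras.layers.Conv1D(2 * channels, 2, dilation_rate=dilation,
                                  padding='causal')(x)
    conv += tf.keras.layers.Conv1D(2 * channels, 1)(cond)
    filt, gate = tf.split(conv, 2, axis=-1)
    out = tf.tanh(filt) * tf.sigmoid(gate)        # gated activation unit
    skip = tf.keras.layers.Conv1D(channels, 1)(out)
    return x + skip, skip                         # residual out, skip out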

Training

Dataset

Supports VCTK (default) and LibriSpeech. Download the data and put the unzipped folder 'VCTK-Corpus' or 'LibriSpeech' in the folder data. To train on a custom dataset, refer to dataset.py for making iterators.
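
For a custom dataset, an iterator along these lines should work with the TF r1.x tf.data API (purely illustrative; the function name, 16 kHz sample rate, and padding are assumptions, not the actual dataset.py interface):

import librosa
import numpy as np
import tensorflow as tf

def make_iterator(wav_paths, speaker_ids, length=6656, batch=8):
    # wav_paths: list of audio file paths; speaker_ids: one int per file
    def _read(path):
        audio, _ = librosa.load(path.decode(), sr=16000)
        audio = np.pad(audio, (0, max(0, length - len(audio))), 'constant')
        return audio[:length].astype(np.float32)

    def _load(path, speaker):
        audio = tf.py_func(_read, [path], tf.float32)
        audio.set_shape([length])
        return audio, speaker

    data = (tf.data.Dataset.from_tensor_slices((wav_paths, speaker_ids))
            .shuffle(1000).map(_load).batch(batch).repeat())
    return data.make_one_shot_iterator()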

example usage:

python3 train.py -dataset VCTK -length 6656 -batch 8 -step 100000 -save saved_model/weights

  • -dataset VCTK or LibriSpeech
  • -length length of each training segment; must be a multiple of the largest dilation rate (320 ms recommended)
  • -batch batch size
  • -step number of steps to train
  • -save where to save checkpoints (e.g. saved_model/weights)
  • -restore resume from a pretrained model (e.g. saved_model/weights-110640)
  • -interval number of steps between each log written to disk

Generation

Implements fast generation (reusing cached convolution outputs rather than recomputing the full receptive field at each step); generation starts from zeros.

example usage:

python3 generate.py -restore saved_model/weights-110640 -audio data/VCTK-Corpus/wav48/p225/p225_001.wav -speakers p225 p226 p227 p228 -mode sample

  • -restore checkpoint to restore; the embedding & generated audio are saved alongside it
  • -audio audio file to use as local condition
  • -speakers speaker(s) to use as global condition; must be consistent with the training data
  • -mode how to sample from the predicted quantised distribution (sample or greedy)
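
The two modes amount to sampling from, versus taking the argmax of, the softmax distribution WaveNet predicts for each output sample (a rough sketch; the 256 mu-law bins are the usual WaveNet choice, assumed here):

import numpy as np

def pick_sample(logits, mode='sample'):
    # logits: unnormalised scores over e.g. 256 mu-law bins for one timestep
    if mode == 'greedy':
        return int(np.argmax(logits))
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))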

Visualisation

For now it saves the trained VQ embedding space and visualises it through http://projector.tensorflow.org.

example usage:

python3 visualise.py -embedding embedding_110640.npy -speaker speaker_embedding_110640.npy -save embeddings

then upload the tsv files in the folder embeddings to the website.
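
If you ever need the projector-ready files by hand, converting a saved .npy embedding to TSV is a one-liner with numpy (a hypothetical snippet, not part of visualise.py):

import numpy as np

emb = np.load('embedding_110640.npy')            # [K, D] trained codebook
np.savetxt('embeddings/embedding.tsv', emb, delimiter='\t')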

Note that the speaker embedding separates gender almost perfectly (upload the vec and meta files to http://projector.tensorflow.org, then search for #f# or #m#). Also, q(z|x) did slowly converge to the assumed uniform prior distribution.

Miscellaneous

Stuff I've tried:

  • At each frame of the encoder output, instead of predicting a vector, finding its nearest neighbour, and using the index as a one-hot categorical distribution, I make the last encoder channel have size k and apply a softmax so it represents a k-way categorical distribution, whose KL divergence from a uniform prior amounts to a cross-entropy loss. I add this loss to the original 3 losses (see the sketch after this list).

  • First train without the decoder, then freeze the embedding & encoder and train the decoder. This made the VQ embedding space more diverse than training the whole model end-to-end.
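
A sketch of the extra loss from the first experiment above, assuming the last encoder channel has size k (a hypothetical helper, not a function in this repo):

import tensorflow as tf

def uniform_prior_loss(logits):
    # logits: [batch, time, k] interpreted as a k-way categorical distribution
    q = tf.nn.softmax(logits)
    log_q = tf.nn.log_softmax(logits)
    k = tf.cast(tf.shape(logits)[-1], tf.float32)
    # KL(q || uniform) = sum_i q_i log q_i + log k
    return tf.reduce_mean(tf.reduce_sum(q * log_q, axis=-1) + tf.log(k))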

TODO

  • [ ] Train a prior over the VQ codes

Alternative Implementation

The folder Magenta contains an implementation assembled from the 'official' Magenta code; it is highly coupled. My own implementation draws insights from it. Training and generation work much the same way there.
