
Timbre Transfer with VAE-GAN & WaveNet

This pipeline follows and extends the work of Albadawy & Lyu 2020. The work built on this pipeline shows (amongst other things) that the proposed voice conversion model is also applicable to musical instruments, thereby generalising the conversion to a broader notion of audio style: timbre.

You can find the pre-print of this work on arXiv (2109.02096). Please be sure to reference it if you use this code for your research:

@misc{sammutbonnici2021timbre,
      title={Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks}, 
      author={Russell Sammut Bonnici and Charalampos Saitis and Martin Benning},
      year={2021},
      eprint={2109.02096},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Summary

The implemented pipeline makes use of the following projects (click for originating repos):

  1. voice_conversion - Performs VAE-GAN style transfer in the time-frequency mel-spectrogram domain (see the sketch after this list).
  2. wavenet_vocoder - Vocodes the mel-spectrogram output of the style transfer model into realistic audio.
  3. fad - Computes the Fréchet Audio Distance (using VGGish) to evaluate the quality of the WaveNet vocoder output.
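
For orientation, the style transfer in step 1 operates on mel-spectrograms rather than raw waveforms. Below is a minimal sketch of computing such a time-frequency representation with librosa; the sample rate and STFT parameters are illustrative assumptions, not necessarily the settings used by voice_conversion.

import librosa
import numpy as np

# Load a mono waveform (illustrative sample rate).
y, sr = librosa.load("sample.wav", sr=22050)

# Mel-spectrogram: the time-frequency representation the style transfer works on.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Log compression is commonly applied before feeding spectrograms to a model.
log_mel = np.log(np.clip(mel, 1e-5, None))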

Demo

Female to Male: G2_970641406_9a20ee636a_4

Violin to Trumpet: G1_AuSep_2_vn_32_Fugue

Hardware

Recommended GPU VRAM per model:

  • voice_conversion - 2 GB
  • wavenet_vocoder - 8 GB
  • fad - 16 GB

Note

  • If you train many-to-many (more than 2 timbres), you may need a stronger GPU for voice_conversion.
  • If wavenet_vocoder exceeds your resources, you can try a less computationally intensive vocoder (such as melgan).
  • If fad exceeds your computational resources, you can skip it. It is not necessary for timbre transfer, only for evaluating it.

Tutorial

0. Setup

0.1. Clone this repo as well as its submodules for voice_conversion and wavenet_vocoder with git:

git clone https://github.com/RussellSB/tt-vae-gan.git
cd tt-vae-gan 
git submodule init 
git submodule update

0.2. Ensure that your environment has the dependencies of both submodules installed.


1. VAE-GAN

1.0. Download the dataset.

Choose:

  • Flickr 8k Audio for speakers (link)
  • URMP for instruments (link)
  • Other

1.1. Prepare your data.

Run one of the following Python commands to extract the timbre files of interest:

cd data_prep
python flickr.py --dataroot [path/to/flickr_audio/flickr_audio/]  # For Flickr
python urmp.py --dataroot [path/to/urmp/]  # For URMP
  • By default this will output to voice_conversion/data/data_[name]/.
  • [name] will be either flickr or urmp.
  • You can add more timbres by duplicating lines 27-28 of the script and changing the last argument of each to the timbre id of interest.

Alternatively, you can use your own dataset. Just set it up so that voice_conversion/data/data_mydataset has the following structure (a small script for arranging files this way is sketched after the tree):

voice_conversion/data/data_mydataset
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Each speaker directory must contain its .wav files directly (no nested subdirectories).
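
A minimal sketch for arranging your own recordings into this layout; the source folders and speaker names below are hypothetical.

import shutil
from pathlib import Path

# Hypothetical sources: one existing folder of .wav files per timbre.
sources = {
    "spkr_1": Path("/path/to/my_recordings/timbre_a"),
    "spkr_2": Path("/path/to/my_recordings/timbre_b"),
}
target_root = Path("voice_conversion/data/data_mydataset")

for spkr, src in sources.items():
    out_dir = target_root / spkr
    out_dir.mkdir(parents=True, exist_ok=True)
    for wav in src.glob("*.wav"):
        # Copy each file directly into the speaker folder (no nesting).
        shutil.copy(wav, out_dir / wav.name)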

1.2. Preprocess your data

cd ../voice_conversion/src
python preprocess.py --dataset ../data/data_[name]
  • Set more than two timbres by also adding --n_spkrs [int]. By default n_spkrs=2.

1.3. Train on your data.

python train.py --model_name [expname] --dataset ../data/data_[name] --n_spkrs 2
  • Can set the maximum number of epochs with --n_epochs [int] (default: 100).
  • Can set how often models are saved with --checkpoint_interval [int] (default: every epoch).
  • Can set --n_spkrs to more than 2 for a many-to-many approach.

1.4. Infer with VAE-GAN and reconstruct raw audio with Griffin-Lim.

python inference.py --model_name [expname] --epoch [int] --trg_id 2 --src_id 1 --wavdir [path/to/testset_1]
  • Instead of --wavdir, you can pass --wav for a single input file.
  • Only the WaveNet data prep (step 2.1) creates audio directories for each train/eval/test split, so use those as input.
  • Do step 2.1 first, then come back to this step; you can then set something like --wavdir ../../wavenet_vocoder/egs/gaussian/data/flickr_2/eval.
  • Griffin-Lim reconstructions are low quality, but they are improved to sufficient realism when passed through the WaveNet vocoder (see the short Griffin-Lim sketch below).
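
For intuition, Griffin-Lim iteratively estimates the phase that a magnitude spectrogram discards, which is why its reconstructions sound rough compared to a neural vocoder. A minimal, standalone sketch with librosa (not the exact code path of inference.py; the STFT parameters are illustrative):

import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=22050)

# Magnitude spectrogram: the phase is thrown away here.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim re-estimates the missing phase over a number of iterations.
y_gl = librosa.griffinlim(S, n_iter=32, hop_length=256)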

2. WaveNet

2.1. Prepare your data again (based on data extracted for VAE-GAN).

cd ../../data_prep
python wavenet.py --dataset ../voice_conversion/data/data_[name] --outdir ../wavenet_vocoder/egs/gaussian/data --tag [name]

2.2. Preprocess your data again (based on WaveNet specs this time).

cd ../wavenet_vocoder/egs/gaussian
spk="[name]_[id]" ./run.sh --stage 1 --stop-stage 1
  • For two speakers, [id] would be either 1 or 2. If you want to train on all speakers at once, set [id] to something like "_all".
  • You need to run the .sh command once for each target timbre if you want to train timbre-specific vocoders.
  • Note that, as an alternative to specific vocoders, you can train a general vocoder (one vocoder for all speakers instead of one per speaker). For this, prepare the training and test data in the previous step from all speakers as a single dataset.

2.3. Train a wavenet vocoder.

spk="[name]_[id]" hparams=conf/[name].json ./run.sh --stage 2 --stop-stage 2 
  • Just like preprocessing, run this once for each target timbre if you are training specific vocoders.
  • Note that by default this is set to look for and load a pretrained model (provided below). If you'd like to train WaveNet from scratch with new weights, remove line 78 in run.sh.
  • You can add CUDA_VISIBLE_DEVICES="0,1" before ./run.sh if you have two GPUs (training takes quite a long time).

2.4. Infer on the style-transferred reconstructions to improve their perceptual quality.

spk="[name]_[id_2]" inferdir="[expname]_[epoch]_G[id_2]_S[id_1]" hparams=conf/flickr.json ./infer.sh
  • [id_2] is the target id; [id_1] is the source id.
  • For example, for transfer from id 1 to id 2 with experiment 'initial' and the VAE-GAN checkpoint from epoch 99, inferdir="initial_99_G2_S1" (see the naming sketch below).
  • You can also add CUDA_VISIBLE_DEVICES="0,1" before ./infer.sh (inference takes quite a long time).
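
As a small illustration of how the inferdir name is composed (values taken from the example above; this snippet is purely illustrative):

expname, epoch, trg_id, src_id = "initial", 99, 2, 1
inferdir = f"{expname}_{epoch}_G{trg_id}_S{src_id}"  # -> "initial_99_G2_S1"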

3. FAD

3.0. Download the VGGish model pretrained on AudioSet.

cd ../../../fad
mkdir -p data
curl -o data/vggish_model.ckpt https://storage.googleapis.com/audioset/vggish_model.ckpt

3.1. Create CSVs referencing the files of the timbre sets (the real train set, then the fake test set, both of the same target timbre).

ls --color=never ../wavenet_vocoder/egs/gaussian/data/[name]_[id_2]/train_no_dev/*.wav  > test_audio/[name]_[id_2].csv
ls --color=never ../wavenet_vocoder/egs/gaussian/out/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]/*_gen.wav > test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv

3.2. Embed each of the timbre sets with VGGish

mkdir -p stats
python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2].csv \
                                                         --stats stats/[name]_[id_2]_stats

python -m frechet_audio_distance.create_embeddings_main --input_files test_audio/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1].csv \
                                                         --stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
  • Run the Python command once for the train set and once for the test set.
  • Can add CUDA_VISIBLE_DEVICES="0,1" before python if available (embedding takes a while).

3.3. Compute the Fréchet distance between the stats of the real and generated sets.

python -m frechet_audio_distance.compute_fad --background_stats stats/[name]_[id_2]_stats \
                                             --test_stats stats/[name]_[id_2]_[expname]_[epoch]_G[id_2]_S[id_1]_stats
  • Background refers to the real training audio set.
  • Test refers to the fake, style-transferred audio set (a sketch of the underlying Fréchet distance computation follows).
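
For reference, the FAD is the Fréchet distance between two Gaussians fitted to the VGGish embeddings of the real and generated sets (the stats files above store their means and covariances). A minimal sketch of that final computation with placeholder embeddings; the helper name is an assumption, not part of the fad module:

import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Placeholder embeddings with VGGish's 128-dimensional output.
real = np.random.randn(1000, 128)
fake = np.random.randn(1000, 128)
fad = frechet_distance(real.mean(axis=0), np.cov(real, rowvar=False),
                       fake.mean(axis=0), np.cov(fake, rowvar=False))
print(f"FAD: {fad:.3f}")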

Pretrained Models

With respect to the current data preparation setup, the following one-to-one VAE-GANs and specific vocoders were trained:

Model     Flickr   URMP
VAE-GAN   link     link
WaveNet   link     link

Pretrained VAE-GAN

  1. Create directory voice_conversion/src/saved_models/initial.
  2. Drag .pth files to that directory.
  3. Call with --model_name initial --epoch 99 for inference (with epoch 490 for URMP).

Notes

  • G1 is female for Flickr, trumpet for URMP.
  • G2 is male for Flickr, violin for URMP.

Pretrained WaveNet

  1. Create directory wavenet_vocoder/egs/gaussian/exp/
  2. Drag the folder such as flickr_1_train_no_dev_flickr into that directory.
  3. Drag the meanvar.joblib file from within that folder to a new directory following wavenet_vocoder/egs/gaussian/dump/[spk]/logmelspectrogram/org, where [spk] corresponds to flickr_1, for example.
  4. Call ./infer.sh with appropriate arguments such as spk="flickr_1" inferdir="initial_99_G1_S2".

Notes

  • flickr_1 is female, flickr_2 is male.
  • urmp_1 is trumpet, urmp_2 is violin.
  • inferdir is just a local directory in voice_conversion/src/out_infer/. You can point it to any local dir within that path for input.