
vsimkus / Voice Conversion

Voice conversion (VC) investigation using three variants of VAE

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Voice Conversion

Neural Ode
Jupyter notebook with Pytorch implementation of Neural Ordinary Differential Equations
Stars: ✭ 335 (+1495.24%)
Mutual labels:  vae
Joint Vae
Pytorch implementation of JointVAE, a framework for disentangling continuous and discrete factors of variation 🌟
Stars: ✭ 404 (+1823.81%)
Mutual labels:  vae
Deeplearningmugenknock
An implementation cheat sheet for endlessly doing deep learning and DeepLearning with deep learning
Stars: ✭ 684 (+3157.14%)
Mutual labels:  vae
Pycadl
Python package with source code from the course "Creative Applications of Deep Learning w/ TensorFlow"
Stars: ✭ 356 (+1595.24%)
Mutual labels:  vae
Deepnude An Image To Image Technology
Research on DeepNude's algorithm and on general image-generation theory and practice, including pix2pix, CycleGAN, UGATIT, DCGAN, SinGAN, ALAE, mGANprior, StarGAN-v2 and VAE models (TensorFlow 2 implementations).
Stars: ✭ 4,029 (+19085.71%)
Mutual labels:  vae
Tensorflow Mnist Vae
Tensorflow implementation of variational auto-encoder for MNIST
Stars: ✭ 422 (+1909.52%)
Mutual labels:  vae
Beta Vae
Pytorch implementation of β-VAE
Stars: ✭ 326 (+1452.38%)
Mutual labels:  vae
Advanced Deep Learning With Keras
Advanced Deep Learning with Keras, published by Packt
Stars: ✭ 917 (+4266.67%)
Mutual labels:  vae
Disentangling Vae
Experiments for understanding disentanglement in VAE latent representations
Stars: ✭ 398 (+1795.24%)
Mutual labels:  vae
Tensorflow Vae Gan Draw
A collection of generative methods implemented with TensorFlow (Deep Convolutional Generative Adversarial Networks (DCGAN), Variational Autoencoder (VAE) and DRAW: A Recurrent Neural Network For Image Generation).
Stars: ✭ 577 (+2647.62%)
Mutual labels:  vae
Tensorflow Generative Model Collections
Collection of generative models in Tensorflow
Stars: ✭ 3,785 (+17923.81%)
Mutual labels:  vae
Pytorch Vqvae
Vector Quantized VAEs - PyTorch Implementation
Stars: ✭ 396 (+1785.71%)
Mutual labels:  vae
Generative Models
Annotated, understandable, and visually interpretable PyTorch implementations of: VAE, BIRVAE, NSGAN, MMGAN, WGAN, WGANGP, LSGAN, DRAGAN, BEGAN, RaGAN, InfoGAN, fGAN, FisherGAN
Stars: ✭ 438 (+1985.71%)
Mutual labels:  vae
Dsprites Dataset
Dataset to assess the disentanglement properties of unsupervised learning methods
Stars: ✭ 340 (+1519.05%)
Mutual labels:  vae
Generative Models
Collection of generative models, e.g. GAN, VAE in Pytorch and Tensorflow.
Stars: ✭ 6,701 (+31809.52%)
Mutual labels:  vae
Pytorch rvae
Recurrent Variational Autoencoder that generates sequential data implemented with pytorch
Stars: ✭ 332 (+1480.95%)
Mutual labels:  vae
Awesome Vaes
A curated list of awesome work on VAEs, disentanglement, representation learning, and generative models.
Stars: ✭ 418 (+1890.48%)
Mutual labels:  vae
Variational Autoencoder
PyTorch implementation of "Auto-Encoding Variational Bayes"
Stars: ✭ 25 (+19.05%)
Mutual labels:  vae
Variational Autoencoder
Variational autoencoder implemented in tensorflow and pytorch (including inverse autoregressive flow)
Stars: ✭ 807 (+3742.86%)
Mutual labels:  vae
Sentence Vae
PyTorch Re-Implementation of "Generating Sentences from a Continuous Space" by Bowman et al 2015 https://arxiv.org/abs/1511.06349
Stars: ✭ 462 (+2100%)
Mutual labels:  vae

Voice Conversion on unaligned data

Voice Conversion (VC) is widely desirable across many industries and applications, including speaker anonymisation, film dubbing, gaming, and voice restoration for people who have lost their ability to speak. In this work we compare standard VAE, VQ-VAE and Gumbel VAE models as approaches to VC on the Voice Conversion Challenge 2016 dataset. We assess speech reconstruction and VC performance both on spectral frames obtained from a WORLD vocoder and on raw waveform data.

The full report and evaluation results can be found here.

How to train your VC model

1. Preprocess data

Place the raw VCC2016 dataset in data/vcc2016_raw/vcc2016_training.zip (raw audio features) or data/vcc2016/vcc2016_training.zip (WORLD features), or the VCTK dataset in data/vctk/VCTK-Corpus.zip.

To generate the preprocessed data files run one of the following:

  • VCC2016 Raw data

    python preprocessing.py --dataset=VCCRaw2016 --trim_silence=True
    
  • VCC2016 WORLD data

    python preprocessing.py --dataset=VCCWORLD2016 --trim_silence=True
    
  • VCTK Raw data

    python preprocessing.py --dataset=VCTK --trim_silence=True --shuffle_order=True --split_samples=True
    

2. Train model

Run the following on a compute cluster to train the model.

# Use train_vae.py or train_joint_vae.py instead of train_vqvae.py for the other models.
# --use_gpu: whether to train on GPU
# --gpu_id: GPU ids to use
# --filepath_to_arguments_json_file: model and experiment configuration file
# --dataset_root_path: root directory of the dataset files
python train_vqvae.py \
            --use_gpu=True \
            --gpu_id='0,1' \
            --filepath_to_arguments_json_file="experiment_configs/config_file.json" \
            --dataset_root_path='data'
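
The JSON file passed via --filepath_to_arguments_json_file supplies the attributes that are later read as args.* in the evaluation examples below (the actual configuration files live in experiment_configs/). As a rough illustration only, such a file could be generated as follows; every value is a placeholder assumption, and the architecture fields (encoder, vq, generator) are omitted because their schema is specific to this project.

import json

# Purely illustrative sketch of the kind of fields the experiment scripts expect;
# the values below are placeholder assumptions, not the settings used in the report.
example_config = {
    "experiment_name": "vqvae_raw_example",
    "num_epochs": 100,
    "learning_rate": 1e-4,
    "weight_decay_coefficient": 0.0,
    "commit_coefficient": 0.25,
    "input_len": 4096,
    "num_input_quantization_channels": 256,
    "num_speakers": 10,
    "speaker_dim": 64,
    "use_gated_convolutions": True,
}

with open("experiment_configs/example_config.json", "w") as f:
    json.dump(example_config, f, indent=4)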

How to evaluate or perform VC

To evaluate the model or perform conversion, first create the model and load its trained weights. Then prepare your audio data with WORLD or mu-law preprocessing, and pad or trim it so that the model input has the correct length. After conversion, postprocess the output to produce an audio file. An example is given in this section.

1. Load configuration

from util.arg_extractor import extract_args_from_json

args = extract_args_from_json('experiment_configs/config_file.json')

2. Create model

VQVAE

from models.vqvae import VQVAE

model = VQVAE(
    input_shape=(1, 1, args.input_len),
    encoder_arch=args.encoder,
    vq_arch=args.vq,
    generator_arch=args.generator,
    num_speakers=args.num_speakers,
    speaker_dim=args.speaker_dim,
    use_gated_convolutions=args.use_gated_convolutions)

VAE

from models.vae import VAE

model = VAE(
    input_shape=(1, 1, args.input_len),
    encoder_arch=args.encoder,
    generator_arch=args.generator,
    latent_dim=args.latent_dim,
    num_speakers=args.num_speakers,
    speaker_dim=args.speaker_dim,
    use_gated_convolutions=args.use_gated_convolutions)

JointVAE

from models.joint_vae import JointVAE

model = JointVAE(
    input_shape=(1, 1, args.input_len),
    encoder_arch=args.encoder,
    generator_arch=args.generator,
    latent_dim=args.latent_dim,
    num_latents=args.num_latents,
    temperature=args.temperature,
    num_speakers=args.num_speakers,
    speaker_dim=args.speaker_dim,
    use_gated_convolutions=args.use_gated_convolutions)

3. Load model weights

To load the model weights we use the same experiment builders as in training.

import torch

from experiment_builders.vqvae_builder import VQVAERawExperimentBuilder

# Depending on the experiment, use VQVAERawExperimentBuilder, VQVAEWORLDExperimentBuilder,
# VAERawExperimentBuilder, VAEWORLDExperimentBuilder, JointVAERawExperimentBuilder,
# or JointVAEWORLDExperimentBuilder.
builder = VQVAERawExperimentBuilder(network_model=model,
                                    experiment_name=args.experiment_name,
                                    num_epochs=args.num_epochs,
                                    weight_decay_coefficient=args.weight_decay_coefficient,
                                    learning_rate=args.learning_rate,
                                    commit_coefficient=args.commit_coefficient, # This argument is only needed in VQVAE experiment builders
                                    device=torch.device('cpu'),
                                    continue_from_epoch=epoch, # Epoch of the model to load (should be your best validation model)
                                    train_data=None,
                                    val_data=None)

4. Perform conversion

Raw data feature experiments

import math
import os

import torch
import torchaudio

import util.torchaudio_transforms as transforms
from datasets.vcc_preprocessor import read_audio # Or import from vctk_preprocessor respectively

# Prepare mu-law encoding transformers
mulaw = transforms.MuLawEncoding(quantization_channels=args.num_input_quantization_channels)
mulaw_expanding = transforms.MuLawExpanding(quantization_channels=args.num_input_quantization_channels)

# Load audio
audio_path = os.path.expanduser(audio_path)
torchaudio.initialize_sox()
audio, sr = read_audio(audio_path, trim_silence=True)
torchaudio.shutdown_sox()

# Prepare an audio piece of appropriate length, e.g. as follows
audio = audio.unsqueeze(0)
audio_len = audio.shape[-1]
padding = transforms.PadTrim(math.ceil(audio.shape[-1] / args.input_len) * args.input_len)
audio = padding(audio.squeeze(0)).unsqueeze(0)
audio_split = audio.view(int(audio.shape[-1] / args.input_len), 1, args.input_len)

# Set target speaker id
target_speaker_id = torch.tensor(target_speaker_id, dtype=torch.long)

# Voice conversion
out_mulaw = builder.convert(x=mulaw(audio_split), y=target_speaker_id)

# Postprocess
out = mulaw_expanding(out_mulaw).detach().view(1, -1)
out = out[:, :audio_len]

# Save as audio file
torchaudio.save(filepath=out_file_path, src=out, sample_rate=sr)

WORLD feature experiments

import os

import numpy as np
import torch
import torchaudio

from data.vcc_world_dataset import VCCWORLDDataset
from datasets.vcc_world_preprocessor import read_audio_and_extract_features, synthesize_from_WORLD_features

# Load audio
audio_path = os.path.expanduser(audio_path)
spectra, aperiodicity, f0, energy = read_audio_and_extract_features(audio_path)

# Set target speaker id
target_speaker_id = torch.tensor(target_speaker_id, dtype=torch.long)

# Voice conversion
dataset = VCCWORLDDataset('data', scale=True)
spectra_scaled = dataset.scale_spectra(torch.tensor(spectra)).unsqueeze(1)
spectra_out = builder.convert(x=spectra_scaled, y=target_speaker_id)
spectra_out = dataset.scale_spectra_back(spectra_out)
# Convert the F0 statistics from the source speaker to the target speaker
f0_converted = dataset.convert_f0(torch.tensor(f0), source_speaker_id, args.eval_speaker_id)
spectra_out = spectra_out.squeeze(1)
# Synthesize audio
audio_out = synthesize_from_WORLD_features(f0_converted.numpy(), spectra_out.numpy(), aperiodicity, energy)
audio_out = np.clip(audio_out, a_min=-0.9, a_max=0.9)

# Save as audio
torchaudio.save(filepath=out_file_path, src=torch.tensor(audio_out.copy()), sample_rate=16000)

Models

In our evaluation we investigated three VAE variants: the standard VAE, the VQ-VAE, and the Gumbel (Joint) VAE.

Software dependencies

  • PyTorch v1.0.0 or later
  • numpy
  • pillow
  • tqdm
  • pyworld (for extracting WORLD features)
  • torchaudio (for preprocessing raw audio)
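
Assuming a standard pip-based setup, the dependencies can be installed along these lines (versions are intentionally not pinned here; the raw-audio examples above rely on the older sox-based torchaudio interface, so a matching torchaudio release may be required):

pip install torch numpy pillow tqdm pyworld torchaudio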

VCTK dataset modifications

The VCTK dataset contains some silent recordings, so the following audio samples were removed:

  • p323_424, p306_151, p351_361, p345_292, p341_101, p306_352.
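
If you prepare the VCTK corpus yourself, a small helper along the following lines can delete these utterances before preprocessing. It assumes the standard wav48/<speaker>/<utterance>.wav layout of the extracted VCTK-Corpus archive; the root path below is an assumption, not something the preprocessing scripts require.

import os

# Utterances that are silent in the VCTK release and were therefore excluded.
SILENT_UTTERANCES = ["p323_424", "p306_151", "p351_361", "p345_292", "p341_101", "p306_352"]

# Assumed extraction path of the VCTK-Corpus archive.
vctk_root = "data/vctk/VCTK-Corpus"

for utt in SILENT_UTTERANCES:
    speaker = utt.split("_")[0]
    wav_path = os.path.join(vctk_root, "wav48", speaker, utt + ".wav")
    if os.path.exists(wav_path):
        os.remove(wav_path)
        print("Removed", wav_path)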

Contributors

  • Vaidotas Simkus
  • Simon Valentin
  • Will Greedy