jgarciapueyo / MelNet-SpeechGeneration

Licence: other
Implementation of MelNet in PyTorch to generate high-fidelity audio samples

Programming Languages

Jupyter Notebook
11667 projects
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to MelNet-SpeechGeneration

AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (+468.42%)
Mutual labels:  speech, speech-synthesis, pytorch-implementation
Lingvo
Lingvo
Stars: ✭ 2,361 (+12326.32%)
Mutual labels:  speech, speech-synthesis
Voice2Mesh
CVPR 2022: Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Stars: ✭ 67 (+252.63%)
Mutual labels:  speech, speech-synthesis
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+1189.47%)
Mutual labels:  speech, speech-synthesis
Diffwave
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Stars: ✭ 139 (+631.58%)
Mutual labels:  speech, speech-synthesis
Wavegrad
A fast, high-quality neural vocoder.
Stars: ✭ 138 (+626.32%)
Mutual labels:  speech, speech-synthesis
Tacotron pytorch
PyTorch implementation of Tacotron speech synthesis model.
Stars: ✭ 242 (+1173.68%)
Mutual labels:  speech, speech-synthesis
Java Speech Api
The J.A.R.V.I.S. Speech API is designed to be simple and efficient, using the speech engines created by Google to provide functionality for parts of the API. Essentially, it is an API written in Java, including a recognizer, synthesizer, and a microphone capture utility. The project uses Google services for the synthesizer and recognizer. While this requires an Internet connection, it provides a complete, modern, and fully functional speech API in Java.
Stars: ✭ 490 (+2478.95%)
Mutual labels:  speech, speech-synthesis
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+1452.63%)
Mutual labels:  speech, speech-synthesis
TFGAN
TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis
Stars: ✭ 65 (+242.11%)
Mutual labels:  speech, speech-synthesis
StyleSpeech
Official implementation of Meta-StyleSpeech and StyleSpeech
Stars: ✭ 161 (+747.37%)
Mutual labels:  speech, speech-synthesis
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (+484.21%)
Mutual labels:  speech, speech-synthesis
Wsay
Windows "say"
Stars: ✭ 36 (+89.47%)
Mutual labels:  speech, speech-synthesis
Wavenet vocoder
WaveNet vocoder
Stars: ✭ 1,926 (+10036.84%)
Mutual labels:  speech, speech-synthesis
Lightspeech
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
Stars: ✭ 31 (+63.16%)
Mutual labels:  speech, speech-synthesis
Neural Voice Cloning With Few Samples
Implementation of Neural Voice Cloning with Few Samples Research Paper by Baidu
Stars: ✭ 211 (+1010.53%)
Mutual labels:  speech, speech-synthesis
Pysptk
A python wrapper for Speech Signal Processing Toolkit (SPTK).
Stars: ✭ 297 (+1463.16%)
Mutual labels:  speech, speech-synthesis
Voice Builder
An opensource text-to-speech (TTS) voice building tool
Stars: ✭ 362 (+1805.26%)
Mutual labels:  speech, speech-synthesis
idear
🎙️ Handsfree Audio Development Interface
Stars: ✭ 84 (+342.11%)
Mutual labels:  speech, speech-synthesis
Zero-Shot-TTS
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Stars: ✭ 33 (+73.68%)
Mutual labels:  speech, speech-synthesis

MelNet: PyTorch implementation

This project is a PyTorch implementation of S. Vasquez and M. Lewis, “MelNet: A generative model for audio in the frequency domain”, which aims to generate high-fidelity audio samples by modelling two-dimensional time-frequency representations (spectrograms) with a highly expressive probabilistic model and a multiscale generation procedure.
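To make the time-frequency representation concrete, the sketch below computes a (log) mel-spectrogram from a waveform with torchaudio. The file path and the STFT/mel parameters are placeholders chosen for illustration; the values actually used by this project are defined in the YAML files under models/params/.

import torch
import torchaudio

# Illustrative values only; the parameters used in this project are set in the
# training YAML files under models/params/.
waveform, sample_rate = torchaudio.load("datasets/example.wav")  # hypothetical path
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)
spectrogram = to_mel(waveform)              # shape: (channels, n_mels, time_frames)
log_spectrogram = torch.log1p(spectrogram)  # the 2-D time-frequency grid MelNet models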

For a more complete description of MelNet, this implementation and the results achieved than the one given in this README, please see the Report or the Presentation of the project.

Table of contents

  1. Results
  2. Project Structure
  3. Setup
    1. Setup with Anaconda
    2. Setup with Docker
  4. Usage
  5. Description of MelNet
  6. Notes

Results

Context

The tiers of each model were trained individually on an NVIDIA RTX 2080 with 8 GB of VRAM. The size of each tier is defined by the number of layers and the hidden size (the size of the RNN hidden state), with the hidden size being the parameter that affects the size of a tier the most. To fit in GPU memory, the hidden size of the models had to be reduced to 200 (from the 512 used in the original paper).
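As a rough illustration of why the hidden size dominates the size of a tier, the snippet below counts the parameters of a plain GRU. This is not the actual MelNet tier architecture, but it shows the same scaling: parameters grow roughly quadratically with the hidden size and only linearly with the number of layers.

import torch.nn as nn

def gru_params(hidden_size: int, num_layers: int, input_size: int = 1) -> int:
    # Generic GRU used only to illustrate the scaling; not the repo's tier model.
    gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
    return sum(p.numel() for p in gru.parameters())

print(gru_params(hidden_size=200, num_layers=4))  # ~0.85M parameters
print(gru_params(hidden_size=400, num_layers=4))  # ~3.4M  (doubling the hidden size ~ 4x)
print(gru_params(hidden_size=200, num_layers=8))  # ~1.8M  (doubling the layers ~ 2x)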

To describe the architectures of the trained models in a compact way, from now on we will follow this pattern: d(dataset)_t(number of tiers)_l[number of layers]_hd(hidden size)_gmm(GMM mixture components).
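For example, dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10 denotes a 6-tier model trained on the Podcast dataset with 12, 5, 4, 3, 2 and 2 layers per tier, a hidden size of 200 and 10 GMM components. A purely illustrative helper (not part of this repository) that unpacks the pattern:

import re

def parse_architecture(name: str) -> dict:
    # Hypothetical helper for the naming pattern used in this README; not code from src/.
    match = re.fullmatch(r"d(\w+)_t(\d+)_l([\d.]+)_hd(\d+)_gmm(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised architecture name: {name}")
    dataset, tiers, layers, hidden, gmm = match.groups()
    return {
        "dataset": dataset,
        "tiers": int(tiers),
        "layers": [int(n) for n in layers.split(".")],
        "hidden_size": int(hidden),
        "gmm_components": int(gmm),
    }

print(parse_architecture("dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10"))
# {'dataset': 'podcast', 'tiers': 6, 'layers': [12, 5, 4, 3, 2, 2], 'hidden_size': 200, 'gmm_components': 10}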

Initial Results

The first model was trained on a Podcast dataset (a dataset of dialogue-based podcast audio), following the architecture used by Vasquez and Lewis in MelNet for unconditional speech on Blizzard (Table 1), but with a hidden size of 200 instead of 512 due to memory constraints.

Spectrogram viewed at different stages, generated by the initial architecture. Architecture: dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10

Architecture: dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10. The wav file can be found here.

The Upsampling Layers appear to be able to add detail to the spectrogram generated by previous tiers, but the initial tier was not able to dictate a coherent high-level structure.

Experiments with Upsampling Layers Only

To see how much impact the initial tier has on the final spectrogram, we modified the synthesis algorithm. In the normal synthesis algorithm, the first tier unconditionally generates a low-resolution spectrogram and the upsampling tiers add detail. In the modified synthesis algorithm, the first tier is replaced by an item from the dataset (a real low-resolution spectrogram) and only the upsampling layers are used to add detail.
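The difference between the two procedures can be sketched as follows. This is a heavily simplified view and not the code in src/synthesis.py: each tier is treated as a callable that takes the spectrogram generated so far and returns one with more detail.

import torch

def synthesize(tiers, initial_spectrogram=None):
    # Schematic sketch of the two synthesis variants described above.
    if initial_spectrogram is None:
        # Normal synthesis: the first tier unconditionally generates a
        # low-resolution spectrogram.
        spectrogram = tiers[0](None)
    else:
        # Modified synthesis: skip the first tier and start from a real
        # low-resolution spectrogram taken from the dataset.
        spectrogram = initial_spectrogram
    for tier in tiers[1:]:
        spectrogram = tier(spectrogram)  # each upsampling tier adds detail
    return spectrogram

# Toy usage with dummy tiers that just widen the spectrogram:
dummy_tiers = [lambda _: torch.zeros(8, 8)] + [lambda s: s.repeat_interleave(2, dim=1)] * 2
print(synthesize(dummy_tiers).shape)  # torch.Size([8, 32])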

Spectrogram viewed at different stages, generated using a real low-resolution spectrogram. Architecture: dljspeech_t6_l0.7.6.5.4.4_hd200_gmm10. The first tier does not have layers because it was not used.

Architecture: dljspeech_t6_l0.7.6.5.4.4_hd200_gmm10. The first tier does not have layers because it was not used. The wav file can be found here.

Experiments with First Tier

Knowing that the first tier is important because it dictates the high-level structure of the spectrogram, we compare the impact that the hidden size and the number of layers have on the loss of the first tier.

First tier: Hidden size vs. Loss and Number of Layers vs. Loss.
Hidden size vs. Loss plot: Architecture: dljspeech_t6_l14.5.4.3.2.2_hdX_gmm10.
Number of layers vs. Loss plot: Architecture: dljspeech_t6_lX.5.4.3.2.2_hd64_gmm10.

From these results, we can conclude that the hidden size has a greater impact on the loss than the number of layers.

Final Result

Finally, after seeing that the size of the tiers has an impact on the quality of the generated spectrograms, we trained the biggest model we could.

Spectrogram viewed at different stages.

Architecture: dljspeech_t6_l12.7.6.5.4.4_hd200_gmm10. The wav file can be found here.

Project Structure

SpeechGeneration-MelNet
|-- assets/           <- images used in the README.md, Report and Presentation
|-- datasets/         <- original data used to train the model (you have to create it)
|
|-- logs/             <- (you have to create it or it will be created automatically)
|   |-- general/      <- logs for general training
|   `-- tensorboard/  <- logs for displaying in tensorboard
|
|-- models/
|   |-- chkpt/     <- model weights for different runs stored in pickle format, together with the
|   |                 training parameters (you have to create it or it'll be created automatically)
|   `-- params/    <- description of the parameters to train and do speech synthesis according 
|                     to the paper and the dataset
|
|-- notebooks/     <- Jupyter Notebooks explaining different parts of the data pipeline 
|                     or the model
|
|-- results/       <- spectrograms, waveforms and wav files synthesized from trained models
|
|-- src/                  <- source code for use in this project
|   |-- data/             <- scripts to download and load the data
|   |-- dataprocessing/   <- scripts to turn raw data into processed data to input to the model
|   |-- model/            <- scripts of the model presented in the paper
|   |-- utils/            <- scripts that are useful in the project
|   |-- synthesis.py      <- main program to perform synthesis (see Usage section)
|   `-- train.py          <- main program to perform training (see Usage section)
|
|-- utils/                <- files for running the model in Docker
|
|-- environment.yml      <- file for reproducing the environment (created with Anaconda)
`-- Makefile             <- file with commands to run the project without effort

Setup

Setup with Anaconda

  1. Download and install Anaconda
  2. Clone the source code with git:
git clone https://github.com/jgarciapueyo/MelNet-SpeechGeneration
cd MelNet-SpeechGeneration
  3. Prepare the environment with Anaconda and activate it
conda env create --name melnet -f environment.yml
conda activate melnet

Setup with Docker

  1. Download and install Docker
  2. Clone the source code with git:
git clone https://github.com/jgarciapueyo/MelNet-SpeechGeneration
cd MelNet-SpeechGeneration
  3. Create the image
docker build -f utils/docker/Dockerfile -t melnet .

or make build-container

  4. Run the container
docker run -it --rm --gpus all --mount src="$(pwd)",target=/app,type=bind melnet

or make run-container

Usage

Training

  1. Set up the project following the instructions in Setup.
  2. Download a dataset into the folder datasets/. As an example, the LibriSpeech and LJSpeech datasets can be downloaded by running
make data-librispeech
make data-ljspeech
  3. Create a YAML file for training a complete model (several tiers) on a dataset. This YAML file contains information about the architecture of the model and other parameters needed to transform the audio waveforms of the dataset into mel-spectrograms. More information about the structure of the training YAML files can be found here.
  4. Train your MelNet model
python src/train.py -p models/params/{dataset}/{training_config_file}.yml

More options for training a model can be found here, such as resuming training or specifying which tier(s) of the model to train.
When training a model, it automatically creates a log file logs/general/{modelarchitecture}/{tier}_{timestamp}, a folder for tensorboard files logs/tensorboard/{modelarchitecture}_{timestamp}_{tier}/, and a folder for the weights of the model models/chkpt/{modelarchitecture}/ (each tier is stored separately in pickle format using the .pt file extension).
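Since each tier checkpoint is a regular pickled PyTorch file, it can be inspected with torch.load. The file name below is hypothetical and the contents of the checkpoint are whatever src/train.py stores, so treat this as a sketch rather than a documented interface.

import torch

# Hypothetical checkpoint path following the pattern described above.
checkpoint_path = "models/chkpt/dljspeech_t6_l12.7.6.5.4.4_hd200_gmm10/tier1_20201201-120000.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")
print(type(checkpoint))
if isinstance(checkpoint, dict):
    # Print whatever keys the training script stored (weights, training parameters, ...).
    print(list(checkpoint.keys()))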

Synthesis

After having trained a complete model (all the tiers), you can unconditionally generate spectrograms:

  1. Create a YAML file for performing synthesis. This YAML will contain information about the path to the weights of the tiers and the output folder. More information about the synthesis YAML file can be found here.
  2. Generate spectrograms
python src/synthesis.py -p models/params/{dataset}/{training_config_file}.yml -s models/params/{dataset}/{synthesis_config_file}.yml -t {timesteps_spectrogram}

When synthesizing a spectrogram, it will be stored as an image and as a tensor in the path specified in the synthesis YAML file. It will also be saved in tensorboard format in logs/tensorboard/{modelarchitecture}_{timestamp}_{tier}/, and a log file logs/general/{modelarchitecture}/synthesis_{timestamp} will be created.
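MelNet generates a (mel) spectrogram, so an inversion step is needed to obtain the wav files linked above. The sketch below shows one generic way to do it with torchaudio (InverseMelScale followed by Griffin-Lim), under assumed audio parameters; it is not necessarily the inversion method used by this project.

import torch
import torchaudio

# Assumed audio parameters; the real values are those of the training YAML file.
sample_rate, n_fft, n_mels, hop_length = 22050, 1024, 80, 256

mel = torch.load("results/example_spectrogram.pt")  # hypothetical saved tensor, shape (n_mels, frames)
# If the spectrogram was stored in log scale, undo that first, e.g. mel = torch.exp(mel)
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop_length)

waveform = griffin_lim(inverse_mel(mel))            # (time,)
torchaudio.save("results/example.wav", waveform.unsqueeze(0), sample_rate)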

Description of MelNet

A description complementing the original paper can be found in the Report of the project, which adds new figures that help in understanding the MelNet architecture.
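As a small illustration of the probabilistic model, MelNet predicts, for every time-frequency bin, the parameters of a univariate Gaussian mixture (the gmm part of the naming pattern above, e.g. 10 components) and samples the bin's value from it. A generic sketch of that sampling step, not taken from this repository:

import torch

def sample_from_gmm(logits, means, log_scales):
    # All tensors have shape (..., K), where K is the number of mixture components.
    component = torch.distributions.Categorical(logits=logits).sample()
    mean = torch.gather(means, -1, component.unsqueeze(-1)).squeeze(-1)
    scale = torch.gather(log_scales, -1, component.unsqueeze(-1)).squeeze(-1).exp()
    return torch.normal(mean, scale)  # one sampled value per time-frequency bin

# Toy usage: a 4x6 spectrogram patch with 10 mixture components per bin.
K = 10
logits, means, log_scales = (torch.randn(4, 6, K) for _ in range(3))
print(sample_from_gmm(logits, means, log_scales).shape)  # torch.Size([4, 6])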

Notes

This project was developed as part of the course DD2465 Advanced, Individual Course in Computer Science during my studies at KTH.
