rishikksh20 / Vocgan

License: MIT
VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

Programming Languages

python

Projects that are alternatives of or similar to Vocgan

open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+432.28%)
Mutual labels:  text-to-speech, speech-synthesis, speech-processing
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+86.71%)
Mutual labels:  text-to-speech, speech-synthesis, speech-processing
Nnmnkwii
Library to build speech synthesis systems designed for easy and fast prototyping.
Stars: ✭ 308 (+94.94%)
Mutual labels:  speech-synthesis, speech-processing, text-to-speech
react-native-spokestack
Spokestack: give your React Native app a voice interface!
Stars: ✭ 53 (-66.46%)
Mutual labels:  text-to-speech, speech-synthesis, speech-processing
spokestack-ios
Spokestack: give your iOS app a voice interface!
Stars: ✭ 27 (-82.91%)
Mutual labels:  text-to-speech, speech-synthesis, speech-processing
ttslearn
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)
Stars: ✭ 158 (+0%)
Mutual labels:  text-to-speech, speech-synthesis, speech-processing
Hifi Gan
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Stars: ✭ 325 (+105.7%)
Mutual labels:  gan, speech-synthesis, text-to-speech
Openseq2seq
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
Stars: ✭ 1,378 (+772.15%)
Mutual labels:  speech-synthesis, text-to-speech
Spokestack Python
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application.
Stars: ✭ 103 (-34.81%)
Mutual labels:  speech-synthesis, text-to-speech
Wavernn
WaveRNN Vocoder + TTS
Stars: ✭ 1,636 (+935.44%)
Mutual labels:  speech-synthesis, text-to-speech
Deepvoice3 pytorch
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
Stars: ✭ 1,654 (+946.84%)
Mutual labels:  speech-synthesis, speech-processing
Zerospeech Tts Without T
A Pytorch implementation for the ZeroSpeech 2019 challenge.
Stars: ✭ 100 (-36.71%)
Mutual labels:  gan, text-to-speech
Merlin
This is now the official location of the Merlin project.
Stars: ✭ 1,168 (+639.24%)
Mutual labels:  speech-synthesis, text-to-speech
Tacotron Pytorch
A Pytorch Implementation of Tacotron: End-to-end Text-to-speech Deep-Learning Model
Stars: ✭ 104 (-34.18%)
Mutual labels:  speech-synthesis, text-to-speech
Cs224n Gpu That Talks
Attention, I'm Trying to Speak: End-to-end speech synthesis (CS224n '18)
Stars: ✭ 52 (-67.09%)
Mutual labels:  speech-synthesis, text-to-speech
Durian
Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Stars: ✭ 111 (-29.75%)
Mutual labels:  speech-synthesis, text-to-speech
Tacotron2
A PyTorch implementation of Tacotron2, an end-to-end text-to-speech(TTS) system described in "Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions".
Stars: ✭ 43 (-72.78%)
Mutual labels:  speech-synthesis, text-to-speech
Crystal
Crystal - C++ implementation of a unified framework for multilingual TTS synthesis engine with SSML specification as interface.
Stars: ✭ 108 (-31.65%)
Mutual labels:  speech-synthesis, text-to-speech
Tacotron 2
DeepMind's Tacotron-2 Tensorflow implementation
Stars: ✭ 1,968 (+1145.57%)
Mutual labels:  speech-synthesis, text-to-speech
Awesome Ai Services
An overview of the AI-as-a-service landscape
Stars: ✭ 133 (-15.82%)
Mutual labels:  speech-synthesis, text-to-speech

Modified VocGAN

This repo implements a modified version of VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network in PyTorch; for the original VocGAN, check out the baseline branch. I slightly modified VocGAN's generator and used Full-Band MelGAN's discriminator instead of VocGAN's, because in my experiments MelGAN's discriminator trains much faster and is still powerful enough to drive the generator to produce high-fidelity audio, whereas VocGAN's hierarchically-nested JCU discriminator is quite large and slows training considerably.
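Since the main change is the swapped-in discriminator, here is a minimal sketch of a MelGAN-style multi-scale waveform discriminator to illustrate the idea; the class names, channel counts, and kernel sizes below are illustrative assumptions, not the exact modules in this repo.

```python
# Minimal sketch of a MelGAN-style multi-scale waveform discriminator.
# Channel counts and kernel sizes are assumptions, not this repo's exact code.
import torch
import torch.nn as nn


class DiscriminatorBlock(nn.Module):
    """A single waveform discriminator: strided 1-D convs with grouped convolutions."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.utils.weight_norm(nn.Conv1d(1, 16, 15, stride=1, padding=7)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.utils.weight_norm(nn.Conv1d(16, 64, 41, stride=4, padding=20, groups=4)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.utils.weight_norm(nn.Conv1d(64, 256, 41, stride=4, padding=20, groups=16)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.utils.weight_norm(nn.Conv1d(256, 512, 5, stride=1, padding=2)),
                          nn.LeakyReLU(0.2)),
            nn.utils.weight_norm(nn.Conv1d(512, 1, 3, stride=1, padding=1)),
        ])

    def forward(self, x):
        feats = []                 # intermediate feature maps (useful for feature-matching loss)
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats               # last element is the score map


class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators applied to the raw, 2x- and 4x-downsampled waveform."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([DiscriminatorBlock() for _ in range(3)])
        self.pool = nn.AvgPool1d(4, stride=2, padding=1)

    def forward(self, wav):        # wav: (batch, 1, samples)
        outputs = []
        for block in self.blocks:
            outputs.append(block(wav))
            wav = self.pool(wav)   # halve the sampling rate for the next scale
        return outputs
```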

Tested on Python 3.6

pip install -r requirements.txt

Prepare Dataset

  • Download a dataset for training. This can be any set of wav files with a sample rate of 22050 Hz (e.g., LJSpeech was used in the paper).
  • Preprocess: python preprocess.py -c config/default.yaml -d [data's root path] (a conceptual sketch of this step follows below)
  • Edit the configuration yaml file
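Conceptually, preprocessing turns each 22050 Hz wav into a mel spectrogram for the generator to invert later. A minimal sketch of that idea, assuming typical parameters (1024-point FFT, hop length 256, 80 mel bins) rather than the exact values in config/default.yaml:

```python
# Conceptual sketch of preprocessing: wav -> mel spectrogram.
# The FFT size, hop length, and mel-bin count are assumed typical values,
# not necessarily those in config/default.yaml.
import glob
import os

import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

data_root = "path/to/wavs"                    # hypothetical dataset root
for wav_path in glob.glob(os.path.join(data_root, "*.wav")):
    wav, sr = torchaudio.load(wav_path)       # wav: (channels, samples)
    assert sr == 22050, "dataset must already be at 22050 Hz"
    mel = mel_fn(wav)                         # (channels, 80, frames)
    torch.save(mel, wav_path.replace(".wav", ".mel"))
```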

Train & Tensorboard

  • python trainer.py -c [config yaml file] -n [name of the run]

    • cp config/default.yaml config/config.yaml and then edit config.yaml
    • Write the root paths of the train/validation files on the 2nd and 3rd lines of config.yaml.
  • tensorboard --logdir logs/

Notes

  1. This repo implements a modified VocGAN for faster training; for the original VocGAN implementation, please check out the baseline branch. In my testing, the modified VocGAN generates high-fidelity audio in real time.
  2. The training cost of the baseline VocGAN discriminator is very high (2.8 sec/it on a P100 with batch size 16) compared to the generator (7.2 it/sec on a P100 with batch size 16), so it is not feasible for me to train that model for a long time.
  3. The baseline VocGAN discriminator could perhaps be sped up by downsampling the audio at the preprocessing stage instead of during training (currently torchaudio.transforms.Resample is used as a layer to downsample the audio); this might speed up overall discriminator training (see the sketch after this list).
  4. I trained the baseline model for 300 epochs (with batch size 16) on LJSpeech, and the quality of the generated audio is similar to MelGAN's at the same epoch on the same dataset. The authors recommend training for 3000 epochs, which is not feasible at the current training speed (2.80 sec/it).
  5. I am open to any suggestions and modifications to this repo.
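For note 3, the idea is to resample each clip once, offline, and cache the result so the discriminator no longer pays for resampling on every training step. A rough sketch, where the file names and the 2x target rate are illustrative assumptions:

```python
# Offline downsampling idea from note 3: resample once during preprocessing and
# cache the result, instead of running torchaudio.transforms.Resample as a
# layer on every training step. File names and the 2x rate are placeholders.
import torch
import torchaudio

resample_2x = torchaudio.transforms.Resample(orig_freq=22050, new_freq=11025)

wav, sr = torchaudio.load("LJ001-0001.wav")          # one clip from the dataset
torch.save(resample_2x(wav), "LJ001-0001_11025.pt")  # cached copy, reused at every step
```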

Inference

  • python inference.py -p [checkpoint path] -i [input mel path]
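For reference, a rough sketch of what the inference step amounts to (mel spectrogram in, waveform out); the import path, constructor arguments, and checkpoint keys below are assumptions about this repo's layout, and inference.py remains the supported entry point.

```python
# Rough sketch of mel -> waveform inference. Import path, constructor, and
# checkpoint keys are assumptions; use inference.py for the real thing.
import torch
import torchaudio

from model.generator import Generator        # hypothetical module path

ckpt = torch.load("vocgan_checkpoint.pt", map_location="cpu")   # placeholder path
generator = Generator(mel_channel=80)                           # assumed constructor
generator.load_state_dict(ckpt["model_g"])                      # assumed key
generator.eval()

mel = torch.load("sample.mel")                # (1, 80, frames), e.g. from preprocess.py
with torch.no_grad():
    audio = generator(mel)                    # assumed (1, 1, samples) at 22050 Hz
torchaudio.save("sample_generated.wav", audio.squeeze(0), 22050)
```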

Pretrained models

Two pretrained models are provided. Both are trained using the modified VocGAN structure.

Audio Samples

Using the pretrained models, we can reconstruct audio samples. Visit here to listen.

Results

[WIP]

References
