
israelg99 / Deepvoice

Licence: apache-2.0
Deep Voice: Real-time Neural Text-to-Speech

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Deepvoice

vonage-node-code-snippets
NodeJS code examples for using Nexmo
Stars: ✭ 46 (-84.51%)
Mutual labels:  voice
Voice-Denoising-AN
A Conditional Generative Adversarial Network (cGAN) adapted for the task of source de-noising of noisy voice auditory images. The base architecture is adapted from Pix2Pix.
Stars: ✭ 42 (-85.86%)
Mutual labels:  voice
Noisetorch
Real-time microphone noise suppression on Linux.
Stars: ✭ 5,199 (+1650.51%)
Mutual labels:  voice
ios-swift-chat-app
Open-source Voice & Video Calling and Text Chat App for Swift (iOS)
Stars: ✭ 111 (-62.63%)
Mutual labels:  voice
somleng-switch
SomlengSWITCH. Somleng's TwiML Engine, FreeSWITCH configuration and infrastructure.
Stars: ✭ 16 (-94.61%)
Mutual labels:  voice
ultimate-guide-to-voice-assistants
Curation of startups, resources, people, posts, etc. in the voice space
Stars: ✭ 55 (-81.48%)
Mutual labels:  voice
vasisualy
Vasisualy is a simple Russian voice assistant written in Python for GNU/Linux, Windows, and Android.
Stars: ✭ 33 (-88.89%)
Mutual labels:  voice
Alan Sdk Android
Alan AI Android SDK adds a voice assistant or chatbot to your app. Supports Java, Kotlin.
Stars: ✭ 278 (-6.4%)
Mutual labels:  voice
download audioset
📁 This repo makes it easy to download the raw audio files from AudioSet (32.45 GB, 632 classes).
Stars: ✭ 53 (-82.15%)
Mutual labels:  voice
Neural Voice Cloning With Few Samples
This repository has an implementation of "Neural Voice Cloning With Few Samples"
Stars: ✭ 262 (-11.78%)
Mutual labels:  voice
TeamCord
Cross voice communication between TeamSpeak and Discord
Stars: ✭ 35 (-88.22%)
Mutual labels:  voice
ios-swift-chat-ui-kit
Ready-to-use Chat UI Components for Swift (iOS)
Stars: ✭ 42 (-85.86%)
Mutual labels:  voice
assister
Private Open General Assistant Platform
Stars: ✭ 42 (-85.86%)
Mutual labels:  voice
saltychat-fivem
FiveM implementation of Salty Chat (TeamSpeak 3 based Voice Plugin)
Stars: ✭ 64 (-78.45%)
Mutual labels:  voice
Discord Rs
Rust library for the Discord chat client API
Stars: ✭ 272 (-8.42%)
Mutual labels:  voice
gm 8bit
Audio interception utils for Garry's Mod
Stars: ✭ 28 (-90.57%)
Mutual labels:  voice
AlexaAndroid
No description or website provided.
Stars: ✭ 15 (-94.95%)
Mutual labels:  voice
Alan Sdk Ionic
Alan AI Ionic SDK adds a voice assistant or chatbot to your app. Supports React, Angular.
Stars: ✭ 287 (-3.37%)
Mutual labels:  voice
Disgord
Go module for interacting with Discord's documented bot interface: Gateway, REST requests, and voice
Stars: ✭ 277 (-6.73%)
Mutual labels:  voice
Vmsg
🎵 Library for creating voice messages
Stars: ✭ 257 (-13.47%)
Mutual labels:  voice

Deep Voice

Join the chat at https://gitter.im/deep-voice/Lobby
Based on the Deep Voice paper (arXiv:1702.07825).

This repository depends on my Keras fork until it is merged into the official Keras repository.
To install it: pip3 install git+https://github.com/israelg99/keras.git
Note that this will override any previously installed Keras version.

Deep Voice is a text-to-speech system based entirely on deep neural networks.

Deep Voice comprises five models (a sketch of how they fit together follows this list):

  • Grapheme-to-phoneme converter.
  • Phoneme segmentation model.
  • Phoneme duration predictor.
  • Fundamental frequency (F0) predictor.
  • Audio synthesis model.
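
To make the data flow concrete, here is a minimal, hypothetical sketch of how the models fit together at inference time. Every function below is an illustrative stub, not this repository's API; in the real system each stub is a trained network.

```python
# Illustrative Deep Voice inference pipeline. All names are placeholders,
# not this repository's API; each stub stands in for a trained network.

def grapheme_to_phoneme(text):
    # Encoder-decoder GRU network (see "Grapheme-to-phoneme" below).
    return ["HH", "AH", "L", "OW"]           # placeholder output for "hello"

def predict_duration_and_f0(phonemes):
    # Single joint network (see "Phoneme Duration + Frequency Predictor").
    durations = [0.08] * len(phonemes)       # seconds per phoneme (placeholder)
    f0 = [[120.0, 118.0]] * len(phonemes)    # F0 contour per phoneme (placeholder)
    return durations, f0

def synthesize_audio(phonemes, durations, f0):
    # WaveNet-variant vocoder (see "Audio Synthesis").
    return b"\x00\x00"                       # raw waveform bytes (placeholder)

def text_to_speech(text):
    phonemes = grapheme_to_phoneme(text)
    durations, f0 = predict_duration_and_f0(phonemes)
    return synthesize_audio(phonemes, durations, f0)

# The phoneme segmentation model appears only at training time: it aligns
# ground-truth audio with its phoneme transcription to produce the duration
# targets the duration predictor learns from.
```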

Grapheme-to-phoneme

Abstract

The grapheme-to-phoneme converter converts written text (e.g. English characters) to phonemes (encoded using a phonemic alphabet such as ARPABET).

Architecture

Based on this architecture but with some changes.

The Grapheme-to-phoneme converter is an encoder-decoder (a code sketch follows the hyperparameters below):

  • Encoder: multi-layer, bidirectional encoder, with a gated recurrent unit (GRU) nonlinearity.
  • Decoder: identical to the encoder but unidirectional.

It takes written text as input.

Setup

  • Initialization: every decoder layer is initialized to the final hidden state of the corresponding encoder forward layer.
  • Training: the architecture is trained with teacher forcing.
  • Decoding: performed using beam search.

Hyperparameters

  • Encoder: 3 bidirectional layers with 1024 units each.
  • Decoder: 3 unidirectional layers of the same size as the encoder.
  • Beam Search: width of 5 candidates.
  • Dropout: 0.95 rate after each recurrent layer.
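
A minimal Keras sketch of a model with these hyperparameters is shown below. It is a simplification under stated assumptions: the vocabulary sizes are made-up placeholders, the decoder reads the encoder outputs directly instead of being initialized from the encoder's final hidden states, and teacher forcing and beam search are omitted.

```python
# Simplified G2P encoder-decoder sketch (assumptions noted above).
from keras.layers import Input, Embedding, GRU, Bidirectional, Dropout, Dense, TimeDistributed
from keras.models import Model

N_CHARS, N_PHONEMES, UNITS = 30, 62, 1024    # placeholder vocabulary sizes

chars = Input(shape=(None,))                 # sequence of character indices
x = Embedding(N_CHARS, UNITS)(chars)
for _ in range(3):                           # 3 bidirectional GRU encoder layers
    x = Bidirectional(GRU(UNITS, return_sequences=True))(x)
    x = Dropout(0.95)(x)                     # 0.95 per the README; note that Keras
                                             # drops at this rate, while the paper's
                                             # figure may be a keep probability
for _ in range(3):                           # 3 unidirectional GRU decoder layers
    x = GRU(UNITS, return_sequences=True)(x)
    x = Dropout(0.95)(x)
phonemes = TimeDistributed(Dense(N_PHONEMES, activation='softmax'))(x)

model = Model(chars, phonemes)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```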

Phoneme Segmentation

Abstract
  • The phoneme segmentation model locates phoneme boundaries in the voice dataset.
  • Given an audio file and a phoneme-by-phoneme transcription of the audio, the segmentation model identifies where in the audio each phoneme begins and ends.
  • The phoneme segmentation model is trained to output the alignment between a given utterance and a sequence of target phonemes. This task is similar to the problem of aligning speech to written output in speech recognition.

Architecture

The segmentation model uses a convolutional recurrent neural network based on Deep Speech 2.

The architecture graph:

  1. Audio vector.
  2. 20 MFCCs with a 10 ms stride (see the feature-extraction sketch after this list).
  3. Double 2D convolutions (frequency bins * time).
  4. Triple bidirectional recurrent GRUs.
  5. Softmax.
  6. Output sequence of phoneme pairs.
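
For step 2, the features can be computed as follows, assuming the librosa library (the repository itself may extract MFCCs differently):

```python
# Computing 20 MFCCs with a 10 ms stride using librosa (assumed library).
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)    # placeholder file and rate
hop = int(0.010 * sr)                                  # 10 ms stride in samples
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20, hop_length=hop)
print(mfccs.shape)                                     # (20, n_frames)
```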

Hyperparameters

Convolutions

  • Stride: (9, 5).
  • Dropout: 0.95 rate after the last convolution.

Recurrent layers

  • Dimensionality: 512 GRU cells for each direction.
  • Dropout: 0.95 rate after the last recurrent layer.

Training

The segmentation model is trained with the connectionist temporal classification (CTC) loss; a sketch of the full network with the CTC wiring follows.
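
Putting the steps and hyperparameters together, here is a hedged Keras sketch of the segmentation network with CTC wired in through the backend (Keras has no built-in CTC loss). The alphabet size and label lengths are placeholders, and the (9, 5) stride is applied only in the first convolution here so that the time axis stays long enough for CTC.

```python
# Segmentation network sketch: double 2D convs -> triple BiGRU -> softmax -> CTC.
from keras import backend as K
from keras.layers import (Input, Conv2D, Reshape, Bidirectional, GRU, Dropout,
                          Dense, TimeDistributed, Lambda)
from keras.models import Model

N_MFCC, N_LABELS, MAX_LABEL_LEN = 20, 62, 50     # placeholders

feats = Input(shape=(None, N_MFCC, 1))           # (time, frequency bins, channel)
x = Conv2D(32, (3, 3), strides=(9, 5), padding='same', activation='relu')(feats)
x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', activation='relu')(x)
x = Dropout(0.95)(x)                             # after the last convolution
x = Reshape((-1, 4 * 32))(x)                     # 20/5 = 4 frequency bins x 32 filters
for _ in range(3):                               # triple bidirectional GRUs
    x = Bidirectional(GRU(512, return_sequences=True))(x)
x = Dropout(0.95)(x)                             # after the last recurrent layer
y_pred = TimeDistributed(Dense(N_LABELS + 1, activation='softmax'))(x)  # +1 CTC blank

labels = Input(shape=(MAX_LABEL_LEN,))
input_len, label_len = Input(shape=(1,)), Input(shape=(1,))
ctc = Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]))(
    [labels, y_pred, input_len, label_len])

model = Model([feats, labels, input_len, label_len], ctc)
model.compile(optimizer='adam', loss=lambda y_true, y_out: y_out)  # CTC is the loss
```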

Phoneme Duration + Frequency Predictor

Abstract

A single architecture is used to jointly predict phoneme duration and time-dependent fundamental frequency (a model sketch follows the hyperparameters below).

Phoneme Duration Abstract

The phoneme duration predictor predicts the temporal duration of every phoneme in a phoneme sequence (an utterance).

Frequency Predictor Abstract

The frequency predictor predicts whether a phoneme is voiced. If it is, the model predicts the fundamental frequency (F0) throughout the phoneme’s duration.

Architecture

  1. A sequence of phonemes with stresses, encoded as one-hot vectors.
  2. Double fully-connected layers.
  3. Double unidirectional recurrent layers.
  4. Fully-connected layer.

Hyperparameters

Double fully-connected layers

  • Dimensionality: 256.
  • Dropout: 0.8 rate after the last layer.

Double unidirectional recurrent layers

  • Dimensionality: 128 GRUs.
  • Dropout: 0.8 rate after the last layer.
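
Assembled from the architecture and hyperparameters above, a hedged Keras sketch of the joint model might look like the following. The phoneme feature width, the output heads, and the losses are assumptions; the real model predicts voicedness alongside the F0 contour, as described in the abstract.

```python
# Joint phoneme-duration + F0 predictor sketch (output heads are assumptions).
from keras.layers import Input, Dense, GRU, Dropout, TimeDistributed
from keras.models import Model

N_FEATS = 70                                     # one-hot phoneme + stress (placeholder)

phonemes = Input(shape=(None, N_FEATS))
x = TimeDistributed(Dense(256, activation='relu'))(phonemes)  # double FC layers
x = TimeDistributed(Dense(256, activation='relu'))(x)
x = Dropout(0.8)(x)                              # after the last FC layer
x = GRU(128, return_sequences=True)(x)           # double unidirectional GRU layers
x = GRU(128, return_sequences=True)(x)
x = Dropout(0.8)(x)                              # after the last recurrent layer
x = TimeDistributed(Dense(64, activation='relu'))(x)          # final FC layer

duration = TimeDistributed(Dense(1), name='duration')(x)      # seconds per phoneme
voiced = TimeDistributed(Dense(1, activation='sigmoid'), name='voiced')(x)
f0 = TimeDistributed(Dense(1), name='f0')(x)                  # F0 where voiced

model = Model(phonemes, [duration, voiced, f0])
model.compile(optimizer='adam',
              loss={'duration': 'mse', 'voiced': 'binary_crossentropy', 'f0': 'mse'})
```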

Audio Synthesis

Abstract

  • Combines the outputs of the grapheme-to-phoneme, phoneme duration, and frequency predictor models.
  • Synthesizes audio at a high sampling rate, corresponding to the desired text.
  • Uses a WaveNet variant which requires fewer parameters and is faster to train.

Architecture

The architecture is based on WaveNet, with some changes. A full description will be added soon; in the meantime, the generic WaveNet building block is sketched below.
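
The building block shared by all WaveNet-style models is a stack of dilated causal convolutions with gated activations and residual connections. The sketch below shows that generic block only; it is not this repository's specific variant, and all sizes are placeholders.

```python
# Generic WaveNet core: dilated causal convolutions with gated activations.
from keras.layers import Input, Conv1D, Multiply, Add, Activation
from keras.models import Model

QUANT = 256                                      # 8-bit mu-law quantized audio

samples = Input(shape=(None, QUANT))             # one-hot audio samples
x = Conv1D(64, 2, padding='causal')(samples)
skips = []
for dilation in [1, 2, 4, 8, 16, 32]:            # exponentially growing receptive field
    f = Conv1D(64, 2, padding='causal', dilation_rate=dilation, activation='tanh')(x)
    g = Conv1D(64, 2, padding='causal', dilation_rate=dilation, activation='sigmoid')(x)
    z = Multiply()([f, g])                       # gated activation unit
    skips.append(Conv1D(64, 1)(z))               # skip connection to the output stack
    x = Add()([x, Conv1D(64, 1)(z)])             # residual connection
out = Activation('relu')(Add()(skips))
out = Conv1D(QUANT, 1, activation='softmax')(out)  # distribution over next sample

model = Model(samples, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```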
