WARNING: This repository was an experiment and is no longer maintained.

tacorn

TTS framework bridging different 2018/2019 state-of-the-art open source methods. It currently aims to combine the Tacotron-2 implementation by Rayhane-mamah (https://github.com/Rayhane-mamah/Tacotron-2) with a fork of fatchord's alternative WaveRNN implementation (https://github.com/fatchord/WaveRNN). The overall goal is to make it easier to swap out individual components.

Introduction

Speech synthesis systems consist of multiple components which have traditionally been developed manually and are increasingly being replaced by machine learning models.

Here we define three components used in statistical parametric speech synthesis. We do not consider unit selection, hybrid unit selection, or physical modeling based systems.

Data flows through these components, with each producing intermediate representations that are then input to the next component. During training we typically deal with large datasets and intermediate representations are usually stored on disk; at synthesis time we want to avoid this and aim to hold everything in memory.
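As a conceptual sketch of this in-memory pipeline (the function names below are hypothetical placeholders, not the actual tacorn API), synthesis simply chains the three components:

```python
import numpy as np

def synthesize(text, text_analysis, acoustic_model, vocoder):
    """Conceptual synthesis pipeline: intermediate representations stay in memory.

    text_analysis, acoustic_model and vocoder are hypothetical callables standing
    in for the three components described below.
    """
    linguistic_spec = text_analysis(text)                 # e.g. phone sequence with context features
    acoustic_features = acoustic_model(linguistic_spec)   # e.g. mel spectrogram frames
    waveform = vocoder(acoustic_features)                 # raw audio samples
    return np.asarray(waveform)
```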

Text analysis

Component to generate a linguistic specification from text input.

Traditionally this involves hand-coded language-specific rules, a pronunciation dictionary, a letter-to-sound (or grapheme-to-phoneme) model for out-of-dictionary words and potentially additional models, e.g. ToBI endtone prediction, part-of-speech tagging, phrasing prediction etc. The result for a given input sentence is a sequence of linguistic specifications, for example encoded as HTK labels. This specification holds at least a sequence of phones (or phonemes) but typically also includes contextual information like surrounding phones, punctuation, and counts of segments, syllables, words, phrases etc. (see for example https://github.com/MattShannon/HTS-demo_CMU-ARCTIC-SLT-STRAIGHT-AR-decision-tree/blob/master/data/lab_format.pdf). Examples of systems that perform text analysis are Festival, Flite and Ossian (REFs).
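A minimal sketch of the traditional lookup-with-fallback idea (the lexicon entries and the fallback rule are purely illustrative, not a real letter-to-sound model):

```python
# Toy pronunciation lexicon in ARPAbet-style notation (illustrative entries only).
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}

def grapheme_to_phoneme(word):
    """Look the word up in the lexicon, falling back to a trivial letter-to-sound rule.

    A real system would use a trained G2P model for out-of-dictionary words.
    """
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Naive fallback: one pseudo-phone per letter (placeholder for a real LTS model).
    return [letter.upper() for letter in word if letter.isalpha()]

print(grapheme_to_phoneme("speech"))  # ['S', 'P', 'IY1', 'CH']
print(grapheme_to_phoneme("tacorn"))  # fallback: ['T', 'A', 'C', 'O', 'R', 'N']
```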

Recent systems take a step towards end-to-end synthesis and aim to replace these often complex codebases with machine learning models. Here we focus on Tacotron (REF).

Acoustic feature prediction

Component consuming a linguistic specification to predict an intermediate acoustic representation.

Intermediate acoustic representations are used because of their useful properties for modeling, but also because they typically have a lower time resolution than the raw waveforms. Almost all commonly used representations employ a Fourier transformation, so with a commonly used window shift of 5 ms we end up with only 200 feature vectors per second instead of 48,000 samples for 48 kHz speech. Commonly used features include Mel-Frequency Cepstral Coefficients (MFCCs) and Line Spectral Pairs (LSPs). In addition, features like fundamental frequency (F0) or aperiodicity are commonly used.
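For illustration, a hedged sketch of extracting a mel spectrogram with librosa (the parameter values and the file path are examples, not the settings used by the included Tacotron-2 fork); with a 5 ms hop at 48 kHz the hop length is 240 samples, giving 200 frames per second:

```python
import librosa
import numpy as np

# Example analysis settings; the actual values are defined by the component configurations.
sr = 48000                     # sampling rate in Hz
hop_length = int(0.005 * sr)   # 5 ms window shift -> 240 samples -> 200 frames/second
n_fft = 2048
n_mels = 80

y, _ = librosa.load("example.wav", sr=sr)  # placeholder path
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
)
log_mel = np.log(np.maximum(mel, 1e-5))    # log compression, as commonly used
print(log_mel.shape)                       # (n_mels, ~200 frames per second of audio)
```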

The acoustic feature prediction component traditionally often employed a separate duration model to predict the number of acoustic feature frames to be generated for each segment (i.e. phone), followed by an acoustic model to predict the actual acoustic features. Here we focus on Tacotron, which employs an attention-based sequence-to-sequence model to merge duration and acoustic feature prediction into a single model.

Waveform generation

Component generating waveforms from acoustic features.

The component performing this operation is often called a vocoder and traditionally involves signal processing to encode and decode speech. Examples of vocoders are STRAIGHT, WORLD, hts_engine, GlottHMM, GlottDNN and Vocaine.

Recently, neural vocoders have been employed with good success; examples include WaveNet, WaveRNN, WaveGlow, FFTNet and SampleRNN (REFs). The main disadvantage of neural vocoders is that they are yet another model that has to be trained, typically even per speaker. This not only means additional computing resources and time but also complicates deployment and requires additional hyperparameter tuning for this model. Possibilities to work around this include multi-speaker models or speaker-independent modeling (https://arxiv.org/abs/1811.06292).

Here we focus on WaveRNN, although the currently included Tacotron-2 implementation by Rayhane-mamah also includes WaveNet.
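Neural vocoders such as WaveNet, and many WaveRNN variants, predict quantized sample values rather than raw floats; below is a minimal sketch of the commonly used mu-law companding, independent of any particular implementation:

```python
import numpy as np

def mulaw_encode(x, bits=8):
    """Compress samples in [-1, 1] to 2**bits discrete levels (mu-law companding)."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)  # map [-1, 1] -> [0, mu]

def mulaw_decode(q, bits=8):
    """Invert mu-law companding back to samples in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 5)
print(mulaw_encode(x))                 # [  0  16 128 239 255]
print(mulaw_decode(mulaw_encode(x)))   # approximately recovers x
```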

Experiment folder contents

  • config: holds configurations for the experiment and the components.
  • raw: input corpus.
  • raw/wavs: input waveforms.
  • raw/meta: input meta information, typically at least a transcription.
  • features: holds intermediate representations used in training and synthesis.
  • features/acoustic: holds preprocessed features for acoustic model training, e.g. mel spectrum, linguistic specifications.
  • features/acoustic2wavegen: holds output features from acoustic used as input to wavegen.
  • features/acoustic2wavegen/training: holds output features from acoustic used as input to wavegen training (e.g. ground-truth-aligned mel spectra).
  • features/acoustic2wavegen/synthesis: holds output features from acoustic used as input to wavegen synthesis (e.g. mel spectra).
  • features/wavegen: holds input features for the waveform generation model training, e.g. mel spectrum and raw waveforms.
  • models: working directories for models/components.
  • models/acoustic: working directory for the acoustic feature prediction component.
  • models/wavegen: working directory for the waveform generation component.
  • synthesized: synthesized wave files and meta information.
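A hedged sketch of laying out such an experiment directory (the actual layout is created by create.py; the helper below is only illustrative):

```python
import os

# Subdirectories as described in the list above.
SUBDIRS = [
    "config",
    "raw/wavs",
    "raw/meta",
    "features/acoustic",
    "features/acoustic2wavegen/training",
    "features/acoustic2wavegen/synthesis",
    "features/wavegen",
    "models/acoustic",
    "models/wavegen",
    "synthesized",
]

def create_experiment_dir(root):
    """Create the experiment folder skeleton described above (illustrative only)."""
    for sub in SUBDIRS:
        os.makedirs(os.path.join(root, sub), exist_ok=True)

create_experiment_dir("experiments/demo")  # hypothetical experiment path
```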

Process

Create

  • Input: Configuration parameters
  • Output: Configured experiment directory
  • Invocation: create.py

Creates a new experiment directory.

Preprocessing

  • Input: corpus in raw or given by parameter
  • Output: processed features in features or acoustic_model
  • Invocation: preprocess.py

Preprocesses waveforms and orthographic transcriptions.

Training

  • Input: processed features
  • Output: trained models in acoustic_model and wavegen_model
  • Invocation: train.py

Trains the feature prediction and neural vocoder models.

Synthesis

  • Input: text, trained models in acoustic_model and wavegen_model
  • Output: wavefiles in synthesized_wavs
  • Invocation: synthesis.py
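
Synthesizes wave files from the input text using the trained models.

As a hedged sketch of the overall process, the four steps above could be run back to back; the command-line arguments shown are hypothetical placeholders, not the actual interface of the scripts:

```python
import subprocess

# Hypothetical end-to-end run; the arguments below are placeholders, not the
# actual CLI of create.py / preprocess.py / train.py / synthesis.py.
experiment = "experiments/demo"
steps = [
    ["python", "create.py", experiment],
    ["python", "preprocess.py", experiment],
    ["python", "train.py", experiment],
    ["python", "synthesis.py", experiment, "Hello world."],
]
for cmd in steps:
    subprocess.run(cmd, check=True)
```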

Export

TODO
