License: BSD-3-Clause

pytorch-kaldi-neural-speaker-embeddings

A lightweight neural speaker embedding extraction toolkit based on Kaldi and PyTorch.
The repository serves as a starting point for users to reproduce and experiment with several recent advances in the speaker recognition literature. Kaldi is used for pre-processing and post-processing, and PyTorch is used for training the neural speaker embeddings. Note that this repo is not meant to track the state of the art in speaker recognition; most likely the models will be considered outdated in a few months (or sooner :().

This repository contains a PyTorch+Kaldi pipeline to reproduce the core results of the papers cited below:

With some modifications, you can easily adapt the pipeline to related tasks:

To go further, take a look at our recent work on multi-speaker text-to-speech, where the same speaker embeddings are employed to model speaker characteristics in a text-to-speech system.

Lastly, kindly cite our paper(s) if you find this repository useful. Cite both if you are kind enough!

@article{villalba2019state,
  title={State-of-the-art speaker recognition with neural network embeddings in {NIST SRE18} and {Speakers in the Wild} evaluations},
  author={Villalba, Jes{\'u}s and Chen, Nanxin and Snyder, David and Garcia-Romero, Daniel and McCree, Alan and Sell, Gregory and Borgstrom, Jonas and Garc{\'\i}a-Perera, Leibny Paola and Richardson, Fred and Dehak, R{\'e}da and others},
  journal={Computer Speech \& Language},
  pages={101026},
  year={2019},
  publisher={Elsevier}
}
@article{cooper2019zero,
  title={Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings},
  author={Cooper, Erica and Lai, Cheng-I and Yasuda, Yusuke and Fang, Fuming and Wang, Xin and Chen, Nanxin and Yamagishi, Junichi},
  journal={arXiv preprint arXiv:1910.10838},
  year={2019}
}

One should also check out the very nicely written TensorFlow version by Yi Lu.

Overview

Neural speaker embeddings: Encoder --> Pooling --> Classification
LDE pooling method illustration:
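As a concrete illustration, the following NumPy sketch shows the core LDE computation: each frame is softly assigned to a learnable dictionary of components, and the weighted residuals are aggregated per component into a fixed-size utterance representation. This follows the general LDE formulation; the function name, the per-component scale parameterization, and the stabilizing epsilon are illustrative assumptions, not the repository's exact code.

```python
import numpy as np

def lde_pooling(frames, centers, scales):
    """Learnable Dictionary Encoding pooling (inference-time sketch).

    frames:  (T, D) frame-level features from the encoder
    centers: (C, D) learnable dictionary components
    scales:  (C,)   learnable positive smoothing factors
    Returns a (C * D,) utterance-level embedding.
    """
    # residuals r[t, c] = x_t - mu_c  -> shape (T, C, D)
    resid = frames[:, None, :] - centers[None, :, :]
    # soft assignment of each frame to each dictionary component
    logits = -scales[None, :] * np.sum(resid ** 2, axis=2)   # (T, C)
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                        # rows sum to 1
    # aggregate weighted residuals per component, normalized by total weight
    num = np.einsum('tc,tcd->cd', w, resid)                  # (C, D)
    den = w.sum(axis=0)[:, None] + 1e-8
    return (num / den).reshape(-1)
```

With, say, C=64 components and D=128-dim frames, the pooled output is 64*128-dimensional before the segment-level layers project it down to the embedding size.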

Requirements

pip install -r requirements.txt

Please also download and properly set up Kaldi. If you get stuck at this step, this repository is likely not for you.

Getting Started

The bash script pipeline.sh contains the 12-stage speaker recognition pipeline, including feature extraction, neural model training, and decoding/evaluation. A more detailed description of each stage can be found in pipeline.sh itself. To get started, simply run: ./pipeline.sh

Datasets

The models are trained on VoxCeleb I+II, which is free to download (the trial lists are available there as well). One can easily adapt pipeline.sh for different datasets.

Pre-Trained Models

Due to YouTube's privacy policy, unfortunately I am not allowed to upload pre-trained models for VoxCeleb I+II.

Benchmarking Speaker Verification EERs

| Embedding name | Dimension | Normalization | Pooling | Train objective | EER | DCFmin (0.01) |
|---|---|---|---|---|---|---|
| i-vectors | 400 | no | mean | EM | 5.329 | 0.493 |
| x-vectors | 512 | no | mean, std | Softmax | 3.298 | 0.343 |
| x-vectorsN | 512 | yes | mean, std | Softmax | 3.213 | 0.342 |
| LDE-1 | 512 | no | mean | Softmax | 3.415 | 0.366 |
| LDE-1N | 512 | yes | mean | Softmax | 3.446 | 0.365 |
| LDE-2 | 512 | no | mean | ASoftmax (m=2) | 3.674 | 0.364 |
| LDE-2N | 512 | yes | mean | ASoftmax (m=2) | 3.664 | 0.386 |
| LDE-3 | 512 | no | mean | ASoftmax (m=3) | 3.033 | 0.314 |
| LDE-3N | 512 | yes | mean | ASoftmax (m=3) | 3.171 | 0.327 |
| LDE-4 | 512 | no | mean | ASoftmax (m=4) | 3.112 | 0.315 |
| LDE-4N | 512 | yes | mean | ASoftmax (m=4) | 3.271 | 0.327 |
| LDE-5 | 256 | no | mean | ASoftmax (m=2) | 3.287 | 0.343 |
| LDE-5N | 256 | yes | mean | ASoftmax (m=2) | 3.367 | 0.351 |
| LDE-6 | 200 | no | mean | ASoftmax (m=2) | 3.266 | 0.396 |
| LDE-6N | 200 | yes | mean | ASoftmax (m=2) | 3.266 | 0.396 |
| LDE-7 | 512 | no | mean, std | ASoftmax (m=2) | 3.091 | 0.303 |
| LDE-7N | 512 | yes | mean, std | ASoftmax (m=2) | 3.171 | 0.328 |
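The "mean" vs. "mean, std" entries in the pooling column refer to temporal statistics pooling, which collapses variable-length frame sequences into a fixed-size vector. A minimal NumPy sketch (the function name and signature are assumptions for illustration, not the repository's API):

```python
import numpy as np

def stats_pooling(frames, use_std=True):
    """Temporal statistics pooling over frame-level features.

    frames: (T, D) encoder outputs for one utterance.
    Returns (D,) for mean-only pooling, or (2*D,) when the
    per-dimension standard deviation is appended (as in the
    x-vector and LDE-7 rows above).
    """
    mean = frames.mean(axis=0)
    if not use_std:
        return mean
    std = frames.std(axis=0)   # per-dimension standard deviation over time
    return np.concatenate([mean, std])
```

Appending the standard deviation roughly doubles the statistics fed to the segment-level layers, which is consistent with the small EER gains the "mean, std" rows show over mean-only pooling.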

Using Speaker Embeddings for Tacotron2 Speaker Adaptation

Speaker Embedding Space Visualization (clustered by speaker)

i-vectors (baseline)

LDE

Benchmarking TTS MOS scores

| Embedding name | Naturalness (dev) | Naturalness (test) | Similarity (dev) | Similarity (test) |
|---|---|---|---|---|
| vocoded | 3.41 | 3.55 | 2.79 | 2.82 |
| x-vectorsN | 3.19 | 3.19 | 1.86 | 2.37 |
| LDE-1 | 3.16 | 3.21 | 2.05 | 2.34 |
| LDE-1N | 3.13 | 3.46 | 1.97 | 2.45 |
| LDE-2 | 3.28 | 3.35 | 2.00 | 2.37 |
| LDE-2N | 3.19 | 3.33 | 2.00 | 2.35 |
| LDE-3 | 3.24 | 3.48 | 1.88 | 2.46 |
| LDE-3N | 3.16 | 3.33 | 2.00 | 2.37 |
| LDE-4 | 3.10 | 3.29 | 2.00 | 2.31 |
| LDE-4N | 3.20 | 3.29 | 1.98 | 2.39 |
| LDE-5 | 3.26 | 3.40 | 1.99 | 2.45 |
| LDE-5N | 3.07 | 3.37 | 2.02 | 2.41 |
| LDE-6 | 3.25 | 3.33 | 1.95 | 2.43 |
| LDE-6N | 3.29 | 3.23 | 1.94 | 2.39 |
| LDE-7 | 3.03 | 3.18 | 1.86 | 2.28 |
| LDE-7N | 3.02 | 3.24 | 2.02 | 2.42 |

Credits

Base code written by Nanxin Chen, Johns Hopkins University
Experiments done by Cheng-I Lai, MIT
