mycrazycracy / Tf Kaldi Speaker

Licence: apache-2.0
Neural speaker recognition/verification system based on Kaldi and Tensorflow


Important Note:

When you extract speaker embeddings with extract.sh, make sure that your TensorFlow is compiled WITHOUT MKL. As far as I know, some versions of TF installed by Anaconda are compiled with MKL, which uses multiple threads when TF runs on CPUs. This is harmful if you run many processes at once (say 40): the thread contention makes the extraction extremely slow. In my case, installing TF 1.12 with pip works.
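If you cannot avoid an MKL build, one hedged workaround (effectiveness depends on the TF build) is to pin the math libraries to a single thread per process before TF is imported:

```python
import os

# Hypothetical workaround if you are stuck with an MKL build of TF:
# restrict each process to one math-library thread before importing TF,
# so many parallel extraction processes do not fight over cores.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# import tensorflow as tf  # import TF only after the variables are set
```

Note the variables must be set before the first TF import; exporting them in the shell that launches extract.sh achieves the same thing.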


Overview

The tf-kaldi-speaker implements a neural network based speaker verification system using Kaldi and TensorFlow.

The main idea is that Kaldi can be used for the pre- and post-processing, while TF is a better choice for building the neural network. Compared with Kaldi nnet3, modifying the network (e.g. adding attention, using different loss functions) costs less effort in TF. Adding other features to support text-dependent speaker verification is also possible.

The purpose of the project is to make research on neural network based speaker verification easier. I also try to reproduce some results from my papers.

Requirement

  • Python: 2.7 (Update to 3.6/3.7 should be easy.)

  • Kaldi: >5.5

    Since Kaldi is only used for the pre- and post-processing, most versions >5.2 work. Though I'm not 100% sure, I believe any Kaldi with x-vector support (e.g. egs/sre16/v2) is enough. But if you want to run egs/voxceleb, make sure your Kaldi also contains this example.

  • Tensorflow: >1.4.0

    I wrote the code with TF 1.4.0 at the very beginning and later updated to v1.12.0. Future versions will support TF >1.12, but I will try to keep the API compatible with lower versions. Due to API changes (e.g. keep_dims was renamed to keepdims in some functions), you may encounter incorrect-parameter errors; simply checking the parameter names should fix them.
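As an illustration of smoothing over the keep_dims/keepdims rename, a small compatibility helper (hypothetical, not part of this repo) can retry with the old keyword when the new one is rejected:

```python
# Hypothetical compatibility helper for the TF keep_dims -> keepdims rename.
# It calls a reduce-style function with the new keyword and falls back to
# the old one if the installed TF version does not accept it.
def reduce_compat(fn, x, axis=None, keepdims=True):
    try:
        return fn(x, axis=axis, keepdims=keepdims)
    except TypeError:  # older TF versions use keep_dims instead
        return fn(x, axis=axis, keep_dims=keepdims)
```

This is a sketch only; a TypeError raised inside fn for unrelated reasons would also trigger the fallback.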

Methodology

The general pipeline of our framework is:

  • For training:
  1. Kaldi: Data preparation --> feature extraction --> training example generation (CMVN + VAD + ...)
  2. TF: Network training (training examples + nnet config)
  • For test:
  1. Kaldi: Data preparation --> feature extraction
  2. TF: Embedding extraction
  3. Kaldi: Backend classifier (Cosine/PLDA) --> performance evaluation
  • Evaluate the performance:
    • MATLAB is used to compute the EER, minDCF08, minDCF10, minDCF12.
    • If you do not have MATLAB, Kaldi also provides scripts to compute the EER and minDCFs. Note that the minDCF08 from Kaldi is 10x larger than the DETware value due to the computation method.
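As a rough fallback when neither MATLAB nor the Kaldi scripts are at hand, the EER can be approximated directly from target and non-target scores. This is a minimal sketch (the function name is mine, not part of the repo):

```python
# Minimal sketch: approximate the equal error rate (EER) from raw scores
# by sweeping every observed score as a decision threshold and taking the
# point where the miss rate and false-alarm rate are closest.
def compute_eer(target_scores, nontarget_scores):
    best = 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, max(miss, fa))  # EER is where the two rates cross
    return best
```

For papers you should still use the DETware/Kaldi tooling; this only gives a quick sanity check.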

In our framework, the speaker embedding can be trained and extracted using different network architectures. Again, the backend classifier is integrated using Kaldi.

Features

  • Entire pipeline of neural network based speaker verification.
  • Both training from scratch and fine-tuning a pre-trained model are supported.
  • Standard x-vector architecture (with minor modification).
  • Angular softmax, additive margin softmax, additive angular margin softmax, triplet loss and other loss functions.
  • Self attention and other attention methods.
  • Multi-GPU training is supported. Since we use data parallelism, the data and gradients are distributed across GPUs every step. This may limit the speed when too many GPUs are used, but it works well for 2/4/8 GPUs, which is enough for me.
  • Examples including VoxCeleb and SRE. Refer to the Fisher example to customize your own dataset. A standard VoxCeleb example which uses the official training list (i.e. the VoxCeleb2 dev set) is in egs/voxceleb/v2 (using TDNN) and v3 (using ResNet).
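To illustrate the margin-based losses listed above, here is a hedged sketch of the additive margin softmax idea. The helper is hypothetical and shows only the logit adjustment, not the full TF loss:

```python
# Hypothetical sketch of additive margin softmax (AM-softmax):
# subtract a margin m from the target class's cosine score, then scale
# all scores by s before they enter the usual softmax cross-entropy.
def am_softmax_logits(cosines, label, m=0.35, s=30.0):
    """cosines: cos(theta) per class; label: index of the true class."""
    return [s * (c - m) if i == label else s * c
            for i, c in enumerate(cosines)]
```

Angular softmax and additive angular margin softmax follow the same pattern but modify the target-class angle instead of the cosine.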

Usage

  • The demos for SRE and VoxCeleb are included in egs/{sre,voxceleb}. Follow run.sh to go through the code.
  • The neural networks are configured using JSON files which are included in nnet_conf and the usage of the parameters is exhibited in the demos.
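For illustration only, loading such a JSON config might look like the snippet below; the keys shown here are hypothetical, and the real ones are defined by the files in nnet_conf and exhibited in the demos.

```python
import json

# Hypothetical illustration of reading a network config.
# The actual parameter names live in the JSON files under nnet_conf.
config_text = '{"loss_func": "arcsoftmax", "margin": 0.25, "scale": 32}'
config = json.loads(config_text)
```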

Performance & Speed

  • Performance

    I've tested the code on three datasets, and the results are better than the standard Kaldi recipes. (Of course, you may achieve better performance with Kaldi by carefully tuning its parameters.)

    See RESULTS for details.

  • Speed

    With a single GPU, the speed is not very fast but acceptable for medium-scale datasets. For VoxCeleb, training takes about 2.5 days on an Nvidia P100, and ~4 days for SRE.

    Training can be accelerated if multiple GPUs are used.

Pretrained models

  • VoxCeleb

    Training data: VoxCeleb1 dev set and VoxCeleb2

    Google Drive and

    BaiduYunDisk (extraction code: xwu6)

  • NIST SRE

    Training data: NIST SRE04-08, SWBD

    Only the models trained with large margin softmax are released at this moment.

    Google Drive and

    BaiduYunDisk (extraction code: rt9p)

Pros and cons

  • Advantages

    1. Performance: Our code is shown to perform better than the standard Kaldi recipes.
    2. Storage: There is no need to generate packed egs as Kaldi does; training loads the data on the fly.
    3. Flexibility: Changing the network architecture and loss function is pretty easy.
  • Disadvantages

    1. Since no packed egs are generated, multiple CPUs must be used to load the data during training.

Other discussions

  • In this code, I provide two possible methods to tune the learning rate when SGD is used: using a validation set or using a fixed schedule file. The first method works well, but it may take longer to train the network.

  • David Snyder and Dan Povey just released a new paper about diarization performance using x-vectors. The network in that paper is extended to more than 10 layers. You may like to change model/tdnn.py to implement the new network. ResNet is also used in many works. I haven't done anything to find the best network architecture. A deeper network is worth trying since we have enough training data.
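The validation-based learning-rate tuning mentioned in the first point can be sketched as follows. This is a hypothetical helper, not code from the repo: halve the rate whenever the validation loss stops improving.

```python
# Hypothetical sketch of validation-based learning-rate tuning for SGD:
# keep the rate while the validation loss improves, halve it otherwise.
def next_learning_rate(lr, prev_val_loss, val_loss, factor=0.5, tol=0.0):
    if val_loss >= prev_val_loss - tol:
        return lr * factor  # no (sufficient) improvement: decay the rate
    return lr               # still improving: keep the current rate
```

The fixed-file method instead reads the rate for each epoch from a schedule prepared in advance.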

License

Apache License, Version 2.0 (Refer to LICENCE)

Acknowledgements

The computational resources were initially provided by Prof. Mark Gales at the Cambridge University Engineering Department (CUED), and are now mainly supported by Dr. Liang He at the Tsinghua University Electronic Engineering Department (THUEE).

Last ...

  • Unfortunately, the code was developed under Windows, so the executable permission on scripts cannot be maintained properly. After downloading the code, simply run:

    find ./ -name "*.sh" -exec chmod +x {} +
    

    to add the executable ('x') permission to the .sh files.

  • For cluster setup, please refer to Kaldi for help. In my case, the program runs locally; modify cmd.sh and path.sh according to the standard Kaldi setup.

  • If you encounter any problems, please open an issue.

  • If you have any extensions, feel free to create a PR.

  • Details (configurations, adding network components, etc.) will be updated later.

  • Contact:

    Website: http://yiliu.org.cn

    E-mail: liu-yi15 (at) mails (dot) tsinghua (dot) edu (dot) cn

Related papers

For large margin softmax loss, please cite:

@inproceedings{liu2019speaker,
  author    = {Yi Liu and Liang He and Jia Liu},
  title     = {Large Margin Softmax Loss for Speaker Verification},
  booktitle = {Proc. INTERSPEECH},
  year      = {2019}
}