jtkim-kaist / Vad

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.


Voice Activity Detection Toolkit

This toolkit provides the voice activity detection (VAD) code and our recorded dataset.

Update

2019-02-11

  • This toolkit was accepted for presentation at ICASSP 2019!

2018-12-11

  • Post-processing has been updated.

2018-06-04

  • Good news! We have uploaded a speech enhancement toolkit based on deep neural networks. It provides several useful utilities, such as a data generation script. You can find the toolkit here

2018-04-09

  • The test script, fully written in Python, has been uploaded to the 'py' branch.

Introduction


The VAD toolkit in this project was used in the paper:

J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," in IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1.

URL: https://ieeexplore.ieee.org/document/8309294/

If our VAD toolkit supports your research, we would appreciate it if you cite this paper.

ACAM is based on the recurrent attention model (RAM) [1]; implementations of RAM can be found in the jlindsey15 and jtkim-kaist repositories.

VAD in this toolkit follows the procedure below:

Acoustic feature extraction

In this toolkit, we use the multi-resolution cochleagram (MRCG) [2] as the acoustic feature, implemented in MATLAB. Note that MRCG extraction takes relatively long compared to running the classifier.
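
The multi-resolution idea behind MRCG can be sketched as follows: a base time-frequency feature is smoothed along time at several window sizes, and the resolutions are concatenated. This is only a structural illustration with a random toy "cochleagram", not the actual MRCG computation (which uses gammatone filtering and specific window choices from [2]):

```python
import numpy as np

def smooth_time(feat, win):
    """Moving-average smoothing along the time axis (axis 0)."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, feat)

def multi_resolution(feat, wins=(1, 11, 23)):
    """Concatenate the base feature with smoothed copies,
    mimicking the multi-resolution structure of MRCG.
    The window sizes here are illustrative, not the ones from [2]."""
    return np.concatenate([smooth_time(feat, w) for w in wins], axis=1)

base = np.random.rand(100, 64)   # toy feature: 100 frames x 64 channels
mrcg_like = multi_resolution(base)
print(mrcg_like.shape)           # (100, 192)
```

Each added resolution trades temporal detail for context, which is what lets a frame-level classifier see both fine and coarse spectro-temporal structure.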

Classifier

This toolkit supports four types of MRCG-based classifiers, implemented in Python with TensorFlow:

  1. Adaptive context attention model (ACAM)
  2. Boosted deep neural network (bDNN) [2]
  3. Deep neural network (DNN) [2]
  4. Long short-term memory recurrent neural network (LSTM-RNN) [3]
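
The bDNN of [2] predicts the labels of all frames inside a context window and averages the overlapping predictions at inference time. A minimal sketch of that merging step (the window radius `w` and the array layout are assumptions, not the toolkit's exact code):

```python
import numpy as np

def bdnn_merge(pred, w):
    """Average overlapped per-window predictions into per-frame scores.

    pred[t, j] is the prediction that the network, fed the context
    centred on frame t, makes for frame t + (j - w).
    """
    T = pred.shape[0]
    scores = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        for j in range(2 * w + 1):
            k = t + j - w
            if 0 <= k < T:
                scores[k] += pred[t, j]
                counts[k] += 1
    return scores / counts

# every window votes 1.0 for every frame -> every frame scores 1.0
pred = np.ones((10, 5))          # T=10 frames, w=2
print(bdnn_merge(pred, 2))
```

Averaging many noisy per-window votes is what gives bDNN its "boosted" robustness over a plain frame-wise DNN.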

Prerequisites

  • Python 3

  • TensorFlow 1.1-1.3

  • MATLAB 2017b (will be deprecated)

Example

The default model provided with this toolkit was trained on our dataset, which is described in our paper. The example MATLAB script is main.m; just run it in MATLAB. The result will look like the following figure.

Note: to apply this toolkit to other speech data, the data should be sampled at 16 kHz.
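
If your recordings use a different rate, they can be resampled to 16 kHz first. A minimal sketch with SciPy (the function name `to_16k` is ours, not part of the toolkit):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(signal, orig_sr):
    """Resample a 1-D signal to 16 kHz with a polyphase filter."""
    if orig_sr == 16000:
        return signal
    g = np.gcd(orig_sr, 16000)
    return resample_poly(signal, 16000 // g, orig_sr // g)

x = np.random.randn(44100)       # one second at 44.1 kHz
y = to_16k(x, 44100)
print(len(y))                    # 16000
```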

(Figure: example VAD result produced by main.m)

Post processing

Many people have asked about post-processing, so it has been added.

In the py branch, you can find the relevant parameters of utils.vad_func in main.py.

Each parameter handles one of the following error types.

(Figure: the four VAD error types: FEC, MSC, OVER, NDS)

FEC: hang_before

MSC: off_on_length

OVER: hang_over

NDS: on_off_length

Note that there is no single optimal setting; the best parameter set depends on the application.
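
As a concrete illustration, the four parameters can be read as simple operations on a binary frame-decision sequence. This is a hedged re-implementation of the idea, not the exact code in utils.vad_func:

```python
import numpy as np

def segments(vad):
    """Return (start, end) index pairs of contiguous 1-runs."""
    padded = np.concatenate([[0], vad, [0]])
    diff = np.diff(padded)
    return list(zip(np.where(diff == 1)[0], np.where(diff == -1)[0]))

def post_process(vad, hang_before=0, hang_over=0,
                 off_on_length=0, on_off_length=0):
    vad = np.asarray(vad).copy()
    # MSC: bridge non-speech gaps shorter than off_on_length
    segs = segments(vad)
    for (s1, e1), (s2, _) in zip(segs, segs[1:]):
        if s2 - e1 < off_on_length:
            vad[e1:s2] = 1
    # NDS: drop speech segments shorter than on_off_length
    for s, e in segments(vad):
        if e - s < on_off_length:
            vad[s:e] = 0
    # FEC / OVER: extend each segment backwards and forwards
    out = vad.copy()
    for s, e in segments(vad):
        out[max(0, s - hang_before):s] = 1
        out[e:min(len(vad), e + hang_over)] = 1
    return out

raw = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0])
print(post_process(raw, hang_before=1, hang_over=1,
                   off_on_length=2, on_off_length=2))
# -> [0 1 1 1 1 1 1 1 0 0 0 0]
```

The short gap is bridged, the isolated one-frame burst is dropped, and the surviving segment is padded on both sides, matching the FEC/MSC/OVER/NDS taxonomy above.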

Enjoy.

Training

  1. A sample database is attached at 'path/to/project/data/raw'. Please refer to it to understand the data format.
  2. The model specifications are described in 'path/to/project/configure'.
  3. The training procedure has 2 steps: (i) MRCG extraction; (ii) Model training.

Note: do not forget to add this project's path in MATLAB.

# train.sh
# train script options
# m 0 : ACAM
# m 1 : bDNN
# m 2 : DNN
# m 3 : LSTM
# e : extract MRCG feature (1) or not (0)

python3 $train -m 0 -e 1 --prj_dir=$curdir

Recorded Dataset

Our recorded dataset is freely available: Download

Specification

  • Environments

Bus stop, construction site, park, and room.

  • Recording device

A smartphone (Samsung Galaxy S8)

In each environment, conversational speech between two Korean male speakers was recorded, and the ground-truth labels were annotated manually. Because the recording was carried out in the real world, unexpected noises such as a baby crying, chirping insects, and mouse clicks are included in the dataset. The details of the dataset are described in the following table:

               Bus stop   Cons. site   Park    Room    Overall
Dur. (min)     30.02      30.03        30.07   30.05   120.17
Avg. SNR (dB)  5.61       2.05         5.71    18.26   7.91
% of speech    40.12      26.71        26.85   30.44   31.03
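
Given frame-level ground-truth labels, the duration and speech-percentage rows of the table can be reproduced along these lines (the 10 ms frame hop is an assumption for illustration):

```python
import numpy as np

def label_stats(labels, hop_s=0.01):
    """Duration in minutes and % of speech frames for one recording."""
    labels = np.asarray(labels)
    duration_min = len(labels) * hop_s / 60.0
    speech_pct = 100.0 * labels.mean()
    return duration_min, speech_pct

labels = np.r_[np.ones(3000), np.zeros(7000)]   # toy 100 s recording
print(label_stats(labels))                       # about (1.67, 30.0)
```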

TODO List

  1. Although MRCG performs well, its extraction time is somewhat long, so we will substitute another feature, such as the spectrogram.
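
A log-magnitude spectrogram, one candidate replacement, is much cheaper to compute than MRCG. A minimal sketch with SciPy (the 512-point FFT and 10 ms hop are our assumptions, not settings from this toolkit):

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram(x, sr=16000, n_fft=512, hop=160):
    """Log-magnitude STFT with a 10 ms hop at 16 kHz."""
    _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return np.log(np.abs(Z) + 1e-8)

x = np.random.randn(16000)            # one second of audio
spec = log_spectrogram(x)
print(spec.shape)                     # (257, n_frames)
```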

Trouble Shooting

If you find any errors in the code, please contact us.

E-mail: [email protected]

Copyright

Copyright (c) 2017 Speech and Audio Information Laboratory, KAIST, South Korea

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.

[2] X.-L. Zhang and D. Wang, “Boosting contextual information for deep neural network based voice activity detection,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 2, pp. 252-264, 2016.

[3] R. Zazo Candil et al., “Feature learning with raw-waveform CLDNNs for voice activity detection,” in Proc. Interspeech, 2016.

Acknowledgement

Jaeseok Kim (KAIST) contributed to this project by converting the MATLAB scripts to Python.
