
vishalshar / SpeakerDiarization_RNN_CNN_LSTM

Licence: other
Speaker Diarization is the problem of separating speakers in an audio recording. There can be any number of speakers, and the final result should state when each speaker starts and stops speaking. In this project, we analyze a given audio file with 2 channels and 2 speakers (one per channel).

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to SpeakerDiarization RNN CNN LSTM

tiny-rnn
Lightweight C++11 library for building deep recurrent neural networks
Stars: ✭ 41 (-26.79%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Deepseqslam
The Official Deep Learning Framework for Route-based Place Recognition
Stars: ✭ 49 (-12.5%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
sgrnn
Tensorflow implementation of Synthetic Gradient for RNN (LSTM)
Stars: ✭ 40 (-28.57%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Lstm Human Activity Recognition
Human Activity Recognition example using TensorFlow on smartphone sensors dataset and an LSTM RNN. Classifying the type of movement amongst six activity categories - Guillaume Chevalier
Stars: ✭ 2,943 (+5155.36%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Linear Attention Recurrent Neural Network
A recurrent attention module consisting of an LSTM cell which can query its own past cell states by the means of windowed multi-head attention. The formulas are derived from the BN-LSTM and the Transformer Network. The LARNN cell with attention can be easily used inside a loop on the cell state, just like any other RNN. (LARNN)
Stars: ✭ 119 (+112.5%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
automatic-personality-prediction
[AAAI 2020] Modeling Personality with Attentive Networks and Contextual Embeddings
Stars: ✭ 43 (-23.21%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Rnnsharp
RNNSharp is a toolkit of deep recurrent neural networks widely used for many different kinds of tasks, such as sequence labeling, sequence-to-sequence, and so on. It's written in C# and based on .NET Framework 4.6 or above. RNNSharp supports many different types of networks, such as forward and bi-directional networks and sequence-to-sequence networks, and different types of layers, such as LSTM, Softmax, sampled Softmax, and others.
Stars: ✭ 277 (+394.64%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
sequence-rnn-py
Sequence analyzing using Recurrent Neural Networks (RNN) based on Keras
Stars: ✭ 28 (-50%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Pytorch Learners Tutorial
PyTorch tutorial for learners
Stars: ✭ 97 (+73.21%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Pytorch Pos Tagging
A tutorial on how to implement models for part-of-speech tagging using PyTorch and TorchText.
Stars: ✭ 96 (+71.43%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Bitcoin Price Prediction Using Lstm
Bitcoin price prediction (time series) using an LSTM recurrent neural network
Stars: ✭ 67 (+19.64%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Rnn ctc
Recurrent Neural Network and Long Short Term Memory (LSTM) with Connectionist Temporal Classification implemented in Theano. Includes a Toy training example.
Stars: ✭ 220 (+292.86%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Pytorch Kaldi
pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.
Stars: ✭ 2,097 (+3644.64%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+5630.36%)
Mutual labels:  recurrent-neural-networks, lstm, rnn
Human-Activity-Recognition
Human activity recognition using TensorFlow on smartphone sensors dataset and an LSTM RNN. Classifying the type of movement amongst six categories (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING).
Stars: ✭ 16 (-71.43%)
Mutual labels:  recurrent-neural-networks, rnn
Probabilistic-RNN-DA-Classifier
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Stars: ✭ 22 (-60.71%)
Mutual labels:  recurrent-neural-networks, rnn
lstm har
LSTM based human activity recognition using smart phone sensor dataset
Stars: ✭ 20 (-64.29%)
Mutual labels:  lstm, rnn
DrowsyDriverDetection
This is a project implementing Computer Vision and Deep Learning concepts to detect drowsiness of a driver and sound an alarm if drowsy.
Stars: ✭ 82 (+46.43%)
Mutual labels:  lstm, rnn
ACT
Alternative approach for Adaptive Computation Time in TensorFlow
Stars: ✭ 16 (-71.43%)
Mutual labels:  recurrent-neural-networks, rnn
dltf
Hands-on in-person workshop for Deep Learning with TensorFlow
Stars: ✭ 14 (-75%)
Mutual labels:  lstm, rnn

Citation

If you find our project helpful please cite our arxiv report below:

@misc{sharma2020speaker,
    title={Speaker Diarization: Using Recurrent Neural Networks},
    author={Vishal Sharma and Zekun Zhang and Zachary Neubert and Curtis Dyreson},
    year={2020},
    eprint={2006.05596},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

SpeakerDiarization

Speaker Diarization is the problem of separating speakers in an audio recording. There can be any number of speakers, and the final result should state when each speaker starts and stops speaking. In this project, we analyze a given audio file with 2 channels and 2 speakers (one per channel). We train neural networks to learn when a person is speaking. We use different types of neural networks, specifically a Single-Layer Perceptron (SLP), a Multi-Layer Perceptron (MLP), a Recurrent Neural Network (RNN), and a Convolutional Neural Network (CNN), and achieve 92% accuracy with the RNN.

Data

The data used in this project cannot be shared because of privacy concerns, but if you need to test this code, I can provide one sample file. Please email me for the sample data.

Dataset Description

Our dataset contains 37 audio files, each approximately 15 minutes long, with a sampling rate of 44100 samples/second, recorded on 2 channels with exactly 2 speakers on 2 different microphones. Each audio file has been hand-annotated with speaker timings: the times (in seconds) at which each speaker starts and stops speaking. We split this dataset into 3 parts for training, validation, and testing.

Preprocessing

Data Normalization

We normalize the audio files after observing that the recordings were not all on the same scale: a few audio files were louder than others, and normalization brings them all to the same scale.
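A minimal sketch of peak normalization, assuming stereo WAV input; the exact scheme used in the repo (per-file peak, per-channel, or RMS) is not specified, so the function here is illustrative:

```python
import numpy as np
from scipy.io import wavfile

def peak_normalize(path_in, path_out):
    """Rescale a recording so its loudest sample sits at full scale."""
    rate, audio = wavfile.read(path_in)      # stereo: shape (n_samples, 2)
    audio = audio.astype(np.float32)
    peak = np.max(np.abs(audio))
    if peak > 0:                             # avoid dividing a silent file by 0
        audio = audio / peak                 # samples now lie in [-1, 1]
    wavfile.write(path_out, rate, audio)     # written as 32-bit float WAV
```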

Sampling Audio

Because the sampling rate is high, we have a lot of data: a 15-minute audio file contains about 40M samples per channel. To reduce the data without losing much information, we downsample the audio files by keeping every 4th sample.
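A sketch of this step; `audio` below is stand-in data, and since the README does not say whether a low-pass filter is applied before decimating, both the naive and anti-aliased variants are shown:

```python
import numpy as np
from scipy import signal

rate = 44100
audio = np.random.randn(rate * 10, 2)      # stand-in for 10 s of stereo audio

# Keep every 4th sample, as described above: 44100 Hz -> 11025 Hz.
downsampled = audio[::4]

# A safer variant low-pass filters first to avoid aliasing:
downsampled_aa = signal.decimate(audio, 4, axis=0)
```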

Cleaning Labels

Provided labels needed some cleaning, described below (a minimal cleanup sketch follows the list):

  • Speaker names were not consistent throughout the data files; we cleaned them so each name is spelled consistently.
  • The files also contained stray Unicode characters, which tripped up Python's string handling and needed to be cleaned.
  • There were also misalignments in the data that needed to be removed or fixed.
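The speaker names and mapping below are hypothetical, since the real annotations are not public; this only illustrates the kind of cleanup involved:

```python
import unicodedata

# Hypothetical mapping from inconsistent spellings to one canonical name.
CANONICAL = {
    "spkr a": "Speaker_A", "speaker a": "Speaker_A",
    "spkr b": "Speaker_B", "speaker b": "Speaker_B",
}

def clean_label(raw_name):
    """Strip stray Unicode and normalize speaker names to one spelling."""
    # Fold non-ASCII characters down to plain ASCII (dropping what's left).
    ascii_name = (unicodedata.normalize("NFKD", raw_name)
                  .encode("ascii", "ignore").decode("ascii"))
    key = ascii_name.strip().lower()
    return CANONICAL.get(key, ascii_name.strip())
```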

Approach

Multi-layer Perceptron (MLP)

We start with a basic single-layer perceptron model. We implement 3 different models with hidden layers of 100, 200, and 500 neurons, achieving approximately 86% accuracy. We then move to multi-layer perceptron models that are 2 layers deep: one with 100 neurons in the first layer and 50 in the second, and two with more neurons (First Layer: 200, Second Layer: 100; First Layer: 300, Second Layer: 50). For all the networks used in this project, the hidden neurons use ReLU activations and the output neuron uses a sigmoid. The cost function is cross entropy, and the network is trained with mini-batch gradient descent using Adam optimization.

The code for the MLP is in the file MLP_1201_2.py.
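As a reference point, here is a minimal Keras sketch of the 100/50 configuration described above (illustrative only; the repo's actual implementation is in MLP_1201_2.py, and the input size `n_features` is an assumption):

```python
import tensorflow as tf

n_features = 1102  # assumed per-segment length (see the CNN section below)

# Two ReLU hidden layers (100 -> 50) and one sigmoid output neuron,
# trained with cross entropy and Adam, as described above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # speaking / not speaking
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```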

Recurrent Neural Network (RNN)

Next we try a Recurrent Neural Network on the classification problem. The RNN gives us the best result, with 3 layers of 150 Long Short-Term Memory (LSTM) cells each. Each LSTM block in the diagram denotes an LSTM layer consisting of 150 cells. The output has a single sigmoid neuron to predict 0 or 1.

LSTM Network Architecture
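A minimal Keras sketch of that stacked-LSTM classifier; the sequence length and feature dimension are assumptions, since the README does not spell out how segments are framed into sequences:

```python
import tensorflow as tf

timesteps, n_features = 100, 64  # assumed framing, for illustration

# Three stacked LSTM layers of 150 cells each, ending in one sigmoid neuron.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, n_features)),
    tf.keras.layers.LSTM(150, return_sequences=True),
    tf.keras.layers.LSTM(150, return_sequences=True),
    tf.keras.layers.LSTM(150),                       # last layer emits one vector
    tf.keras.layers.Dense(1, activation="sigmoid"),  # speaking / not speaking
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```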

Convolution Neural Network (CNN)

To apply a CNN, we first compute the spectrogram for each row of the data matrix and store the results in a new file using pickle. This way we don't need to compute spectrograms online during training, which saves a lot of time. The function scipy.signal.spectrogram is used to compute the spectrogram of each segment. The precomputed spectrograms of a channel are then organized into a 3-dimensional matrix with shape (number of segments, height, width). For example, if the downsampled data matrix of a channel returned by get_data has shape (100, 1102), i.e. 100 segments, the precomputed spectrogram matrix has shape (100, 129, 4). The number of segments remains the same; the height of 129 and width of 4 come from the default parameters of scipy.signal.spectrogram. Spectrogram matrices are computed and stored using the code in Spectrogram Generator.
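A sketch of that precomputation step, using stand-in data in place of the real get_data output:

```python
import pickle
import numpy as np
from scipy import signal

fs = 11025                                 # sampling rate after 4x downsampling
segments = np.random.randn(100, 1102)      # stand-in for one channel's data matrix

# With scipy.signal.spectrogram's defaults (nperseg=256, noverlap=32),
# each 1102-sample segment yields 129 frequency bins and 4 time bins,
# so stacking gives the (100, 129, 4) shape mentioned above.
specs = np.stack([signal.spectrogram(seg, fs=fs)[2] for seg in segments])
print(specs.shape)                         # (100, 129, 4)

with open("spectrograms.pkl", "wb") as f:  # cache so training skips this step
    pickle.dump(specs, f)
```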

CNN Network Architecture

Results

(Result plots for the MLP, CNN, and RNN models.)
