
AlexGidiotis / Multimodal-Gesture-Recognition-with-LSTMs-and-CTC

License: MIT
An end-to-end system that performs temporal recognition of gesture sequences using speech and skeletal input. The model combines three LSTM networks with a CTC output layer that recognises gestures from two continuous input streams.

Programming Languages

python

Projects that are alternatives to or similar to Multimodal-Gesture-Recognition-with-LSTMs-and-CTC

Kerasdeepspeech
A Keras CTC implementation of Baidu's DeepSpeech for model experimentation
Stars: ✭ 245 (+880%)
Mutual labels:  speech, ctc
torch-asg
Auto Segmentation Criterion (ASG) implemented in pytorch
Stars: ✭ 42 (+68%)
Mutual labels:  speech, ctc
Neural sp
End-to-end ASR/LM implementation with PyTorch
Stars: ✭ 408 (+1532%)
Mutual labels:  speech, ctc
speech recognition ctc
Chinese speech recognition using CTC, implemented in Keras
Stars: ✭ 40 (+60%)
Mutual labels:  speech, ctc
Pytorch Asr
ASR with PyTorch
Stars: ✭ 124 (+396%)
Mutual labels:  speech, ctc
Volute
Raspberry Pi + Nodejs = Speech Robot
Stars: ✭ 224 (+796%)
Mutual labels:  speech
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+880%)
Mutual labels:  speech
Speech Enhancement
Deep learning for audio denoising
Stars: ✭ 207 (+728%)
Mutual labels:  speech
Naver-AI-Hackathon-Speech
2019 Clova AI Hackathon : Speech - Rank 12 / Team Kai.Lib
Stars: ✭ 26 (+4%)
Mutual labels:  speech
Speechbrain.github.io
The SpeechBrain project aims to build a novel speech toolkit fully based on PyTorch. With SpeechBrain, users can easily create speech processing systems, including speech recognition (both HMM/DNN and end-to-end), speaker recognition, speech enhancement, speech separation, multi-microphone speech processing, and many others.
Stars: ✭ 242 (+868%)
Mutual labels:  speech
Neural Voice Cloning With Few Samples
Implementation of Neural Voice Cloning with Few Samples Research Paper by Baidu
Stars: ✭ 211 (+744%)
Mutual labels:  speech
Source separation
Deep learning based speech source separation using Pytorch
Stars: ✭ 226 (+804%)
Mutual labels:  speech
Voice Gender
Gender recognition by voice and speech analysis
Stars: ✭ 248 (+892%)
Mutual labels:  speech
Speech Denoiser
A speech denoise lv2 plugin based on RNNoise library
Stars: ✭ 220 (+780%)
Mutual labels:  speech
idear
🎙️ Handsfree Audio Development Interface
Stars: ✭ 84 (+236%)
Mutual labels:  speech
Tts Cube
End-2-end speech synthesis with recurrent neural networks
Stars: ✭ 213 (+752%)
Mutual labels:  speech
Tacotron pytorch
PyTorch implementation of Tacotron speech synthesis model.
Stars: ✭ 242 (+868%)
Mutual labels:  speech
browser-apis
🦄 Cool & Fun Browser Web APIs 🥳
Stars: ✭ 21 (-16%)
Mutual labels:  speech
Lhotse
Stars: ✭ 236 (+844%)
Mutual labels:  speech
Gcc Nmf
Real-time GCC-NMF Blind Speech Separation and Enhancement
Stars: ✭ 231 (+824%)
Mutual labels:  speech

Multimodal-Gesture-Recognition-with-LSTMs-and-CTC

This repository contains code for my diploma thesis, "Multimodal Gesture Recognition with the Use of Deep Learning".

Overview

An end-to-end system that performs temporal recognition of gesture sequences using speech and skeletal input. The model combines three LSTM networks with a CTC output layer that spots and classifies gestures in two continuous input streams.

The basic modules of the model are two bidirectional LSTMs (BLSTMs). The first extracts features from speech and the second from skeletal data. A third bidirectional LSTM then combines the uni-modal features and performs the gesture recognition.
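
As a rough illustration of this architecture, here is a minimal Keras sketch. It is not the repository's exact code: the layer sizes, input dimensions and names (speech_input, skeletal_input, etc.) are placeholder assumptions, and it assumes the two streams have already been resampled to a common frame rate.

    from keras.models import Model
    from keras.layers import Input, LSTM, Bidirectional, Dense, TimeDistributed, concatenate

    # Placeholder dimensions: 39 MFCC-based audio features, an assumed number of
    # skeletal features, and 20 ChaLearn gesture classes plus the CTC blank.
    AUDIO_DIM, SKEL_DIM, N_CLASSES = 39, 30, 21

    # Uni-modal BLSTM feature extractors.
    speech_input = Input(shape=(None, AUDIO_DIM), name='speech_input')
    speech_feats = Bidirectional(LSTM(128, return_sequences=True))(speech_input)

    skeletal_input = Input(shape=(None, SKEL_DIM), name='skeletal_input')
    skeletal_feats = Bidirectional(LSTM(128, return_sequences=True))(skeletal_input)

    # Fusion BLSTM over the concatenated uni-modal features.
    fused = concatenate([speech_feats, skeletal_feats])
    fused = Bidirectional(LSTM(128, return_sequences=True))(fused)

    # Per-frame class posteriors (including the CTC blank) that feed the CTC output layer.
    y_pred = TimeDistributed(Dense(N_CLASSES, activation='softmax'))(fused)
    model = Model(inputs=[speech_input, skeletal_input], outputs=y_pred)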

Here we provide code for:

a) A BLSTM network for speech recognition.

b) A BLSTM network for skeletal recognition.

c) A BLSTM network that fuses the two uni-modal networks.

d) An implementation of the CTC loss output.

e) Decoders for the different networks.

f) Sample code for skeletal and speech feature extraction.

We used Keras and TensorFlow to build our model.
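
For items (d) and (e) above, the usual Keras pattern is to compute the CTC loss in a Lambda layer around K.ctc_batch_cost and to decode with K.ctc_decode. The snippet below is a generic sketch of that pattern, building on the model sketched in the Overview; it is not the repository's exact implementation, and the tensor names are assumptions.

    from keras import backend as K
    from keras.models import Model
    from keras.layers import Input, Lambda

    def ctc_lambda(args):
        # y_pred: (batch, time, n_classes) softmax outputs of the fusion BLSTM.
        y_pred, labels, input_length, label_length = args
        return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

    labels = Input(name='labels', shape=(None,), dtype='float32')
    input_length = Input(name='input_length', shape=(1,), dtype='int64')
    label_length = Input(name='label_length', shape=(1,), dtype='int64')

    # y_pred, speech_input and skeletal_input come from the fusion model sketched above.
    loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')(
        [y_pred, labels, input_length, label_length])
    train_model = Model(inputs=[speech_input, skeletal_input,
                                labels, input_length, label_length],
                        outputs=loss_out)
    # The Lambda layer already returns the loss, so the compiled loss just passes it through.
    train_model.compile(optimizer='adam', loss={'ctc': lambda y_true, y_pred: y_pred})

    def greedy_decode(probs, seq_lengths):
        # probs: (batch, time, n_classes) softmax outputs of a trained model;
        # seq_lengths: number of frames per sample.
        decoded, _ = K.ctc_decode(probs, seq_lengths, greedy=True)
        return K.get_value(decoded[0])  # best-path label sequences, padded with -1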

This project was built for the ChaLearn 2013 dataset. We trained and tested the model on the challenge data, which can be downloaded here: http://sunai.uoc.edu/chalearn/#tabs-2

This model achieves 94% accuracy on the test set of the ChaLearn 2013 challenge.

Usage

In order to train the models provided here, you first need to preprocess the data:

  1. MFCC features need to be extracted from the audio .wav files. We used 13 MFCC features as well as their first- and second-order derivatives (39 features in total). We used the HTK toolkit to extract the features; here we just provide the configuration file for HCopy (the feature extraction tool of HTK). If you want to use HTK for this purpose, you can find it at http://htk.eng.cam.ac.uk/ (a Python alternative for the MFCC extraction is sketched after this list).

  2. Once the MFCC features are extracted, put all of the training data into one big CSV file along with the labels (do the same for the validation and test data); you are then ready to train the speech LSTM network.

  3. For the skeletal features, provide the joint positions for each recording in its own CSV file and run the following scripts:

    a) extract_activity_feats.py

    b) gather_skeletal.py

    c) skeletal_feature_extraction.py

  4. Run util/mix_data.py to mix some of the dev data into the training set.

  5. Now you are ready to train the skeletal LSTM network.

  6. Once both networks are trained you can train the multimodal fusion network.

  7. Use the sequence_decoding.py script to evaluate the trained model with test data.
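
Step 1 above uses HTK's HCopy with the provided configuration file. If you would rather stay in Python, a roughly equivalent 39-dimensional feature vector (13 MFCCs plus first- and second-order derivatives) can be computed with librosa. This is only a sketch under assumed parameters (sample rate, default frame settings) and is not guaranteed to match the HTK configuration exactly.

    import librosa
    import numpy as np

    def extract_mfcc_39(wav_path, sr=16000, n_mfcc=13):
        """13 MFCCs plus first- and second-order derivatives: 39 features per frame."""
        # sr=16000 is an assumption; use the actual sample rate of the ChaLearn audio.
        signal, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (13, n_frames)
        delta = librosa.feature.delta(mfcc)                         # first-order derivatives
        delta2 = librosa.feature.delta(mfcc, order=2)               # second-order derivatives
        return np.vstack([mfcc, delta, delta2]).T                   # (n_frames, 39)

    # Example (placeholder path): an (n_frames, 39) array ready for the speech BLSTM.
    feats = extract_mfcc_39('path/to/recording.wav')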

Training the complete system takes approximately 100 hours on an NVIDIA GTX 1060.

Requirements

Run pip install -r requirements.txt to install the requirements.
