
lordmartian / deep_avsr

Licence: MIT license
A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to deep_avsr

kaldi-long-audio-alignment
Long audio alignment using Kaldi
Stars: ✭ 21 (-79.81%)
Mutual labels:  speech-recognition, automatic-speech-recognition, speech-to-text
demo vietasr
Vietnamese Speech Recognition
Stars: ✭ 22 (-78.85%)
Mutual labels:  speech-recognition, automatic-speech-recognition, speech-to-text
sova-asr
SOVA ASR (Automatic Speech Recognition)
Stars: ✭ 123 (+18.27%)
Mutual labels:  speech-recognition, automatic-speech-recognition, speech-to-text
leopard
On-device speech-to-text engine powered by deep learning
Stars: ✭ 354 (+240.38%)
Mutual labels:  speech-recognition, automatic-speech-recognition, speech-to-text
2018-dlsl
UPC Deep Learning for Speech and Language 2018
Stars: ✭ 18 (-82.69%)
Mutual labels:  speech-recognition, automatic-speech-recognition
rnnt decoder cuda
An efficient implementation of RNN-T Prefix Beam Search in C++/CUDA.
Stars: ✭ 60 (-42.31%)
Mutual labels:  speech-recognition, speech-to-text
Chinese-automatic-speech-recognition
Chinese speech recognition
Stars: ✭ 147 (+41.35%)
Mutual labels:  speech-recognition, speech-to-text
Unity live caption
Use Google Speech-to-Text API to do real-time live stream caption on Unity! Best when combined with your virtual character!
Stars: ✭ 26 (-75%)
Mutual labels:  speech-recognition, speech-to-text
speechmatics-python
Python library and CLI for Speechmatics
Stars: ✭ 24 (-76.92%)
Mutual labels:  speech-recognition, speech-to-text
DeepSpeech-API
The code enables users to use Mozilla's Deep Speech model over the Web Browser.
Stars: ✭ 31 (-70.19%)
Mutual labels:  speech-recognition, speech-to-text
spokestack-ios
Spokestack: give your iOS app a voice interface!
Stars: ✭ 27 (-74.04%)
Mutual labels:  speech-recognition, speech-to-text
speechrec
a simple speech recognition app using the Web Speech API Interfaces
Stars: ✭ 18 (-82.69%)
Mutual labels:  speech-recognition, speech-to-text
hf-experiments
Experiments with Hugging Face 🔬 🤗
Stars: ✭ 37 (-64.42%)
Mutual labels:  speech-recognition, automatic-speech-recognition
AmazonSpeechTranslator
End-to-end Solution for Speech Recognition, Text Translation, and Text-to-Speech for iOS using Amazon Translate and Amazon Polly as AWS Machine Learning managed services.
Stars: ✭ 50 (-51.92%)
Mutual labels:  speech-recognition, speech-to-text
web-speech-cognitive-services
Polyfill Web Speech API with Cognitive Services Bing Speech for both speech-to-text and text-to-speech service.
Stars: ✭ 35 (-66.35%)
Mutual labels:  speech-recognition, speech-to-text
Inimesed
An Android app that lets you search your contacts by voice. Internet not required. Based on Pocketsphinx. Uses Estonian acoustic models.
Stars: ✭ 65 (-37.5%)
Mutual labels:  speech-recognition, speech-to-text
scripty
Speech to text bot for Discord using Mozilla's DeepSpeech
Stars: ✭ 14 (-86.54%)
Mutual labels:  speech-recognition, speech-to-text
kaldi ag training
Docker image and scripts for training finetuned or completely personal Kaldi speech models. Particularly for use with kaldi-active-grammar.
Stars: ✭ 14 (-86.54%)
Mutual labels:  speech-recognition, speech-to-text
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+708.65%)
Mutual labels:  speech-recognition, speech-to-text
simple-obs-stt
Speech-to-text and keyboard input captions for OBS.
Stars: ✭ 89 (-14.42%)
Mutual labels:  speech-recognition, speech-to-text

Deep Audio-Visual Speech Recognition

The repository contains a PyTorch reproduction of the TM-CTC model from the Deep Audio-Visual Speech Recognition paper. We train three models, Audio-Only (AO), Video-Only (VO), and Audio-Visual (AV), on the LRS2 dataset for the speech-to-text transcription task.

Requirements

System packages:

ffmpeg==2.8.15
python==3.6.9

Python packages:

editdistance==0.5.3
matplotlib==3.1.1
numpy==1.18.1
opencv-python==4.2.0
pytorch==1.2.0
scipy==1.3.1
tqdm==4.42.1

CUDA 10.0 (if NVIDIA GPU is to be used):

cudatoolkit==10.0
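
As a quick, optional sanity check (not part of the repository), the snippet below confirms that the installed PyTorch build can see a CUDA device before you start a long run; it uses only the standard PyTorch API.

```python
# Optional environment check before GPU training.
import torch

print("PyTorch version:", torch.__version__)
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; training would fall back to the CPU (very slow).")
```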

Project Structure

The structure of the audio_only, video_only and audio_visual directories is as follows:

Directories

/checkpoints: Temporary directory to store intermediate model weights and plots while training. Gets automatically created.

/data: Directory containing the LRS2 Main and Pretrain dataset class definitions and other required data-related utility functions.

/final: Directory to store the final trained model weights and plots. If available, place the pre-trained model weights in the models subdirectory.

/models: Directory containing the class definitions for the models.

/utils: Directory containing function definitions for calculating CER/WER, greedy search/beam search decoders and preprocessing of data samples. Also contains functions to train and evaluate the model.
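
As a rough illustration of the WER metric computed by these utilities, here is a minimal word-level WER function built on the editdistance package from the requirements; the actual functions in /utils may differ in signature and edge-case handling.

```python
import editdistance

def compute_wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0
    return editdistance.eval(ref_words, hyp_words) / len(ref_words)

# Example: one substitution over five reference words -> WER of 0.2
print(compute_wer("the quick brown fox jumps", "the quick brown fox leaps"))
```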

Files

checker.py: File containing checker/debug functions for testing all the modules and the functions in the project as well as any other checks to be performed.

config.py: File to set the configuration options and hyperparameter values.
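
For orientation only, here is a hypothetical excerpt of the kind of options config.py holds. Apart from PRETRAIN_NUM_WORDS, which is referenced elsewhere in this README, the key names and paths below are illustrative and may not match the real file.

```python
# Hypothetical illustration; consult config.py for the actual option names and values.
args = {
    "PRETRAIN_NUM_WORDS": 1,    # curriculum learning word count (see Important Training Details)
    "BATCH_SIZE": 32,           # default minibatch size (key name assumed)
    "TRAINED_FRONTEND_FILE": "final/models/visual_frontend.pt",  # hypothetical path to Visual Frontend weights
    "TRAINED_LM_FILE": "final/models/language_model.pt",         # hypothetical path to Language Model weights
}
```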

demo.py: Python script for generating predictions with the specified trained model for all the data samples in the specified demo directory.

preprocess.py: Python script for preprocessing all the data samples in the dataset.
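
To give a flavour of the preprocessing involved, here is a hedged sketch that extracts a mono WAV track from a dataset video using the ffmpeg binary listed in the requirements. The real preprocess.py may do more (for example, preparing visual inputs for the Visual Frontend), and the 16 kHz sample rate is an assumption.

```python
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a 16 kHz mono WAV from a video file using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

extract_audio("example/sample.mp4", "example/sample.wav")  # hypothetical paths
```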

pretrain.py: Python script for pretraining the model on the pretrain set of the LRS2 dataset using curriculum learning.

test.py: Python script to test the trained model on the test set of the LRS2 dataset.

train.py: Python script to train the model on the train set of the LRS2 dataset.

Results

We report the Word Error Rate (WER) achieved by the models on the test set of the LRS2 dataset with both Greedy Search and Beam Search (with Language Model) decoding. We test with clean audio and with noisy audio (0 dB SNR). We also report the WER of the Audio-Visual model when only one of the two modalities is used.

                          AO/VO Model             AV Model
Operation      Mode       Greedy    Beam (+LM)    Greedy    Beam (+LM)
Clean Audio    AO         11.4%     8.3%          12.0%     8.2%
Clean Audio    VO         61.8%     55.3%         56.3%     49.2%
Clean Audio    AV         -         -             10.3%     6.8%
Noisy Audio    AO         62.5%     54.0%         59.0%     50.7%
Noisy Audio    AV         -         -             29.1%     22.1%

Pre-trained Weights

Download the pre-trained weights for the Visual Frontend from here and for the Language Model from here. Once the Visual Frontend and Language Model weights are downloaded, place them in a folder and add their paths in the config.py file.

For the pre-trained weights of the AO, VO and AV models, please send an email to smeet.shah.c2020<AT>iitbombay.org from your institutional email ID. Place the weights of each model in the corresponding /final/models directory.

Note:

  • Replace <AT> with @ in the email ID when sending an email. It may take a few days for me to reply; please be patient.
  • Do NOT open issues on GitHub requesting pre-trained weights. Such issues may be deleted. I am sharing weights only via replies to email requests from valid institutional email IDs.
  • The link for these weights has also been updated. If required, please resend an email for the new link.

How To Use

If you plan to train the models, download the complete LRS2 dataset from here. For custom datasets, keep the specifications and folder structure similar to those of the LRS2 dataset.

Steps have been provided to either train the models or to use the trained models directly for inference:

Training

Set the configuration options in the config.py file before each of the following steps as required. Comments have been provided for each option. Also, check the Important Training Details section below as a guide for training the models from scratch.

  1. Run the preprocess.py script to preprocess and generate the required files for each sample.

  2. Run the pretrain.py script for one iteration of curriculum learning. Repeat this step, changing the PRETRAIN_NUM_WORDS option in the config.py file each time, to perform the remaining iterations of curriculum learning (a sketch of the full sequence follows this list).

  3. Run the train.py script to finally train the model on the train set.

  4. Once the model is trained, run the test.py script to obtain the performance of the trained model on the test set.

  5. Run the demo.py script to use the model to make predictions for each sample in a demo directory. Read the specifications for the sample in the demo.py file.
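
The sketch below strings the training steps together, including the curriculum-learning repetitions from step 2. It is only an illustration of the intended sequence: the repository expects you to edit config.py between runs, so the loop simply marks where PRETRAIN_NUM_WORDS would change.

```python
import subprocess

def run(script: str) -> None:
    """Run one of the repository's scripts as a subprocess."""
    subprocess.run(["python", script], check=True)

run("preprocess.py")                                 # step 1: preprocess every sample

CURRICULUM = [1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37]  # values listed under Important Training Details
for num_words in CURRICULUM:
    # Step 2: in practice, set PRETRAIN_NUM_WORDS = num_words in config.py before this call.
    print(f"Curriculum iteration with PRETRAIN_NUM_WORDS = {num_words}")
    run("pretrain.py")

run("train.py")                                      # step 3: train on the train set
run("test.py")                                       # step 4: evaluate on the test set
run("demo.py")                                       # step 5: predictions for the demo directory
```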

Inference

  1. Set the configuration options in the config.py file. Comments have been provided for each option.

  2. Run the demo.py script to use the model to make predictions for each sample in a demo directory. Read the specifications for the sample in the demo.py file.

Important Training Details

  • We perform iterations of Curriculum Learning by changing the PRETRAIN_NUM_WORDS config option. The number of words used in each iteration is: 1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37, i.e., 11 iterations in total.

  • During Curriculum Learning, the minibatch size (default = 32) is reduced by half each time we hit an Out of Memory error (the first sketch after this list shows one way to automate this).

  • In each iteration, the training is terminated forcefully once the validation set WER flattens. We also make sure that the Learning Rate has decreased to the minimum value before terminating the training.

  • We train the AO and VO models first. We then initialize the AV model with weights from the trained AO and VO models as follows: AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder (illustrated in the second sketch after this list).

  • The weights of the Audio and Video Encoders are fixed during AV model pretraining. The complete AV model is trained on the train set after the pretraining is complete.

  • We have used a GPU with 11 GB memory for our training. Each model took around 7 days for complete training.
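
The first sketch below shows one way to automate the batch-size halving described above. It is not the repository's code: catching PyTorch's out-of-memory error and retrying with a smaller minibatch is simply a common pattern, and with PyTorch 1.2 the error arrives as a RuntimeError whose message contains "out of memory". The training function and loader factory here are hypothetical placeholders.

```python
import torch

def train_one_epoch(model, loader):
    """Hypothetical placeholder for one training pass over the loader."""
    ...

def train_with_oom_backoff(model, dataset, make_loader, batch_size=32):
    """Retry training with half the minibatch size whenever CUDA runs out of memory."""
    while batch_size >= 1:
        try:
            loader = make_loader(dataset, batch_size)  # hypothetical DataLoader factory
            train_one_epoch(model, loader)
            return
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                                  # not an OOM error; re-raise
            torch.cuda.empty_cache()                   # release cached blocks before retrying
            batch_size //= 2
            print(f"CUDA OOM; retrying with batch size {batch_size}")
    raise RuntimeError("Ran out of memory even with batch size 1")
```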
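The second sketch illustrates the weight transfer described in the list above by copying matching parameter groups from the AO and VO checkpoints into an AV state dict. The file paths and the module-name prefixes (audioEncoder, videoEncoder, decoder) are assumptions for illustration; the real attribute names are defined in the model classes under /models.

```python
import torch

ao_state = torch.load("audio_only/final/models/ao_model.pt", map_location="cpu")  # hypothetical path
vo_state = torch.load("video_only/final/models/vo_model.pt", map_location="cpu")  # hypothetical path

av_state = {}
for name, tensor in ao_state.items():
    if name.startswith("audioEncoder."):   # AO Audio Encoder -> AV Audio Encoder
        av_state[name] = tensor
for name, tensor in vo_state.items():
    if name.startswith("videoEncoder."):   # VO Video Encoder -> AV Video Encoder
        av_state[name] = tensor
    elif name.startswith("decoder."):      # VO Video Decoder -> AV Joint Decoder
        av_state[name] = tensor

# av_model.load_state_dict(av_state, strict=False)  # load only the matched weights into the AV model
```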

References

  1. The pre-trained weights of the Visual Frontend and the Language Model have been obtained from the GitHub repository accompanying Afouras T. and Chung J. S., Deep Lip Reading: a comparison of models and an online application, 2018.

  2. The CTC beam search implementation is adapted from Harald Scheidl's CTC Decoding Algorithms GitHub repository.


PS: Please do not hesitate to raise an issue for any bugs, doubts, or suggestions. However, please be patient if it takes some time to reply. Happy Open Source-ing!! 😃
