
astorfi / Lip Reading Deeplearning

Licence: apache-2.0
🔓 Lip Reading - Cross Audio-Visual Recognition using 3D Architectures

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to Lip Reading Deeplearning

Self Supervised Speech Recognition
speech to text with self-supervised learning based on wav2vec 2.0 framework
Stars: ✭ 106 (-93.54%)
Mutual labels:  speech-recognition
Ml Road
Machine Learning Resources, Practice and Research
Stars: ✭ 1,776 (+8.23%)
Mutual labels:  speech-recognition
Project alias
Alias is a teachable “parasite” that is designed to give users more control over their smart assistants, both when it comes to customisation and privacy. Through a simple app the user can train Alias to react on a custom wake-word/sound, and once trained, Alias can take control over your home assistant by activating it for you.
Stars: ✭ 1,577 (-3.9%)
Mutual labels:  speech-recognition
Pansori
Tools for ASR Corpus Generation from Online Video
Stars: ✭ 106 (-93.54%)
Mutual labels:  speech-recognition
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+3296.83%)
Mutual labels:  speech-recognition
Rnn Transducer
MXNet implementation of RNN Transducer (Graves 2012): Sequence Transduction with Recurrent Neural Networks
Stars: ✭ 114 (-93.05%)
Mutual labels:  speech-recognition
Delta
DELTA is a deep learning based natural language and speech processing platform.
Stars: ✭ 1,479 (-9.87%)
Mutual labels:  speech-recognition
Chinese Speech To Text
Chinese Speech To Text Using Wavenet
Stars: ✭ 124 (-92.44%)
Mutual labels:  speech-recognition
Kalliope
Kalliope is a framework that will help you to create your own personal assistant.
Stars: ✭ 1,509 (-8.04%)
Mutual labels:  speech-recognition
Sounder
An intent recognizing algorithm to predict the intent of a given text.
Stars: ✭ 118 (-92.81%)
Mutual labels:  speech-recognition
E2e Asr
PyTorch Implementations for End-to-End Automatic Speech Recognition
Stars: ✭ 106 (-93.54%)
Mutual labels:  speech-recognition
Deepspeechrecognition
A Chinese Deep Speech Recognition System 包括基于深度学习的声学模型和基于深度学习的语言模型
Stars: ✭ 1,421 (-13.41%)
Mutual labels:  speech-recognition
Holobot
HoloBot is a reusable 3D interface that allows HoloLens & VR users to interact with any bot using Mixed Reality & Speech.
Stars: ✭ 114 (-93.05%)
Mutual labels:  speech-recognition
Bigcidian
Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.
Stars: ✭ 99 (-93.97%)
Mutual labels:  speech-recognition
Wer are we
Attempt at tracking states of the arts and recent results (bibliography) on speech recognition.
Stars: ✭ 1,684 (+2.62%)
Mutual labels:  speech-recognition
Ios ml
List of Machine Learning, AI, NLP solutions for iOS. The most recent version of this article can be found on my blog.
Stars: ✭ 1,409 (-14.14%)
Mutual labels:  speech-recognition
Kontinuousspeechrecognizer
A Kotlin Speech Recognizer that runs continuously and is triggered with an activation keyword
Stars: ✭ 113 (-93.11%)
Mutual labels:  speech-recognition
Keras Kaldi
Keras Interface for Kaldi ASR
Stars: ✭ 124 (-92.44%)
Mutual labels:  speech-recognition
Pytorch Asr
ASR with PyTorch
Stars: ✭ 124 (-92.44%)
Mutual labels:  speech-recognition
Nonautoreggenprogress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
Stars: ✭ 118 (-92.81%)
Mutual labels:  speech-recognition

Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks - Official Project Page

Badges: contributions welcome · open source · Coveralls test coverage (master branch) · Follow @amirsinatorfi on Twitter

This repository contains the TensorFlow code developed for the following paper:


The input pipeline must be prepared by the user. This code provides an implementation of coupled 3D Convolutional Neural Networks for audio-visual matching; lip reading is one specific application of this work.

If you use this code, please consider citing the following paper:

@article{torfi20173d,
  title={3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition},
  author={Torfi, Amirsina and Iranmanesh, Seyed Mehdi and Nasrabadi, Nasser and Dawson, Jeremy},
  journal={IEEE Access},
  year={2017},
  publisher={IEEE}
  }

Table of Contents

- DEMO
- General View
- The Problem and the Approach
- Code Implementation
- Input Pipeline for this work
- Architecture
- Training / Evaluation
- Results
- Disclaimer
- Contribution

DEMO

Training/Evaluation DEMO

(animated training/evaluation demo)

Lip Tracking DEMO

(animated lip tracking demo)

General View

Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the information extracted from one modality to improve the recognition ability of the other modality by complementing the missing information.

The Problem and the Approach

The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose a coupled 3D Convolutional Neural Network (CNN) architecture that maps both modalities into a representation space and evaluates the correspondence of audio-visual streams using the learned multimodal features.

How to leverage 3D Convolutional Neural Networks?

The proposed architecture incorporates spatial and temporal information jointly to effectively find the correlation between the temporal information of the different modalities. Using a relatively small network architecture and a much smaller dataset, our proposed method surpasses the performance of existing similar methods for audio-visual matching that use CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase performance.

Code Implementation

The input pipeline must be provided by the user. The rest of the implementation assumes a dataset that contains the utterance-based extracted features.

Lip Tracking

For lip tracking, the desired video must be fed in as the input. First, cd to the corresponding directory:

cd code/lip_tracking

Then run the dedicated Python script as below:

python VisualizeLip.py --input input_video_file_name.ext --output output_video_file_name.ext

Running the script extracts the lip motion by saving the mouth area of each frame and creates an output video with a rectangle around the mouth area for better visualization.

The required arguments are defined in the VisualizeLip.py file as follows:

import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
                help="path to input video file")
ap.add_argument("-o", "--output", required=True,
                help="path to output video file")
ap.add_argument("-f", "--fps", type=int, default=30,
                help="FPS of output video")
ap.add_argument("-c", "--codec", type=str, default="MJPG",
                help="codec of output video")
args = vars(ap.parse_args())

Some of the arguments have default values, so no further action is required for them.

Processing

In the visual section, the videos are post-processed to have a uniform frame rate of 30 fps. Then, face tracking and mouth area extraction are performed on the videos using the dlib library [dlib]. Finally, all mouth areas are resized to the same size and concatenated to form the input feature cube. The dataset does not contain any audio files; the audio is extracted from the videos using the FFmpeg framework [ffmpeg]. The processing pipeline is shown in the figure below.

readme_images/processing.gif
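
To make the per-frame mouth extraction step more concrete, here is a minimal Python sketch using dlib and OpenCV. It is only an illustration under stated assumptions (the 68-point landmark model file and the video file name are placeholders), not the repository's exact code in VisualizeLip.py.

# Illustrative sketch of the mouth extraction step (placeholder file names).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def extract_mouth(frame, size=(100, 60)):
    """Return a resized gray-scale crop of the mouth region, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    # Landmarks 48-67 outline the mouth in the 68-point model.
    points = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                       for i in range(48, 68)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(points)
    mouth = gray[y:y + h, x:x + w]
    return cv2.resize(mouth, size)  # (width, height) -> a 60x100 image

# Read frames from a 30 fps video and collect the mouth crops.
cap = cv2.VideoCapture("input_video_file_name.ext")
crops = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mouth = extract_mouth(frame)
    if mouth is not None:
        crops.append(mouth)
cap.release()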

Input Pipeline for this work

The proposed architecture utilizes two non-identical ConvNets that take a pair of speech and video streams. The network input is a pair of features representing the lip movements and the speech features extracted from a 0.3-second video clip. The main task is to determine whether a stream of audio corresponds to a lip motion clip within the desired stream duration. The next two sub-sections explain the inputs for the speech and visual streams.

Speech Net

On the time axis, the temporal features are non-overlapping 20 ms windows used to generate spectrum features that possess a local characteristic. The input speech feature map, which is represented as an image cube, corresponds to the spectrogram as well as the first- and second-order derivatives of the MFEC features. These three channels correspond to the image depth. Collectively, 15 temporal feature sets (each forming 40 MFEC features) can be derived from a 0.3-second clip and form a speech feature cube. Each input feature map for a single audio stream thus has a dimensionality of 15 × 40 × 3. This representation is depicted in the following figure:

readme_images/Speech_GIF.gif

The speech features have been extracted using the [SpeechPy] package.

Please refer to code/speech_input/input_feature.py to get an idea of how the input pipeline works.
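
For illustration, the sketch below derives such a 15 × 40 × 3 cube with SpeechPy. It assumes SpeechPy's lmfe and extract_derivative_feature functions and uses a placeholder audio file name; treat it as a sketch rather than the exact pipeline in code/speech_input/input_feature.py.

# Illustrative sketch of building the 15x40x3 speech feature cube (assumed SpeechPy API).
import numpy as np
import scipy.io.wavfile as wav
import speechpy

fs, signal = wav.read("audio_extracted_from_video.wav")  # placeholder file name

# 40 log mel-filterbank energies (MFEC) over non-overlapping 20 ms windows.
mfec = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                             frame_length=0.020, frame_stride=0.020,
                             num_filters=40)

# Stack the static features with their first and second derivatives: (frames, 40, 3).
cube = speechpy.feature.extract_derivative_feature(mfec)

# A 0.3 s clip at 20 ms per window yields 15 frames -> one 15x40x3 network input.
clip = cube[:15]
print(clip.shape)  # (15, 40, 3)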

Visual Net

The frame rate of each video clip used in this effort is 30 fps. Consequently, 9 successive image frames form the 0.3-second visual stream. The input of the visual stream of the network is a cube of size 9 × 60 × 100, where 9 is the number of frames that represent the temporal information. Each channel is a 60 × 100 gray-scale image of the mouth region.

readme_images/lip_motion.jpg
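
As a small illustration of assembling this visual cube, the sketch below stacks nine successive 60 × 100 gray-scale mouth crops (for example, those produced by the lip-tracking step) into a single 9 × 60 × 100 array; the crops list is assumed to already exist.

# Illustrative sketch: stack 9 successive mouth crops into one (9, 60, 100) visual cube.
import numpy as np

def make_visual_cube(crops, start, frames=9):
    """Stack `frames` successive 60x100 mouth crops into one (frames, 60, 100) cube."""
    window = crops[start:start + frames]
    if len(window) < frames:
        raise ValueError("not enough frames for a 0.3 s clip")
    return np.stack(window, axis=0).astype(np.float32)

# Example: the first 0.3 s clip of a video, assuming `crops` holds 60x100 gray-scale frames.
# visual_cube = make_visual_cube(crops, start=0)   # shape: (9, 60, 100)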

Architecture

The architecture is a coupled 3D convolutional neural network in which two different networks with different sets of weights must be trained. For the visual network, the spatial information of the lip motions and the temporal information are incorporated jointly and fused for exploiting the temporal correlation. For the audio network, the extracted energy features are considered as the spatial dimension, and the stacked audio frames form the temporal dimension. In the proposed 3D CNN architecture, the convolutional operations are performed on successive temporal frames for both audio-visual streams.

readme_images/DNN-Coupled.png
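
The following is a deliberately simplified tf.keras sketch of the coupled-network idea, not the architecture used in the paper or in this repository: two towers with separate weights embed the audio cube (15 × 40 × 3) and the visual cube (9 × 60 × 100) into a shared space, and the distance between the embeddings scores the correspondence. Layer counts and kernel sizes are placeholders, and the audio tower is reduced to 2D convolutions for brevity, whereas the paper applies 3D convolutions to both streams.

# Simplified, illustrative coupled-network sketch (not the paper's architecture).
import tensorflow as tf
from tensorflow.keras import layers

def visual_tower():
    inp = layers.Input(shape=(9, 60, 100, 1))           # frames x H x W x channel
    x = layers.Conv3D(16, (3, 5, 5), activation="relu", padding="same")(inp)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling3D()(x)
    out = layers.Dense(64)(x)                            # visual embedding
    return tf.keras.Model(inp, out, name="visual_net")

def audio_tower():
    inp = layers.Input(shape=(15, 40, 3))                # time x MFEC x (static, delta, delta-delta)
    x = layers.Conv2D(16, (3, 5), activation="relu", padding="same")(inp)
    x = layers.MaxPool2D(pool_size=(1, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(64)(x)                            # audio embedding
    return tf.keras.Model(inp, out, name="audio_net")

video_in = layers.Input(shape=(9, 60, 100, 1), name="video")
audio_in = layers.Input(shape=(15, 40, 3), name="audio")
v_emb = visual_tower()(video_in)
a_emb = audio_tower()(audio_in)

# Euclidean distance between the two embeddings; a small distance means a matching pair.
distance = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-12)
)([v_emb, a_emb])

coupled = tf.keras.Model(inputs=[video_in, audio_in], outputs=distance)
coupled.summary()

Training such a coupled model would typically minimize a contrastive loss over matching and non-matching audio-visual pairs, pulling genuine pairs together and pushing impostor pairs apart.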

Training / Evaluation

First, clone the repository. Then, cd to the dedicated directory:

cd code/training_evaluation

Finally, the train.py file must be executed:

python train.py

For the evaluation phase, a similar script must be executed:

python test.py

Results

The results below demonstrate the effect of the proposed method on accuracy and on the speed of convergence.

(accuracy comparison figure)

The best result, which is the right-most one, belongs to our proposed method.

(convergence figure)

The effect of the proposed online pair selection method is shown in the figure above.

Disclaimer

The current version of the code does not contain the adaptive pair selection method proposed in the 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition paper. Only a simple pair selection with hard thresholding is included at the moment.
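
As a rough, hypothetical illustration of what simple hard-threshold pair selection can look like (not the repository's exact logic), the sketch below keeps all genuine pairs and only those non-matching pairs whose embedding distance falls below a fixed threshold, i.e., the hard negatives.

# Hypothetical illustration of hard-threshold pair selection (not the repository's code).
import numpy as np

def select_hard_negatives(audio_emb, video_emb, labels, threshold=1.0):
    """audio_emb, video_emb: (N, D) embeddings; labels: 1 = matching pair, 0 = non-matching."""
    distances = np.linalg.norm(audio_emb - video_emb, axis=1)
    keep_positive = labels == 1                            # always keep genuine pairs
    keep_hard_negative = (labels == 0) & (distances < threshold)  # "hard" negatives only
    keep = keep_positive | keep_hard_negative
    return audio_emb[keep], video_emb[keep], labels[keep]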

Contribution

We look forward to your feedback. Please help us improve the code and make our work better. To contribute, please create a pull request and we will review it promptly. Once again, we appreciate your feedback and code inspections.

References

[SpeechPy]
    Amirsina Torfi. astorfi/speech_feature_extraction: SpeechPy. June 2017. doi: 10.5281/zenodo.810392. https://doi.org/10.5281/zenodo.810391
[dlib]
    D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[ffmpeg]
    FFmpeg Developers. FFmpeg tool (version be1d324) [software], 2016.