AVSR-tf1

Audio-Visual Speech Recognition (AVSR) research system using sequence-to-sequence neural networks, built on TensorFlow 1.13

About

AVSR-tf1 is an open-source research system for Speech Recognition.

Written entirely in Python, AVSR-tf1 aims to provide a simple and reproducible way of training and evaluating speech recognition models based on sequence-to-sequence neural networks. AVSR-tf1 can exploit both the auditory and visual speech modalities, considered either independently (ASR, VSR) or jointly (AVSR).

Rather than providing dense documentation for users and contributors, the AVSR-tf1 code is designed (or strives) to be intuitive and self-explanatory, encouraging researchers and developers to understand the entire codebase and propose improvements at its lowest levels. Hence, we intend it to be a flexible research system rather than a black box for production.

Core functionalities

1. Extract acoustic features from audio files (librosa, TensorFlow)

  • log mel-scale spectrograms, MFCC
  • optional computation of first and second derivatives
  • optional strided frame stacking
  • write into a TensorFlow-compatible format (TFRecord dataset); see the sketch after this list

2. Extract the lip region from video files (OpenFace - Tadas Baltrusaitis)

  • write into TensorFlow-compatible format (TFRecord dataset)

3. Train sequence-to-sequence neural networks for continuous speech recognition

  • audio-only (LAS [3])
  • visual-only (lip-reading [5])
  • audio-visual fusion
    • dual-attention decoder (WLAS [4])
    • attention-based alignment (AV-Align [6, 7])
  • flexible language units (phonemes, visemes, characters, etc.)

4. Evaluate models (a CER/WER sketch follows this list)

  • normalised Levenshtein distances
    • Character Error Rate
    • Word Error Rate
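
As a concrete illustration of steps 1 and 2, the sketch below extracts log mel-scale spectrograms with librosa, appends first and second derivatives, applies strided frame stacking, and serialises the result to a TFRecord file. The function names, window sizes, feature dimensions, and record keys are assumptions made for this example and do not necessarily match what write_records_tcd.py produces.

  # Minimal sketch: log mel features + deltas + strided frame stacking -> TFRecord.
  # Parameter values and the record layout are illustrative assumptions.
  import librosa
  import numpy as np
  import tensorflow as tf

  def extract_logmel(wav_path, sr=16000, n_mels=30, stack=3, stride=3):
      y, _ = librosa.load(wav_path, sr=sr)
      mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                           hop_length=160, n_mels=n_mels)
      logmel = librosa.power_to_db(mel)                 # (n_mels, frames)
      delta = librosa.feature.delta(logmel)             # first derivative
      delta2 = librosa.feature.delta(logmel, order=2)   # second derivative
      feats = np.vstack([logmel, delta, delta2]).T      # (frames, 3 * n_mels)
      # strided frame stacking: concatenate `stack` consecutive frames every `stride` frames
      stacked = [feats[i:i + stack].reshape(-1)
                 for i in range(0, len(feats) - stack + 1, stride)]
      return np.asarray(stacked, dtype=np.float32)

  def write_tfrecord(utterances, record_path):
      # utterances: iterable of (utterance_id, wav_path, transcript)
      with tf.python_io.TFRecordWriter(record_path) as writer:
          for uid, wav_path, transcript in utterances:
              feats = extract_logmel(wav_path)
              example = tf.train.Example(features=tf.train.Features(feature={
                  'id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[uid.encode()])),
                  'inputs': tf.train.Feature(bytes_list=tf.train.BytesList(value=[feats.tobytes()])),
                  'input_length': tf.train.Feature(int64_list=tf.train.Int64List(value=[feats.shape[0]])),
                  'labels': tf.train.Feature(bytes_list=tf.train.BytesList(value=[transcript.encode()])),
              }))
              writer.write(example.SerializeToString())

Step 4 boils down to a normalised Levenshtein distance computed over characters (CER) or whitespace-split words (WER). A minimal implementation could look like the following; it is not the repository's exact scoring code.

  # Edit-distance based CER/WER, normalised by the reference length.
  def levenshtein(ref, hyp):
      prev = list(range(len(hyp) + 1))
      for i, r in enumerate(ref, 1):
          curr = [i]
          for j, h in enumerate(hyp, 1):
              curr.append(min(prev[j] + 1,              # deletion
                              curr[j - 1] + 1,          # insertion
                              prev[j - 1] + (r != h)))  # substitution
          prev = curr
      return prev[-1]

  def cer(ref, hyp):
      return levenshtein(list(ref), list(hyp)) / max(len(ref), 1)

  def wer(ref, hyp):
      return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)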

Getting started

A typical workflow is as follows:

  1. convert data into .tfrecord files
  2. train/evaluate models

Please refer to the included examples for running audio-only, visual-only, or audio-visual speech recognition experiments.

To prepare the data, use the two scripts extract_faces.py and write_records_tcd.py.
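
For orientation, the sketch below shows how TFRecords like the ones produced above could be read back with tf.data and fed to a simple attention-based (LAS-style) encoder-decoder built from tf.contrib.seq2seq in TensorFlow 1.13, corresponding to step 3 of the core functionalities. The file name, feature dimension, vocabulary size, and layer sizes are assumptions made for the example; the actual input pipeline and model definitions in AVSR-tf1 may differ.

  import tensorflow as tf

  FEATURE_DIM = 270   # assumed stacked feature size (3 frames x 3 x 30 mel bands)
  VOCAB_SIZE = 40     # assumed number of output characters

  def parse_example(serialized):
      parsed = tf.parse_single_example(serialized, features={
          'inputs': tf.FixedLenFeature([], tf.string),
          'input_length': tf.FixedLenFeature([], tf.int64),
          'labels': tf.FixedLenFeature([], tf.string),
      })
      inputs = tf.reshape(tf.decode_raw(parsed['inputs'], tf.float32), [-1, FEATURE_DIM])
      return inputs, tf.cast(parsed['input_length'], tf.int32), parsed['labels']

  dataset = (tf.data.TFRecordDataset('train.tfrecord')   # assumed file name
             .map(parse_example)
             .padded_batch(8, padded_shapes=([None, FEATURE_DIM], [], [])))
  inputs, input_lengths, transcripts = dataset.make_one_shot_iterator().get_next()

  # Listen (encoder): an LSTM over the acoustic features
  encoder_cell = tf.nn.rnn_cell.LSTMCell(256)
  encoder_out, _ = tf.nn.dynamic_rnn(encoder_cell, inputs,
                                     sequence_length=input_lengths, dtype=tf.float32)

  # Attend and Spell (decoder): Bahdanau attention over the encoder outputs
  attention = tf.contrib.seq2seq.BahdanauAttention(
      num_units=256, memory=encoder_out, memory_sequence_length=input_lengths)
  decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
      tf.nn.rnn_cell.LSTMCell(256), attention, attention_layer_size=256)

  # Target character ids would come from mapping `transcripts` through a
  # vocabulary; placeholders keep the sketch short.
  target_ids = tf.placeholder(tf.int32, [None, None])
  target_lengths = tf.placeholder(tf.int32, [None])
  embeddings = tf.get_variable('char_embeddings', [VOCAB_SIZE, 64])
  helper = tf.contrib.seq2seq.TrainingHelper(
      tf.nn.embedding_lookup(embeddings, target_ids), target_lengths)
  decoder = tf.contrib.seq2seq.BasicDecoder(
      decoder_cell, helper,
      initial_state=decoder_cell.zero_state(tf.shape(inputs)[0], tf.float32),
      output_layer=tf.layers.Dense(VOCAB_SIZE))
  outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
  logits = outputs.rnn_output   # feed into tf.contrib.seq2seq.sequence_loss to train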

Dependencies

For visual and audio-visual experiments, please compile and install OpenFace from source.

The other dependencies are popular, easy-to-install Python packages, so feel free to use your preferred sources.

The supported TensorFlow version for this repository is 1.13.1, and the recommended way to install it is: pip install tensorflow_gpu==1.13.1.
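
A quick sanity check after installation, assuming a working CUDA/cuDNN setup for the GPU build:

  import tensorflow as tf

  print(tf.__version__)               # expected: 1.13.1
  print(tf.test.is_gpu_available())   # True if a CUDA-capable GPU is visible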

Please get in touch in case you face any issues.

Acknowledgements

We are grateful to Eugene Brevdo of Google for his remarkable help and advice during the early stages of development. In addition, we would like to thank Derek Murray, Andreas Steiner, and Khe Chai Sim for their assistance and interesting conversations, as well as every TensorFlow contributor on GitHub and StackOverflow. Our work is supported by NVIDIA, which granted us a Titan Xp GPU through its academic program.

How to cite

If you use this work, please cite it as:

George Sterpu, Christian Saam, and Naomi Harte. How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020. https://doi.org/10.1109/TASLP.2020.2980436

[bib]

@ARTICLE{Sterpu2020,
  author={G. {Sterpu} and C. {Saam} and N. {Harte}},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition},
  year={2020},
  pages={1-1},
}

[pdf]

or

George Sterpu, Christian Saam, and Naomi Harte. 2018. Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition. In 2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3242969.3243014

[bib]

@inproceedings{sterpu_icmi18,
  author = {George Sterpu and Christian Saam and Naomi Harte},
  title = {Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition},
  year = {2018},
  publisher = {{ACM, New York, NY, USA}},
  booktitle = {2018 International Conference on Multimodal Interaction (ICMI '18), October 16--20, 2018, Boulder, CO, USA},
  url       = {http://doi.acm.org/10.1145/3242969.3243014},
  doi       = {10.1145/3242969.3243014},
}

[pdf]

How to contribute

We are delighted to receive your feedback and help in improving AVSR-tf1. On the technical side, this could be advice or a pull request on code refactoring (we are not Python/TensorFlow experts), implementations of popular features, bug reports, performance improvements, language models, or support for 16-bit precision computation or Google TPU devices.

References

[1] Sequence to Sequence Learning with Neural Networks https://arxiv.org/abs/1409.3215

[2] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

[3] Listen, Attend and Spell https://arxiv.org/abs/1508.01211

[4] Lip Reading Sentences in the Wild https://arxiv.org/abs/1611.05358

[5] Can DNNs Learn to Lipread Full Sentences? https://arxiv.org/abs/1805.11685

[6] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition https://arxiv.org/abs/1809.01728

[7] How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition https://ieeexplore.ieee.org/document/9035650
