
pandeydivesh15 / Avsr Deep Speech

Licence: gpl-2.0
Google Summer of Code 2017 Project: Development of Speech Recognition Module for Red Hen Lab

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Avsr Deep Speech

Automatic speech recognition
End-to-end Automatic Speech Recognition for Mandarin and English in Tensorflow
Stars: ✭ 2,751 (+6297.67%)
Mutual labels:  lstm, speech-recognition, audio
Swiftspeech
A speech recognition framework designed for SwiftUI.
Stars: ✭ 149 (+246.51%)
Mutual labels:  speech-recognition, audio
Audiomate
Python library for handling audio datasets.
Stars: ✭ 99 (+130.23%)
Mutual labels:  speech-recognition, audio
Crnn Audio Classification
UrbanSound classification using Convolutional Recurrent Networks in PyTorch
Stars: ✭ 235 (+446.51%)
Mutual labels:  lstm, audio
Pytorch Kaldi
pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.
Stars: ✭ 2,097 (+4776.74%)
Mutual labels:  lstm, speech-recognition
Audio Pretrained Model
A collection of Audio and Speech pre-trained models.
Stars: ✭ 61 (+41.86%)
Mutual labels:  speech-recognition, audio
Rnn ctc
Recurrent Neural Network and Long Short Term Memory (LSTM) with Connectionist Temporal Classification implemented in Theano. Includes a Toy training example.
Stars: ✭ 220 (+411.63%)
Mutual labels:  lstm, speech-recognition
Keras Sincnet
Keras (tensorflow) implementation of SincNet (Mirco Ravanelli, Yoshua Bengio - https://github.com/mravanelli/SincNet)
Stars: ✭ 47 (+9.3%)
Mutual labels:  speech-recognition, audio
Speech-Recognition
End-to-end Automatic Speech Recognition for Mandarin and English in Tensorflow
Stars: ✭ 21 (-51.16%)
Mutual labels:  lstm, speech-recognition
Rus-SpeechRecognition-LSTM-CTC-VoxForge
Russian-language speech recognition using Tensorflow, trained on the VoxForge corpus
Stars: ✭ 50 (+16.28%)
Mutual labels:  lstm, speech-recognition
Free Spoken Digit Dataset
A free audio dataset of spoken digits. Think MNIST for audio.
Stars: ✭ 396 (+820.93%)
Mutual labels:  speech-recognition, audio
Speech recognition
Speech recognition module for Python, supporting several engines and APIs, online and offline.
Stars: ✭ 5,999 (+13851.16%)
Mutual labels:  speech-recognition, audio
Vad
Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.
Stars: ✭ 622 (+1346.51%)
Mutual labels:  lstm, speech-recognition
Sincnet
SincNet is a neural architecture for efficiently processing raw audio samples.
Stars: ✭ 764 (+1676.74%)
Mutual labels:  speech-recognition, audio
Swiftysound
SwiftySound is a simple library that lets you play sounds with a single line of code.
Stars: ✭ 995 (+2213.95%)
Mutual labels:  audio
Char Rnn Keras
TensorFlow implementation of multi-layer recurrent neural networks for training and sampling from texts
Stars: ✭ 40 (-6.98%)
Mutual labels:  lstm
Audioutils
🎶 Audioutils – an audio recording and playback utility
Stars: ✭ 38 (-11.63%)
Mutual labels:  audio
Vchsm
C++ 11 algorithm implementation for voice conversion using harmonic plus stochastic models
Stars: ✭ 38 (-11.63%)
Mutual labels:  audio
Cpal
Cross-platform audio I/O library in pure Rust
Stars: ✭ 1,001 (+2227.91%)
Mutual labels:  audio
Pncc
An implementation of Power Normalized Cepstral Coefficients: PNCC
Stars: ✭ 40 (-6.98%)
Mutual labels:  speech-recognition

Audio and Visual Speech Recognition (AVSR) using Deep Learning

This is my Google Summer of Code 2017 Project with the Distributed Little Red Hen Lab.

The aim of this project is to develop a working Speech to Text module for the Red Hen Lab's current audio processing pipeline. The initial goal is to extend the current Deep Speech model (audio only) to Red Hen Lab's TV news video datasets.

News videos naturally carry both auditory and visual modalities, which makes a multi-modal Speech to Text model an appealing fit for these datasets. The next goal is therefore to build a multi-modal Speech to Text system (AVSR) by extracting visual features and concatenating them with the audio inputs.
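
Conceptually, this fusion amounts to concatenating the per-time-step visual feature vector onto the audio feature vector before the combined input reaches the recurrent network. A minimal NumPy sketch of that idea (the feature dimensions below are illustrative assumptions, not the project's actual sizes):

import numpy as np

# Assumed shapes: T time steps, 26 MFCC-style audio features,
# 50 visual (mouth-region) features per step -- illustrative only.
T = 100
audio_features = np.random.randn(T, 26)
visual_features = np.random.randn(T, 50)

# Multi-modal input: concatenate along the feature axis so each
# time step carries both modalities.
avsr_input = np.concatenate([audio_features, visual_features], axis=1)
print(avsr_input.shape)  # (100, 76)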

This project is based on the approach described in the Deep Speech paper. That paper addresses speech recognition using the audio modality only, so this project can be seen as an extension of the Deep Speech model.

Contents

  1. Getting Started
  2. Data-Preprocessing for Training
  3. Training
  4. Checkpointing
  5. Some Training Results
  6. Exporting model and Testing
  7. Running code at Case HPC
  8. Acknowledgments

Getting Started

Prerequisites

Installing

  • First, install Git Large File Storage (LFS) support and FFmpeg.
  • For video-based speech recognition (lip reading), you will also need OpenCV 3.x and dlib for Python.
  • Open a terminal and type the following commands.
     $ git clone https://github.com/pandeydivesh15/AVSR-Deep-Speech.git
     $ cd AVSR-Deep-Speech
     $ pip install -r requirements.txt 
    

Data-Preprocessing for Training

Please note that these data-preprocessing steps are only required if your training audio/video files are quite long (> 1 min). If you have access to shorter wav files (a few seconds each) and their associated transcripts, you will not need any data-preprocessing, but you must also have CSV files (see bin/import_ldc93s1.py for downloading an example dataset). If you only have longer audio/video files, the data-preprocessing steps below are recommended.
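
For orientation, upstream DeepSpeech importers produce CSV files whose rows point each short wav file to its transcript. The three-column layout below (wav_filename, wav_filesize, transcript) follows that convention, but check it against bin/import_ldc93s1.py before relying on it; the clip paths and transcripts are hypothetical:

import csv
import os

# Hypothetical short clips and transcripts -- the wav files must already exist.
clips = [
    ("data/clean_data/clip_0001.wav", "hello world"),
    ("data/clean_data/clip_0002.wav", "speech recognition"),
]

with open("data/clean_data/train.csv", "w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav_path, transcript in clips:
        writer.writerow([wav_path, os.path.getsize(wav_path), transcript])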

These steps require audio/video files and their associated time-aligned transcripts. Time-aligned transcripts for your audio/video can be produced with Gentle, Red Hen Lab's audio pipeline, or any other forced-alignment tool.

Store the time-aligned transcripts as JSON files (see the repository for an example of the expected format).
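
As a rough illustration only, a word-level alignment JSON might look like the sketch below; the field names are assumptions modeled on typical forced-aligner output, not a specification of what the preprocessing scripts expect:

import json
import os

os.makedirs("data/RHL_json", exist_ok=True)

# Each entry carries a word and its start/end time (in seconds) in the recording.
alignment = {
    "words": [
        {"word": "hello", "start": 0.35, "end": 0.62},
        {"word": "world", "start": 0.70, "end": 1.10},
    ]
}

with open("data/RHL_json/example.json", "w") as json_file:
    json.dump(alignment, json_file, indent=2)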

Note: By default, the project assumes that all .mp4 (video) files are kept at data/RHL_mp4, JSON files at data/RHL_json, and all wav files at data/RHL_wav. If you would like to change the defaults, change the associated variables in bin/preprocess_data_audio.py.

Audio-only Speech Recognition

bin/preprocess_data_audio.py expects the following positional arguments.

Argument          Description
output_dir_train  Output dir for storing training files (with a trailing slash)
output_dir_dev    Output dir for storing validation files (with a trailing slash)
output_dir_test   Output dir for storing test files (with a trailing slash)
train_split       Float giving the fraction of the data used for training
dev_split         Float giving the fraction of the data used for validation
test_split        Float giving the fraction of the data used for testing

Have a look at bin/preprocess_audio.sh for a sample usage. This script runs bin/preprocess_data_audio.py with default storage locations and default data-split percentages.

From the main project directory, open a terminal and type:

$ ./bin/preprocess_audio.sh

After this step, all prepared data files (train, dev, test) will be stored in the data/clean_data folder.

Audio Visual Speech Recognition (AVSR)

Preparing data for training Autoencoder

bin/preprocess_auto_enc.py expects two positional arguments and two optional arguments.

Positional argument  Description
video_dir            Directory containing the input video files (with a trailing slash)
output_dir           Output dir for storing the processed data (with a trailing slash)
Optional argument    Description
--max_videos n       n = number of videos to use for preprocessing (default: 1)
--screen_display     If given, display the video being processed.

Have a look at bin/preprocessing_AE.sh. This script runs bin/preprocess_auto_enc.py with default values.

From the main project directory, open a terminal and type:

$ ./bin/preprocessing_AE.sh

Training RBMs and Autoencoder

The script bin/AE_training.py deals with training the RBMs and the autoencoder. It first trains the RBMs and then, using their weights, trains the main autoencoder.
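
The sketch below illustrates this greedy pretraining idea with a tiny NumPy RBM trained by one-step contrastive divergence (CD-1), whose weights then seed the autoencoder. It is a simplified illustration under assumed layer sizes, not the actual AE_training.py implementation:

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train a binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden activations given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.rand(*h_prob.shape) < h_prob).astype(float)
        # Negative phase: one reconstruction step.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # Approximate gradient and parameter update.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid, b_vis

# Toy "mouth-region" data, flattened to vectors -- illustrative only.
X = rng.rand(256, 64)

# Greedy layer-wise pretraining: RBM 1 on the inputs, RBM 2 on RBM 1's codes.
W1, b1, c1 = train_rbm(X, 32)
H1 = sigmoid(X @ W1 + b1)
W2, b2, c2 = train_rbm(H1, 16)

# The autoencoder is then initialized from the RBM weights: the encoder uses
# W1, W2 and the decoder their transposes (tied weights), before being
# fine-tuned end-to-end on a reconstruction loss.
encoder_weights = [W1, W2]
decoder_weights = [W2.T, W1.T]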

The bash script bin/run_AE_training.sh runs bin/AE_training.py with default settings.

$ ./bin/run_AE_training.sh

Preparing data for AVSR

Command line arguments for bin/preprocess_audio_video.py are the same as in the audio-only case. Have a look at bin/preprocess_AVSR.sh for example usage. Open a terminal, change to the main project directory, and type:

$ ./bin/preprocess_AVSR.sh

After this step, there will be three files for each clip name. For example, if a clip is named 'xyz', there will be 'xyz.wav' (the actual audio), 'xyz.json' (the visual features), and 'xyz.txt' (the transcript).
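
A quick way to sanity-check the preprocessing output is to group the files by basename and verify that every clip has all three pieces. This is a helper sketch; the data directory below is an assumption, so adjust it to wherever your prepared files live:

import glob
import os

DATA_DIR = "data/clean_data"  # assumption: where the prepared files were written

# Collect clip names from the wav files, then check for the json and txt twins.
names = sorted(os.path.splitext(os.path.basename(path))[0]
               for path in glob.glob(os.path.join(DATA_DIR, "*.wav")))

for name in names:
    missing = [ext for ext in (".json", ".txt")
               if not os.path.exists(os.path.join(DATA_DIR, name + ext))]
    if missing:
        print("{}: missing {}".format(name, ", ".join(missing)))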

Training

Original DeepSpeech

The original Deep Speech model provides many command line options. To view them, open the main script directly, or type:

$ ./DeepSpeech.py --help 

To run the original Deep Speech code with a sample dataset (called LDC93S1) and default settings, run:

$ ./bin/run-ldc93s1.sh

This script first downloads the LDC93S1 dataset to data/ldc93s1/ and then runs DeepSpeech.py. It trains on the LDC93S1 dataset, outputs stats for each epoch, and finally outputs a WER report for the dev and test data.

DeepSpeech_RHL.py

Any code modifications for Red Hen Lab are reflected in DeepSpeech_RHL.py. One such modification is that DeepSpeech_RHL.py allows transcripts to contain digits [0-9], unlike the original DeepSpeech.py.
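
The difference shows up only in which characters a transcript may contain. The toy check below illustrates the relaxed character set; the regexes are illustrative and not taken from the project's validation code:

import re

# Original DeepSpeech-style transcripts: lowercase letters, spaces, apostrophes.
AUDIO_ONLY = re.compile(r"^[a-z ']+$")
# DeepSpeech_RHL.py additionally accepts digits 0-9.
WITH_DIGITS = re.compile(r"^[a-z0-9 ']+$")

transcript = "the 7 o'clock news"
print(bool(AUDIO_ONLY.match(transcript)))   # False
print(bool(WITH_DIGITS.match(transcript)))  # True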

To run the modified DeepSpeech on your system (with default settings), open a terminal and run:

$ ./bin/run_case_HPC.sh
# This script trains on your data (placed at data/clean_data/)
# and finally exports the model to data/export/.

$ ./bin/run-ldc93s1_RHL.sh
# This script runs on the LDC93S1 dataset. It doesn't export any model.

DeepSpeech_RHL_AVSR.py

This script deals with audio-visual speech recognition. Before running it, make sure that all prepared data is in the data/clean_data/ directory (see the Data-Preprocessing section above for details).

To train the AVSR model (using the data placed at data/clean_data/), open a terminal and run:

$ ./bin/run_case_HPC_AVSR.sh

Note: Feel free to modify any of the above scripts for your use.

Checkpointing

During training, so-called checkpoints are stored on disk at a configurable time interval. Checkpoints allow training to be interrupted (including by unexpected failures) and continued later without losing hours of training time. Resuming from checkpoints happens automatically by simply (re)starting training with the same --checkpoint_dir as the former run.

Be aware, however, that checkpoints are only valid for the same model geometry they were generated from. In other words, if you see error messages about Tensors with incompatible dimensions, this is most likely due to an incompatible model change. The usual way out is to wipe all checkpoint files in the checkpoint directory, or to change the directory, before starting training.
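
For intuition, resuming works the way checkpoint restoring normally does in TensorFlow 1.x: if the checkpoint directory already holds compatible variables, they are restored instead of re-initialized. A conceptual sketch (not the project's code; the directory name stands in for --checkpoint_dir):

import os
import tensorflow as tf  # TensorFlow 1.x, as used by this project

checkpoint_dir = "checkpoints/"  # plays the role of --checkpoint_dir
if not os.path.isdir(checkpoint_dir):
    os.makedirs(checkpoint_dir)

# A trivial graph standing in for the real model.
step = tf.Variable(0, name="global_step", trainable=False)
increment = tf.assign_add(step, 1)
saver = tf.train.Saver()

with tf.Session() as sess:
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest:
        saver.restore(sess, latest)  # same geometry -> restores cleanly
    else:
        sess.run(tf.global_variables_initializer())
    sess.run(increment)
    saver.save(sess, os.path.join(checkpoint_dir, "model"), global_step=step)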

Some Training Results

Audio-only Speech Recognition

Here are some of the results I obtained while running the code on the CWRU HPC. The script bin/run_case_HPC.sh was used to produce them.

These results are based on a one-hour-long audio file. The file was split into 634 .wav files (see Data-Preprocessing). 90% of the files were used for training and 5% each for validation and testing.

  • Variable dropout rates for the feedforward layers: dropout_rate = 0.05 and dropout_rate = 0.10 (the corresponding result plots are not reproduced here).

Exporting model and Testing

If the --export_dir parameter is provided to DeepSpeech_RHL.py, a model is exported to that directory during training. The exported model can then be used to predict transcripts for new audio/video files.

There are two scripts:

  1. For the audio-only model: run_exported_model_audio.py
  2. For AVSR: run_exported_model_AVSR.py

Both scripts expect the following arguments:

Argument           Description
-d, --export_dir   Directory where the trained model's meta graph and data were exported
-n, --model_name   Name of the exported model
-af, --wav_file    (Audio-only model only) Location of the wav file
-vf, --video_file  Location of the video file. For the audio-only model, this option has no effect if --wav_file is given.
Option             Description
--use_spell_check  Use a spell-correction system (KenLM) on the transcripts decoded from the RNN.
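
For reference, KenLM-style rescoring ranks candidate transcripts by a language-model score. The sketch below shows the general idea using the kenlm Python package; the model path is a placeholder and this is not the project's actual spell-correction code:

import kenlm  # requires the kenlm Python package

# Hypothetical language model binary -- supply your own trained model.
lm = kenlm.Model("data/lm/lm.binary")

# Higher (less negative) log10 scores indicate more plausible word sequences,
# which is the signal used to prefer one candidate transcript over another.
for candidate in ["speech recognition works", "speech wreck ignition works"]:
    print(candidate, lm.score(candidate, bos=True, eos=True))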

Usage examples:

Audio only Speech Model

  • For running an exported model with default settings, run:

     $ python ./bin/run_exported_model_audio.py
    

    Note: By default, this script runs on the data/ldc93s1/LDC93S1.wav file. If you don't have the LDC93S1 dataset downloaded, run: $ python -u bin/import_ldc93s1.py ./data/ldc93s1

  • Using command line options to run the exported model:

    Finding the transcript for an audio/video file using the audio-only model:

     $ python ./bin/run_exported_model_audio.py -d path_to_exported_model/ -n model_name -af /path_to_wav_file/file.wav 
    
    
     $ python ./bin/run_exported_model_audio.py -d path_to_exported_model/ -n model_name -vf /path_to_video_file/file.mp4 
    

Audio-Video Speech Model (AVSR)

Finding the transcript of a video file using AVSR:

$ python ./bin/run_exported_model_AVSR.py -d path_to_exported_model/ -n model_name -vf /path_to_video_file/file.mp4 

Running code at Case HPC

Please read run_at_HPC.md for running this project at Case HPC.

Acknowledgments
