
ski-net / lipnet

Licence: other
LipNet with Gluon

Programming Languages

python
Jupyter Notebook

Projects that are alternatives of or similar to lipnet

Dog-Breed-Identification-Gluon
Kaggle 120-class dog breed classification, implemented in Gluon
Stars: ✭ 45 (+181.25%)
Mutual labels:  mxnet, gluon
Aws Machine Learning University Accelerated Nlp
Machine Learning University: Accelerated Natural Language Processing Class
Stars: ✭ 1,695 (+10493.75%)
Mutual labels:  mxnet, gluon
Mxnet Gluon Syncbn
MXNet Gluon Synchronized Batch Normalization Preview
Stars: ✭ 78 (+387.5%)
Mutual labels:  mxnet, gluon
Aws Machine Learning University Accelerated Cv
Machine Learning University: Accelerated Computer Vision Class
Stars: ✭ 1,068 (+6575%)
Mutual labels:  mxnet, gluon
Gluon Nlp
NLP made easy
Stars: ✭ 2,344 (+14550%)
Mutual labels:  mxnet, gluon
Ko en neural machine translation
Korean English NMT(Neural Machine Translation) with Gluon
Stars: ✭ 55 (+243.75%)
Mutual labels:  mxnet, gluon
Mxnet Gluon Style Transfer
Neural Style and MSG-Net
Stars: ✭ 105 (+556.25%)
Mutual labels:  mxnet, gluon
Efficientnet
Gluon implementation of EfficientNet and EfficientNet-lite
Stars: ✭ 30 (+87.5%)
Mutual labels:  mxnet, gluon
Imgclsmob
Sandbox for training deep learning networks
Stars: ✭ 2,405 (+14931.25%)
Mutual labels:  mxnet, gluon
Single Path One Shot Nas Mxnet
Single Path One-Shot NAS MXNet implementation with full training and searching pipeline. Support both Block and Channel Selection. Searched models better than the original paper are provided.
Stars: ✭ 136 (+750%)
Mutual labels:  mxnet, gluon
Quantization.mxnet
Simulate quantization and quantization aware training for MXNet-Gluon models.
Stars: ✭ 42 (+162.5%)
Mutual labels:  mxnet, gluon
ResidualAttentionNetwork
A Gluon implement of Residual Attention Network. Best acc on cifar10-97.78%.
Stars: ✭ 104 (+550%)
Mutual labels:  mxnet, gluon
Gluonrank
Ranking made easy
Stars: ✭ 39 (+143.75%)
Mutual labels:  mxnet, gluon
Gluon2pytorch
Gluon to PyTorch deep neural network model converter
Stars: ✭ 70 (+337.5%)
Mutual labels:  mxnet, gluon
Sockeye
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet
Stars: ✭ 990 (+6087.5%)
Mutual labels:  mxnet, gluon
Mxnet Im2rec tutorial
this simple tutorial will introduce how to use im2rec for mx.image.ImageIter , ImageDetIter and how to use im2rec for COCO DataSet
Stars: ✭ 97 (+506.25%)
Mutual labels:  mxnet, gluon
Aws Machine Learning University Accelerated Tab
Machine Learning University: Accelerated Tabular Data Class
Stars: ✭ 718 (+4387.5%)
Mutual labels:  mxnet, gluon
Mxnet Centernet
Gluon implementation of "Objects as Points", aka "CenterNet"
Stars: ✭ 29 (+81.25%)
Mutual labels:  mxnet, gluon
Mxnet.sharp
.NET Standard bindings for Apache MxNet with Imperative, Symbolic and Gluon Interface for developing, training and deploying Machine Learning models in C#. https://mxnet.tech-quantum.com/
Stars: ✭ 134 (+737.5%)
Mutual labels:  mxnet, gluon
gluon-faster-rcnn
Faster R-CNN implementation with MXNet Gluon API
Stars: ✭ 31 (+93.75%)
Mutual labels:  mxnet, gluon

LipNet: End-to-End Sentence-level Lipreading


This is a Gluon implementation of LipNet: End-to-End Sentence-level Lipreading

[Figure: network structure]

[Figure: sample output]
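The network follows the paper's design: three spatiotemporal convolution blocks, two bidirectional GRUs, and a dense layer trained with CTC loss. For orientation, here is a rough Gluon sketch of that architecture; the layer sizes follow the paper, and the repo's actual network may differ in detail.

from mxnet import gluon
from mxnet.gluon import nn, rnn

class LipNetSketch(gluon.HybridBlock):
    """Rough sketch of the LipNet architecture, not the repo's exact network."""
    def __init__(self, vocab_size=28, dr_rate=0.5, **kwargs):
        # vocab_size: e.g. 26 letters + space + CTC blank
        super(LipNetSketch, self).__init__(**kwargs)
        self.features = nn.HybridSequential()
        for channels in (32, 64, 96):
            self.features.add(
                nn.Conv3D(channels, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.Activation('relu'),
                nn.MaxPool3D(pool_size=(1, 2, 2), strides=(1, 2, 2)),
                nn.Dropout(dr_rate))
        self.gru = rnn.GRU(256, num_layers=2, bidirectional=True, layout='NTC')
        self.out = nn.Dense(vocab_size, flatten=False)

    def hybrid_forward(self, F, x):
        # x: (batch, channels, frames, height, width)
        y = self.features(x)
        y = F.transpose(y, axes=(0, 2, 1, 3, 4))   # (batch, frames, channels, h, w)
        y = F.reshape(y, shape=(0, 0, -1))         # flatten spatial dims per frame
        y = self.gru(y)                            # (batch, frames, 2 * 256)
        return self.out(y)                         # (batch, frames, vocab) for CTC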

Requirements

  • Python 3.6.4
  • MXNet 1.3.0
  • Required disk space: 35 GB

Install the required packages with:

pip install -r requirements.txt

The Data

  • The GRID audiovisual sentence corpus (http://spandh.dcs.shef.ac.uk/gridcorpus/)
    • GRID is a large multi-talker audiovisual sentence corpus to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now". The corpus, together with transcriptions, is freely available for research use.
  • Video: normal quality (about 480 MB per talker)
    • Each video contains one sentence consisting of 6 words.
  • Align: word alignments (about 190 KB per talker)
    • Each align file lists the 6 words of a sentence, each with its start and end time. This tutorial needs only the sentence itself, because training uses CTC loss; a minimal parsing sketch is shown below.
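For illustration, a minimal sketch (a hypothetical helper, not part of this repo) that reduces a GRID .align file to its sentence by dropping the silence ('sil') tokens, which is all CTC training needs:

def read_align_sentence(path):
    # Each line of a .align file is: <start> <end> <word>
    words = []
    with open(path) as f:
        for line in f:
            _start, _end, word = line.split()
            if word != 'sil':                 # drop leading/trailing silence
                words.append(word)
    return ' '.join(words)

# e.g. read_align_sentence('./data/align/s2/bbbf7p.align')
# -> 'bin blue by f seven please'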

Pretrained model

You can train the model yourself by following the sections below, test a pretrained model's inference, or resume training from the model checkpoint. To work with the provided pretrained model, first download it, then run one of the provided Python scripts for inference (infer.py) or training (main.py).

  • Run inference with the following:
python infer.py model_path='checkpoint/epoches_81_loss_15.7157'
  • Resume training with the following:
python main.py model_path='checkpoint/epoches_81_loss_15.7157'

Prepare the Data

You can prepare the data yourself, or you can download preprocessed data.

Option 1 - Download the preprocessed data

There are two download routes provided for the preprocessed data.

Download and untar the data

To download the tarred and zipped files by link, download the following archives and extract them into a folder called data at the root of this example folder. You should end up with the following structure:

/lipnet/data/align
/lipnet/data/datasets

Use AWS CLI to sync the data

To download the folders and files already extracted, use the following AWS CLI command. It creates the folder structure for you. Run it from /lipnet/:

 aws s3 sync s3://mxnet-public/lipnet/data .

Option 2 (part 1): Download the raw dataset

  • Outputs
    • Total movies (mp4): 16 GB
    • Total aligns (text): 134 MB
  • Arguments
    • src_path : Path for videos (default='./data/mp4s/')
    • align_path : Path for aligns (default='./data/')
    • n_process : Number of processes (default=1)
cd ./utils && python download_data.py --n_process=$(nproc)

Option 2 (part 2): Preprocess the raw dataset by extracting the mouth images from each video and saving them

Preprocess (preprocess_data.py)

  • If the face landmark model is not present, it is downloaded automatically.
  • Using face landmark detection, the script extracts the mouth region from each video frame.
  • Example:

  • video: ./data/mp4s/s2/bbbf7p.mpg

  • align(target): ./data/align/s2/bbbf7p.align : 'sil bin blue by f seven please sil'

  • Video to images (75 frames)

[Images: frames 0 through 74 of the video]

  • Extract the mouth region from each image

[Images: cropped mouth regions for frames 0 through 74]

  • Save the resulting images into tgt_path (a sketch of the mouth-cropping step follows below).
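For orientation, here is a hedged sketch of the mouth-cropping idea, assuming dlib and its 68-point landmark predictor (points 48-67 outline the mouth); the repo's preprocess_data.py may differ in detail.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')  # assumed model file

def crop_mouth(frame, pad=10):
    # frame: an RGB image as a numpy array; returns the padded mouth crop, or None
    faces = detector(frame, 1)
    if not faces:
        return None
    landmarks = predictor(frame, faces[0])
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0) - pad
    x1, y1 = pts.max(axis=0) + pad
    return frame[max(y0, 0):y1, max(x0, 0):x1]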

How to run the preprocess script

  • Arguments

    • src_path : Path for videos (default='./data/mp4s/')
    • tgt_path : Path for preprocessed images (default='./data/datasets/')
    • n_process : Number of processes (default=1)
  • Outputs

    • Total images (png): 19 GB
  • Elapsed time

    • About 54 hours using 1 process
    • Using multiple processes reduces the time roughly in proportion to the number of processes.
      • e.g.) about 9 hours using 6 processes

You can run the preprocessing with just one processor, but this will take a long time (>48 hours). To use all of the available processors, use the following command:

cd ./utils && python preprocess_data.py --n_process=$(nproc)

Output: Data structure of the preprocessed data

The training data folder should look like:
<train_data_root>
                |--datasets
                |        |--s1
                |        |   |--bbir7s
                |        |   |    |--mouth_000.png
                |        |   |    |--mouth_001.png
                |        |   |    ...
                |        |   |--bgaa8p
                |        |   |    |--mouth_000.png
                |        |   |    |--mouth_001.png
                |        |   |    ...
                |        |--s2
                |            ...
                |--align
                        |--bw1d8a.align
                        |--bggzzs.align
                        ...

Training

After you have acquired the preprocessed data, you are ready to train the LipNet model.

  • According to LipNet: End-to-End Sentence-level Lipreading, four (S1, S2, S20, S22) of the 34 subjects are used for evaluation. The other subjects are used for training.

  • When training on multiple GPUs, it is recommended to scale the batch size by the number of GPUs (see the sketch after the training command below).

    • e.g.) 128 batch size on 1 GPU → 256 batch size on 2 GPUs
  • Arguments

    • batch_size : Batch size (default=64)
    • epochs : Total epochs (default=100)
    • image_path : Path for lip image files (default='./data/datasets/')
    • align_path : Path for align files (default='./data/align/')
    • dr_rate : Dropout rate (default=0.5)
    • num_gpus : Number of GPUs (if num_gpus is 0, the CPU is used) (default=1)
    • num_workers : Number of workers for data loading (default=0)
    • model_path : Path of pretrained model (default=None)
python main.py
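As context for the batch-size recommendation above, a minimal sketch (not from this repo) using Gluon's split_and_load: one batch is sliced evenly across devices, so each GPU sees batch_size / num_gpus samples.

import mxnet as mx
from mxnet import gluon

num_gpus = 2
ctx = [mx.gpu(i) for i in range(num_gpus)] if num_gpus > 0 else [mx.cpu()]

# Hypothetical batch of lip clips: (batch, channels, frames, height, width)
batch = mx.nd.zeros((256, 3, 75, 50, 100))
shards = gluon.utils.split_and_load(batch, ctx_list=ctx, batch_axis=0)
for shard in shards:
    print(shard.shape[0], shard.context)   # 128 samples on each of gpu(0), gpu(1)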

Test Environment

  • 72 CPU cores

  • 1 GPU (NVIDIA Tesla V100 SXM2 32 GB)

  • 128 Batch Size

    • It takes over 24 hours (60 epochs) to get good results.

Inference

  • Arguments
    • batch_size : Batch size (default=64)
    • image_path : Path for lip image files (default='./data/datasets/')
    • align_path : Path for align files (default='./data/align/')
    • num_gpus : Number of GPUs (if num_gpus is 0, the CPU is used) (default=1)
    • num_workers : Number of workers for data loading (default=0)
    • data_type : 'train' or 'valid' (default='valid')
    • model_path : Path of pretrained model (default=None)
python infer.py --model_path=$(model_path)
[Target]
['lay green with a zero again',
 'bin blue with r nine please',
 'set blue with e five again',
 'bin green by t seven soon',
 'lay red at d five now',
 'bin green in x eight now',
 'bin blue with e one now',
 'lay red at j nine now']
[Pred]
['lay green with s zero again',
 'bin blue with r nine please',
 'set blue with e five again',
 'bin green by t seven soon',
 'lay red at c five now',
 'bin green in x eight now',
 'bin blue with m one now',
 'lay red at j nine now']
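The predictions above differ from the targets in only three words ('a' → 's', 'd' → 'c', 'e' → 'm'). As a minimal sketch (not part of this repo), word error rate can be computed from such target/prediction lists:

def edit_distance(ref, hyp):
    # Levenshtein distance over word lists, using a single rolling row.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       prev + (r != h))   # substitution
    return row[-1]

def wer(targets, preds):
    errors = sum(edit_distance(t.split(), p.split()) for t, p in zip(targets, preds))
    return errors / sum(len(t.split()) for t in targets)

# For the batch above: 3 substitutions over 48 words -> WER = 0.0625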