
scopeInfinity / Video2description

Licence: apache-2.0
Video to Text: Generates a description in natural language for a given video (Video Captioning)

Programming Languages

python

Projects that are alternatives to or similar to Video2description

Auto Editor
Auto-Editor: Effort free video editing!
Stars: ✭ 382 (+257.01%)
Mutual labels:  video-processing, audio-processing
Deep Embedded Memory Networks
https://arxiv.org/abs/1707.00836
Stars: ✭ 19 (-82.24%)
Mutual labels:  deep-neural-networks, video-processing
Deep Learning Time Series
List of papers, code and experiments using deep learning for time series forecasting
Stars: ✭ 796 (+643.93%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
Personality Detection
Implementation of a hierarchical CNN based model to detect Big Five personality traits
Stars: ✭ 338 (+215.89%)
Mutual labels:  lstm-neural-networks, cnn-keras
Sarcasm Detection
Detecting sarcasm on Twitter using both traditional machine learning and deep learning techniques.
Stars: ✭ 73 (-31.78%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
Real Time Gesrec
Real-time Hand Gesture Recognition with PyTorch on EgoGesture, NvGesture, Jester, Kinetics and UCF101
Stars: ✭ 339 (+216.82%)
Mutual labels:  deep-neural-networks, video-processing
Arcan
Arcan - [Display Server, Multimedia Framework, Game Engine] -> "Desktop Engine"
Stars: ✭ 885 (+727.1%)
Mutual labels:  video-processing, audio-processing
video-audio-tools
To process/edit video and audio with Python+FFmpeg. [Simple and practical] Python+FFmpeg-based video and audio processing/editing.
Stars: ✭ 164 (+53.27%)
Mutual labels:  video-processing, audio-processing
Bitcoin Price Prediction Using Lstm
Bitcoin price Prediction ( Time Series ) using LSTM Recurrent neural network
Stars: ✭ 67 (-37.38%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
Image Captioning
Image Captioning: Implementing the Neural Image Caption Generator with python
Stars: ✭ 52 (-51.4%)
Mutual labels:  lstm-neural-networks, image-captioning
Vectorhub
Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)
Stars: ✭ 317 (+196.26%)
Mutual labels:  video-processing, audio-processing
Pytorch Learners Tutorial
PyTorch tutorial for learners
Stars: ✭ 97 (-9.35%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
eloquent-ffmpeg
High-level API for FFmpeg's Command Line Tools
Stars: ✭ 71 (-33.64%)
Mutual labels:  video-processing, audio-processing
Predictive Maintenance Using Lstm
Example of Multiple Multivariate Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras.
Stars: ✭ 352 (+228.97%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
DuME
A fast, versatile, easy-to-use and cross-platform Media Encoder based on FFmpeg
Stars: ✭ 66 (-38.32%)
Mutual labels:  video-processing, audio-processing
Mlt
MLT Multimedia Framework
Stars: ✭ 836 (+681.31%)
Mutual labels:  video-processing, audio-processing
lecture-demos
Demonstrations for the interactive exploration of selected core concepts of audio, image and video processing as well as related topics
Stars: ✭ 12 (-88.79%)
Mutual labels:  video-processing, audio-processing
tennis action recognition
Using deep learning to perform action recognition in the sport of tennis.
Stars: ✭ 17 (-84.11%)
Mutual labels:  video-processing, cnn-keras
Skater
Python Library for Model Interpretation/Explanations
Stars: ✭ 973 (+809.35%)
Mutual labels:  deep-neural-networks, lstm-neural-networks
Automatic Image Captioning
Generating Captions for images using Deep Learning
Stars: ✭ 84 (-21.5%)
Mutual labels:  lstm-neural-networks, image-captioning

Video Captioning

Generates a natural-language caption for a given video clip

Branches: VideoCaption (1a2124d), VideoCaption_catt (647e73b4)

Model

The model generates a natural-language sentence word by word.

[Figure: word-by-word sentence generation]

[Figures: Audio SubModel (audio_model) · Video SubModel (video_model) · Sentence Generation SubModel (sentence_generation)]
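
As an illustration of the word-by-word scheme, here is a minimal greedy-decoding sketch. The names (`model`, `vocab`, the `<start>`/`<end>` tokens) and the input layout are assumptions for illustration, not the repository's actual interfaces:

import numpy as np

def greedy_caption(model, audio_feats, video_feats, vocab, max_len=20):
    # vocab maps word -> index; build the reverse map for decoding
    inv_vocab = {idx: word for word, idx in vocab.items()}
    words = [vocab["<start>"]]
    for _ in range(max_len):
        # assumed interface: the model returns next-word probabilities
        # for every position of the partial sentence
        probs = model.predict([audio_feats, video_feats, np.array([words])])
        next_id = int(np.argmax(probs[0, -1]))  # most likely next word
        if inv_vocab[next_id] == "<end>":
            break
        words.append(next_id)
    return " ".join(inv_vocab[i] for i in words[1:])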

Context extraction for the temporal attention model when generating the i-th word:

[Figure: temporal attention model]
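
A rough numpy sketch of one common formulation of temporal attention (a softmax-weighted sum of per-frame features, scored against the decoder state at step i); the weight matrices and shapes are illustrative assumptions, not the repository's code:

import numpy as np

def temporal_context(frame_feats, decoder_state, W_f, W_s):
    # frame_feats: (T, D) per-frame features; decoder_state: (H,) at word i
    # W_f: (D, K), W_s: (K, H) projection matrices
    scores = frame_feats @ W_f @ (W_s @ decoder_state)  # (T,) relevance scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                # softmax attention weights
    return alpha @ frame_feats                          # (D,) context vector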

Results - f5c22f7

Test videos with good results

| Video | Generated caption |
|-------|-------------------|
| 12727 | two men are talking about a cooking show |
| 12501 | a woman is cooking |
| 10802 | a dog is running around a field |
| 12968 | a woman is talking about a makeup face |
| 12937 | a man is driving a car down the road |
| 12939 | a man is cooking in a kitchen |
| 12683 | a man is playing a video game |
| 12901 | two men are playing table tennis in a stadium |
| 12994 | a man is talking about a computer program |

Test videos with poor results

| Video | Generated caption |
|-------|-------------------|
| 12589 | a person is playing with a toy |
| 12966 | a man is walking on the field |
| 12908 | a man is standing in a gym |

Try it out!!!

$ docker-compose pull
$ docker-compose up
  • Browse to http://localhost:8080/
    • The backend might take a few minutes to reach a stable state.
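
If you script against the stack, a small poll loop (standard library only; the URL matches the docker-compose default above) can wait out the warm-up:

import time, urllib.request

while True:
    try:
        urllib.request.urlopen("http://localhost:8080/", timeout=5)
        break  # frontend answered, the stack is up
    except OSError:
        time.sleep(10)  # backend may still be loading the model
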
Execution without Docker
  • You can always go through backend.Dockerfile and frontend.Dockerfile to understand the setup better.
  • Update src/config.json as per your requirements and use those paths during the upcoming steps.
    • To learn more about any field, search for its references in the codebase.
  • Install miniconda
  • Get glove.6B.300d.txt from https://nlp.stanford.edu/projects/glove/ (a download sketch follows this list)
  • Install ffmpeg
    • Configure, build, and install FFmpeg from source with shared libraries:
$ git clone 'https://github.com/FFmpeg/FFmpeg.git'
$ cd FFmpeg
$ ./configure --enable-shared  # use --prefix to install into a custom directory
$ make
$ sudo make install  # or run `make install` as root
  • If required, use https://github.com/tylin/coco-caption/ for scoring the model.
  • Then create the conda environment from environment.yml
    • $ conda env create -f environment.yml
  • And activate it, using the environment name defined in environment.yml
$ conda activate <env-name>
  • Turn up the backend
    • src$ python -m backend.parser server --start --model /path/to/model
  • Turn up the web frontend
    • src$ python -m frontend.app
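
The GloVe step above can be scripted; a minimal sketch, assuming the standard glove.6B.zip bundle from the Stanford site (~822 MB) and the data-directory layout described below:

import urllib.request, zipfile

# fetch the bundle and extract only the 300-dimensional vectors
urllib.request.urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", "glove.6B.zip")
with zipfile.ZipFile("glove.6B.zip") as zf:
    zf.extract("glove.6B.300d.txt", "/path/to/data_dir/glove/")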

Info

The data directory and the working directory can both be the project root directory.

Data Directory

| File | Reference |
|------|-----------|
| /path/to/data_dir/VideoDataset/videodatainfo_2017.json | http://ms-multimedia-challenge.com/2017/dataset |
| /path/to/data_dir/VideoDataset/videos/[0-9]+.mp4 | Download the videos referenced in the dataset above |
| /path/to/data_dir/glove/glove.6B.300d.txt | https://nlp.stanford.edu/projects/glove/ |
| /path/to/data_dir/VideoDataset/cache_40_224x224/[0-9]+.npy | Video cache files, created on the fly |
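
The cache naming suggests 40 frames at 224x224 per video. A hypothetical sketch of how such .npy files could be produced (OpenCV-based; not the repository's actual code):

import cv2
import numpy as np

def cache_video(video_path, out_path, n_frames=40, size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    np.save(out_path, np.stack(frames))  # shape: (n_frames, 224, 224, 3)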

Working Directory

| File | Content |
|------|---------|
| /path/to/working_dir/glove.dat | Pickle-dumped GloVe embedding |
| /path/to/working_dir/vocab.dat | Pickle-dumped vocabulary words |
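
A plausible recipe for glove.dat (the exact pickle layout is defined by the repository; this only sketches the idea of parsing the text vectors once and pickling them):

import pickle
import numpy as np

embeddings = {}
with open("/path/to/data_dir/glove/glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.rstrip().split(" ")
        embeddings[word] = np.asarray(vec, dtype=np.float32)

with open("/path/to/working_dir/glove.dat", "wb") as f:
    pickle.dump(embeddings, f)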

Download Dataset

  • Execute python videohandler.py from the VideoDataset directory

Execution

It currently supports train, predict, and server modes. Use the following command for a full explanation:

src$ python -m backend.parser -h

Training Methods

  • Try Iterative Learning
  • Try Random Learning

Evaluation

Prerequisite

cd /path/to/eval_dir/
git clone 'https://github.com/tylin/coco-caption.git' cococaption
ln /path/to/working_dir/cocoeval.py cococaption/

Evaluate

# Adjust parser.py to change the number of test examples considered in the evaluation
python parser.py predict save_all_test
python /path/to/eval_dir/cocoeval.py <results file>.txt
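
Under the hood, coco-caption scoring compares reference and hypothesis captions per video id. A minimal usage sketch with the pycocoevalcap modules from the repository cloned above (the toy captions here are made up):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

refs = {"12727": ["two men are talking about a cooking show"]}  # ground truth
hyps = {"12727": ["two men talk about a cooking show"]}         # model output
print(Cider().compute_score(refs, hyps))  # (corpus score, per-video scores)
print(Bleu(4).compute_score(refs, hyps))  # Bleu_1..Bleu_4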

Sample Evaluation while training

| Commit | Training | Total | CIDEr | Bleu_4 | ROUGE_L | METEOR | Model Filename |
|--------|----------|-------|-------|--------|---------|--------|----------------|
| 647e73b4 | 10 epochs | 1.1642 | 0.1580 | 0.3090 | 0.4917 | 0.2055 | CAttention_ResNet_D512L512_G128G64_D1024D0.20BN_BDGRU1024_D0.2L1024DVS_model.dat_4990_loss_2.484_Cider0.360_Blue0.369_Rouge0.580_Meteor0.256 |
| 1a2124d | 17 epochs | 1.1599 | 0.1654 | 0.3022 | 0.4849 | 0.2074 | ResNet_D512L512_G128G64_D1024D0.20BN_BDLSTM1024_D0.2L1024DVS_model.dat_4987_loss_2.203_Cider0.342_Blue0.353_Rouge0.572_Meteor0.256 |
| f5c22f7 | 17 epochs | 1.1559 | 0.1680 | 0.3000 | 0.4832 | 0.2047 | ResNet_D512L512_G128G64_D1024D0.20BN_BDGRU1024_D0.2L1024DVS_model.dat_4983_loss_2.350_Cider0.355_Blue0.353_Rouge0.571_Meteor0.247_TOTAL_1.558_BEST |
| bd072ac | 11 CPU-hrs with multiprocessing (16 epochs) | 1.0736 | 0.1528 | 0.2597 | 0.4674 | 0.1936 | ResNet_D512L512_D1024D0.20BN_BDGRU1024_D0.2L1024DVS_model.dat_4986_loss_2.306_Cider0.347_Blue0.328_Rouge0.560_Meteor0.246 |
| 3ccf5d5 | 15 CPU-hrs | 1.0307 | 0.1258 | 0.2535 | 0.4619 | 0.1895 | res_mcnn_rand_b100_s500_model.dat_model1_3ccf5d5 |

Check the Specifications section for a model comparison.

The temporal attention model is on the VideoCaption_catt branch.

Pre-trained Models : https://drive.google.com/open?id=1gexBRQfrjfcs7N5UI5NtlLiIR_xa69tK

Web Server

  • Start the server (-s) to compute predictions (within the conda environment)
python parser.py server -s -m <path/to/correct/model>
  • Check config.json for configuration options.
  • Execute python app.py from the webserver directory (no conda environment needed)
    • Make sure the process can create new files inside $UPLOAD_FOLDER
  • Open http://webserver:5000/ to reach the web server for testing (under the default configuration)

Specifications

Commit: 3ccf5d5
  • ResNet over LSTM for feature extraction
  • Word-by-word sentence generation with an LSTM, conditioned on the previous prediction
  • Random Dataset Learning of training data
  • Vocab Size 9448
  • Glove of 300 Dimension
Commit: bd072ac
  • ResNet over bidirectional GRU for feature extraction
  • Sequential learning of the training data
  • Batch normalization + a few more tweaks in the model
  • Bleu, CIDEr, Rouge, Meteor score generation for validation
  • Multiprocessing Keras
Commit: f5c22f7
  • Audio with bidirectional GRU
Commit: 1a2124d
  • Audio with bidirectional LSTM
Commit: 647e73b
  • Audio with bidirectional GRU using temporal attention for context

Image Captioning

Generates captions for given images

Branch : onehot_gen

Commit : 898f15778d40b67f333df0a0e744a4af0b04b16c

Trained Model : https://drive.google.com/open?id=1qzMCAbh_tW3SjMMVSPS4Ikt6hDnGfhEN

Categorical Crossentropy Loss : 0.58
