
jssprz / visual_syntactic_embedding_video_captioning

Licence: MIT license
Source code of the paper titled *Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding*

Programming Languages

Python

Projects that are alternatives to or similar to visual_syntactic_embedding_video_captioning

delving-deeper-into-the-decoder-for-video-captioning
Source code for Delving Deeper into the Decoder for Video Captioning
Stars: ✭ 36 (+56.52%)
Mutual labels:  video-captioning, msvd, msr-vtt
MTL-AQA
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment [CVPR 2019]
Stars: ✭ 38 (+65.22%)
Mutual labels:  representation-learning, video-captioning
Video2Language
Generating video descriptions using deep learning in Keras
Stars: ✭ 22 (-4.35%)
Mutual labels:  video-captioning, video-to-text
autoencoders tensorflow
Automatic feature engineering using deep learning and Bayesian inference using TensorFlow.
Stars: ✭ 66 (+186.96%)
Mutual labels:  representation-learning
game-feature-learning
Code for paper "Cross-Domain Self-supervised Multi-task Feature Learning using Synthetic Imagery", Ren et al., CVPR'18
Stars: ✭ 68 (+195.65%)
Mutual labels:  representation-learning
ethereum-privacy
Profiling and Deanonymizing Ethereum Users
Stars: ✭ 37 (+60.87%)
Mutual labels:  representation-learning
Awesome-Captioning
A curated list of multimodal captioning research (including image captioning, video captioning, and text captioning)
Stars: ✭ 56 (+143.48%)
Mutual labels:  video-captioning
proto
Proto-RL: Reinforcement Learning with Prototypical Representations
Stars: ✭ 67 (+191.3%)
Mutual labels:  representation-learning
Patient2Vec
Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record
Stars: ✭ 85 (+269.57%)
Mutual labels:  representation-learning
ConvLSTM-PyTorch
ConvLSTM/ConvGRU (Encoder-Decoder) with PyTorch on Moving-MNIST
Stars: ✭ 202 (+778.26%)
Mutual labels:  encoder-decoder
Encoder-Forest
eForest: Reversible mapping between high-dimensional data and path rule identifiers using trees embedding
Stars: ✭ 22 (-4.35%)
Mutual labels:  encoder-decoder
jzon
A correct and safe JSON parser.
Stars: ✭ 78 (+239.13%)
Mutual labels:  encoder-decoder
poincare embedding
Poincaré Embedding
Stars: ✭ 36 (+56.52%)
Mutual labels:  representation-learning
MidcurveNN
Computation of Midcurve of Thin Polygons using Neural Networks
Stars: ✭ 19 (-17.39%)
Mutual labels:  encoder-decoder
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+556.52%)
Mutual labels:  pos-tagging
Video-Cap
🎬 Video Captioning: ICCV '15 paper implementation
Stars: ✭ 44 (+91.3%)
Mutual labels:  video-captioning
SimCLR
Pytorch implementation of "A Simple Framework for Contrastive Learning of Visual Representations"
Stars: ✭ 65 (+182.61%)
Mutual labels:  representation-learning
CodeT5
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
Stars: ✭ 390 (+1595.65%)
Mutual labels:  representation-learning
Embedding
A summary of embedding model code and study notes
Stars: ✭ 25 (+8.7%)
Mutual labels:  encoder-decoder
unsupervised-pos-tagging
Unsupervised part-of-speech tag estimation
Stars: ✭ 16 (-30.43%)
Mutual labels:  pos-tagging

Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding

PRs Welcome · Video Captioning and Deep Learning · Source code of a WACV'21 paper · MIT License

This repository contains the source code for the paper titled Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding. Video captioning is the task of predicting a semantically and syntactically correct sequence of words given a context video. In this paper, we consider syntactic representation learning to be an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, which depends only on the video, and a syntactic representation, which depends only on the Part-of-Speech (POS) tagging structure of the video description. We integrate this joint representation into an encoder-decoder architecture that we call the Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested the proposed architecture and obtained state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
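
As a rough illustration of the joint space described above, here is a minimal PyTorch sketch that maps a pooled visual feature and a POS-tag sequence into a common embedding and aligns them with a max-margin ranking loss. The dimensions, module names, and loss choice are illustrative assumptions, not the exact SemSynAN architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSyntacticEmbedding(nn.Module):
    # Sketch only: the real model's layers and sizes may differ.
    def __init__(self, visual_dim=2048, pos_vocab=50, pos_emb=64, hidden=512, joint_dim=300):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)       # video -> joint space
        self.pos_embedding = nn.Embedding(pos_vocab, pos_emb)     # POS tags of the caption
        self.pos_rnn = nn.GRU(pos_emb, hidden, batch_first=True)  # encode the tag sequence
        self.pos_proj = nn.Linear(hidden, joint_dim)              # syntax -> joint space

    def forward(self, visual_feats, pos_tags):
        # visual_feats: (B, visual_dim) pooled video descriptor
        # pos_tags:     (B, T) integer POS-tag ids of the description
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        _, h = self.pos_rnn(self.pos_embedding(pos_tags))
        s = F.normalize(self.pos_proj(h[-1]), dim=-1)
        return v, s

def ranking_loss(v, s, margin=0.2):
    # Hinge-based triplet loss over in-batch negatives (VSE-style alignment).
    scores = v @ s.t()                                  # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)
    cost_s = (margin + scores - pos).clamp(min=0)       # wrong syntax for a video
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # wrong video for a syntax
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()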

Table of Contents

  1. Model
  2. Requirements
  3. Manual
  4. Qualitative Results
  5. Quantitative Results
  6. Citation

Model

Figure: Video Captioning with Visual-Syntactic Embedding (SemSynAN)

Figure: Visual-Syntactic Embedding

Requirements

  1. Python 3.6
  2. PyTorch 1.2.0
  3. NumPy
  4. h5py

Manual

git clone --recursive https://github.com/jssprz/visual_syntactic_embedding_video_captioning.git

Download Data

mkdir -p data/MSVD && wget -i msvd_data.txt -P data/MSVD
mkdir -p data/MSR-VTT && wget -i msrvtt_data.txt -P data/MSR-VTT

To extract your own visual feature representations, you can use our visual-feature-extractor module.
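
If you prefer to build your own pipeline instead, the sketch below shows one way to pool per-frame CNN features and store them in an HDF5 file with h5py. The ResNet-152 backbone, the frame sampling, and the HDF5 layout are assumptions for illustration and may not match the module above.

# Extract per-frame CNN features and save them to an HDF5 file (illustrative layout).
import h5py
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.resnet152(pretrained=True)
backbone.fc = torch.nn.Identity()   # keep the 2048-d pooled features
backbone = backbone.eval().to(device)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_features(frame_paths):
    # Stack per-frame features into a (num_frames, 2048) array.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    return backbone(batch.to(device)).cpu().numpy()

# One dataset per video; the key naming ("video0", ...) and frame paths are assumptions.
with h5py.File("data/MSVD/my_visual_features.h5", "w") as f:
    f.create_dataset("video0", data=video_features(["frames/vid0_000.jpg", "frames/vid0_015.jpg"]))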

Training

If you want to train your own models, you can reuse the datasets' information, which is stored and tokenized in the corpus.pkl files. To construct these files, you can use the scripts we provide in the video_captioning_dataset module. The content of these files is organized as follows (a loading sketch appears after this list):

0: train_data: captions and indices of the training videos, in the format [corpus_widxs, vidxs], where:

  • corpus_widxs is a list of lists with the indices of words in the vocabulary
  • vidxs is a list of indices of video features in the features file

1: val_data: same format as train_data.

2: test_data: same format as train_data.

3: vocabulary: in the format {'word': count}.

4: idx2word: the vocabulary in the format {idx: 'word'}.

5: word_embeddings: the word vectors; the i-th row is the vector of the i-th word in the vocabulary.
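
A minimal sketch of loading and inspecting one of these files, assuming corpus.pkl stores the six fields in the positional order listed above:

import pickle

with open("data/MSVD/corpus.pkl", "rb") as f:
    corpus = pickle.load(f)

train_data = corpus[0]                                  # [corpus_widxs, vidxs]
vocabulary, idx2word, word_embeddings = corpus[3], corpus[4], corpus[5]

corpus_widxs, vidxs = train_data
first_caption = " ".join(idx2word[i] for i in corpus_widxs[0])
print("caption 0:", first_caption, "-> video feature row", vidxs[0])
print("vocab size:", len(vocabulary), "| embedding rows:", len(word_embeddings))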

We use the val_references.txt and test_references.txt files only for computing the evaluation metrics.

Testing

1. Download the pre-trained models (epoch 41 for MSVD, epoch 12 for MSR-VTT):

wget https://s06.imfd.cl/04/github-data/SemSynAN/MSVD/captioning_chkpt_41.pt -P pretrain/MSVD
wget https://s06.imfd.cl/04/github-data/SemSynAN/MSR-VTT/captioning_chkpt_12.pt -P pretrain/MSR-VTT

2. Generate captions for the test samples:

python test.py -chckpt pretrain/MSVD/captioning_chkpt_41.pt -data data/MSVD/ -out results/MSVD/
python test.py -chckpt pretrain/MSR-VTT/captioning_chkpt_12.pt -data data/MSR-VTT/ -out results/MSR-VTT/

3. Compute the evaluation metrics:

python evaluate.py -gen results/MSVD/predictions.txt -ref data/MSVD/test_references.txt
python evaluate.py -gen results/MSR-VTT/predictions.txt -ref data/MSR-VTT/test_references.txt
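
For reference, the sketch below computes the same metrics with the pycocoevalcap scorers. It assumes both files contain tab-separated video_id and caption pairs per line (multiple references per video, one prediction per video); the actual formats expected by evaluate.py may differ.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def load_captions(path):
    # Assumed format: "video_id<TAB>caption" per line.
    caps = {}
    with open(path) as f:
        for line in f:
            vid, caption = line.rstrip("\n").split("\t", 1)
            caps.setdefault(vid, []).append(caption)
    return caps

gts = load_captions("data/MSVD/test_references.txt")    # all references per video
res = {vid: c[:1] for vid, c in load_captions("results/MSVD/predictions.txt").items()}

for name, scorer in [("BLEU-4", Bleu(4)), ("METEOR", Meteor()),
                     ("CIDEr", Cider()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(gts, res)
    # Bleu returns a list of BLEU-1..4 scores; the other scorers return a single value.
    print(name, score[-1] if isinstance(score, list) else score)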

Qualitative Results

qualitative results

Quantitative Results

Dataset | epoch | BLEU-4 | METEOR | CIDEr | ROUGE-L
MSVD    | 100   | 64.4   | 41.9   | 111.5 | 79.5
MSR-VTT | 60    | 46.4   | 30.4   | 51.9  | 64.7

Citation

@InProceedings{Perez-Martin_2021_WACV,
    author    = {Perez-Martin, Jesus and Bustos, Benjamin and Perez, Jorge},
    title     = {Improving Video Captioning With Temporal Composition of a Visual-Syntactic Embedding},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {3039-3049}
}