
jayleicn / TVCaption

License: MIT
[ECCV 2020] PyTorch code for MMT (a multimodal transformer captioning model) on the TVC dataset

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to TVCaption

MTL-AQA
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment [CVPR 2019]
Stars: ✭ 38 (-48.65%)
Mutual labels:  video-captioning
Video2Language
Generating video descriptions using deep learning in Keras
Stars: ✭ 22 (-70.27%)
Mutual labels:  video-captioning
densecap
Dense video captioning in PyTorch
Stars: ✭ 37 (-50%)
Mutual labels:  video-captioning
visual syntactic embedding video captioning
Source code of the paper titled *Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding*
Stars: ✭ 23 (-68.92%)
Mutual labels:  video-captioning
Video-Cap
🎬 Video Captioning: ICCV '15 paper implementation
Stars: ✭ 44 (-40.54%)
Mutual labels:  video-captioning
delving-deeper-into-the-decoder-for-video-captioning
Source code for Delving Deeper into the Decoder for Video Captioning
Stars: ✭ 36 (-51.35%)
Mutual labels:  video-captioning
Awesome-Captioning
A curated list of multimodal captioning research (including image captioning, video captioning, and text captioning)
Stars: ✭ 56 (-24.32%)
Mutual labels:  video-captioning

TVCaption

PyTorch implementation of MultiModal Transformer (MMT), a method for multimodal (video + subtitle) captioning.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

TVC Dataset and Task

We extended TVR by collecting extra captions for each annotated moment. The resulting dataset, TV show Captions (TVC), is a large-scale multimodal video captioning dataset containing 262K captions paired with 108K moments. We show annotated captions and model-generated captions below. Similar to TVR, the TVC task requires systems to gather information from both video and subtitle to generate relevant descriptions. (Figure: TVC example)
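
For intuition, each TVC annotation pairs a moment (a video segment plus the subtitles that overlap it) with several human-written captions. The entry below is purely illustrative; the field names, identifiers, and captions are hypothetical and do not reflect the released annotation format.

# Illustrative only: hypothetical field names, not the actual TVC release format.
example_annotation = {
    "clip_id": "show_s01e01_clip_042",     # hypothetical clip identifier
    "moment": [45.2, 56.8],                # start/end of the annotated moment, in seconds
    "captions": [
        "A man hands a woman a cup of coffee while they talk at her desk.",
        "Someone brings coffee over and discusses the case.",
    ],
}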

Method: MultiModal Transformer (MMT)

We designed a MultiModal Transformer (MMT) captioning model that follows the classical encoder-decoder transformer architecture. It takes both video and subtitle as encoder inputs and generates captions with the decoder. A rough sketch of this setup is shown below.
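
The sketch below is only an illustration of this encoder-decoder setup, not the repository's actual model code: it projects video features, embeds subtitle tokens, concatenates them into a single encoder input, and decodes caption tokens with a causal mask. The dimensions, layer counts, and the assumed ResNet+I3D feature size of 3072 are arbitrary choices for the example, and it uses nn.Transformer from a newer PyTorch than the version pinned in the prerequisites below.

import torch
import torch.nn as nn

class ToyMMT(nn.Module):
    """Minimal sketch of a multimodal encoder-decoder transformer (illustrative only)."""
    def __init__(self, vid_feat_dim=3072, vocab_size=8000, d_model=512):
        super().__init__()
        self.vid_proj = nn.Linear(vid_feat_dim, d_model)    # project video features to model size
        self.word_emb = nn.Embedding(vocab_size, d_model)   # shared by subtitle and caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, vid_feats, sub_tokens, cap_tokens):
        # Encoder input: video feature sequence followed by subtitle word embeddings.
        enc_in = torch.cat([self.vid_proj(vid_feats), self.word_emb(sub_tokens)], dim=1)
        # Decoder attends to the multimodal memory; a causal mask keeps generation left-to-right.
        causal = self.transformer.generate_square_subsequent_mask(cap_tokens.size(1))
        hidden = self.transformer(enc_in, self.word_emb(cap_tokens), tgt_mask=causal)
        return self.out(hidden)   # (batch, caption_len, vocab_size) logits

# Toy forward pass with random inputs: 2 clips, 10 video segments, 20 subtitle tokens, 12 caption tokens.
model = ToyMMT()
logits = model(torch.randn(2, 10, 3072),
               torch.randint(0, 8000, (2, 20)),
               torch.randint(0, 8000, (2, 12)))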

Resources

Getting started

Prerequisites

  1. Clone this repository
git clone --recursive https://github.com/jayleicn/TVCaption.git
cd TVCaption
  2. Prepare feature files. Download tvc_feature_release.tar.gz (23GB). After downloading the file, extract it to the data directory.
tar -xf path/to/tvc_feature_release.tar.gz -C data

You should see video_feature under the data/tvc_feature_release directory. It contains video features (ResNet, I3D, ResNet+I3D); these are the same video features we used for TVR/XML. Read the code to learn the details of how the features are extracted: video feature extraction. A quick way to inspect the extracted files is sketched below.
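
If you want to sanity-check the download, the feature files can presumably be opened with h5py (installed in the next step and listed as a dependency). The file name and structure below are hypothetical; adjust them to whatever you actually find under data/tvc_feature_release/video_feature.

import h5py

# Hypothetical file name; list data/tvc_feature_release/video_feature to find the real ones.
with h5py.File("data/tvc_feature_release/video_feature/tvc_resnet_i3d.h5", "r") as f:
    clip_id = next(iter(f.keys()))        # first clip id stored in the file
    print(clip_id, f[clip_id].shape)      # expected: (num_segments, feature_dim)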

  3. Install dependencies:
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • h5py
  • tensorboardX
  4. Add the project root to PYTHONPATH
source setup.sh

Note that you need to do this each time you start a new session.
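
If sourcing a shell script is inconvenient (e.g., inside a notebook), the same effect can be approximated from Python. This is only a rough stand-in for what setup.sh presumably does (putting the repository root on PYTHONPATH), not its actual contents.

import sys
from pathlib import Path

# Make the repository root importable for the current process only; the path is a placeholder.
repo_root = Path("/path/to/TVCaption").resolve()
sys.path.insert(0, str(repo_root))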

Training and Inference

  1. Build Vocabulary
bash baselines/transformer_captioning/scripts/build_vocab.sh

Running this command builds the vocabulary file cache/tvc_word2idx.json from the TVC train set. The general idea is sketched below.
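
Conceptually, the vocabulary is just a word-to-index mapping built from the training captions. The sketch below shows that idea only; the actual script's tokenization, frequency threshold, and special tokens may differ.

import json
from collections import Counter

def build_word2idx(captions, min_count=2):
    # Count whitespace tokens over all training captions (toy tokenization).
    counter = Counter(word for cap in captions for word in cap.lower().split())
    # Hypothetical special tokens; the real vocabulary file may use different ones.
    word2idx = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
    for word, count in counter.most_common():
        if count >= min_count:
            word2idx[word] = len(word2idx)
    return word2idx

word2idx = build_word2idx(["a man opens the door", "a woman opens a letter", "a man reads a letter"])
print(json.dumps(word2idx, indent=2))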

  2. MMT training
bash baselines/multimodal_transformer/scripts/train.sh CTX_MODE VID_FEAT_TYPE

CTX_MODE refers to the context to use (video, sub, or video_sub); VID_FEAT_TYPE is the video feature type (resnet, i3d, or resnet_i3d).

Below is an example of training MMT with both video and subtitle, where we use the concatenation of ResNet and I3D features for video.

bash baselines/multimodal_transformer/scripts/train.sh video_sub resnet_i3d

This code loads all the data (~30GB) into RAM to speed up training; use --no_core_driver to disable this behavior.
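
For context, loading an entire HDF5 file into memory is exactly what h5py's "core" driver does. The flag name --no_core_driver suggests this is the mechanism being toggled, but treat that mapping, and the file name below, as assumptions.

import h5py

path = "data/tvc_feature_release/video_feature/tvc_resnet_i3d.h5"   # hypothetical file name

f_disk = h5py.File(path, "r")                  # default driver: reads from disk on demand, low RAM use
f_ram = h5py.File(path, "r", driver="core")    # "core" driver: whole file loaded into RAM, faster access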

Training with the above config stops at around epoch 22 and takes roughly 3 hours on a single 2080Ti GPU. You should get ~45.0 CIDEr-D and ~10.5 BLEU@4 on the val split. The resulting model and config are saved in a directory named baselines/multimodal_transformer/results/video_sub-res-*

  3. MMT inference. After training, you can run inference with the saved model on the val or test_public split:
bash baselines/multimodal_transformer/scripts/translate.sh MODEL_DIR_NAME SPLIT_NAME

MODEL_DIR_NAME is the name of the directory containing the saved model, e.g., video_sub-res-*. SPLIT_NAME can be val or test_public.

Evaluation and Submission

We only release ground truth for the train and val splits. To get results on the test_public split, please submit your results following the instructions in standalone_eval/README.md
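
Since val ground truth is released, you can also sanity-check val predictions locally with the coco-caption scorers this project builds on (see Acknowledgement). The snippet below is a minimal sketch using the pycocoevalcap package; the standalone_eval instructions remain the authoritative way to produce official numbers, and the generic Cider scorer here may differ slightly from the CIDEr-D reported above.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers expect dicts mapping an example id to a list of caption strings.
gts = {"clip_0": ["a man opens the door", "someone walks into the room"]}   # reference captions
res = {"clip_0": ["a man walks through the door"]}                          # one prediction per id

bleu, _ = Bleu(4).compute_score(gts, res)     # returns BLEU@1..4
cider, _ = Cider().compute_score(gts, res)
print("BLEU@4: %.4f  CIDEr: %.4f" % (bleu[3], cider))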

Citations

If you find this code useful for your research, please cite our paper:

@inproceedings{lei2020tvr,
  title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
  author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},
  booktitle={ECCV},
  year={2020}
}

Acknowledgement

This research is supported by grants and awards from NSF, DARPA, ARO and Google.

This code borrows components from the following projects: recurrent-transformer, OpenNMT-py, transformers, and coco-caption. We thank the authors for open-sourcing these great projects!

Contact

jielei [at] cs.unc.edu
