LuoweiZhou / densecap

License: BSD-3-Clause
Dense video captioning in PyTorch

End-to-End Dense Video Captioning with Masked Transformer

This is the source code for our paper End-to-End Dense Video Captioning with Masked Transformer. It mainly supports dense video captioning on generated segments. To generate captions on ground-truth (GT) segments, please refer to our new GVD repo and to our notes.

Requirements (Recommended)

  1. Miniconda3 for Python 3.6

  2. CUDA 9.2 and cuDNN v7.1

  3. PyTorch 0.4.0. Follow the instructions to install PyTorch and torchvision.

  4. Install other required modules (e.g., torchtext):

pip install -r requirements.txt

Optional: If you would like to use visdom to track training, run pip install visdom

Optional: If you would like to use the spacy tokenizer, run pip install spacy

Note: The code has been tested on a variety of GPUs, including the 1080 Ti, Titan Xp, P100, and V100. However, the latest RTX GPUs (e.g., 2080 Ti) require CUDA 10.0 and hence PyTorch 1.0, so the code would need to be upgraded to PyTorch 1.0 to run on them.

Data Preparation

Annotation and feature

For ActivityNet, download the re-formatted annotation files from here, decompress them and place them under the directory data. The frame-wise appearance (with suffix _resnet.npy) and motion (with suffix _bn.npy) feature files for each split are available [train (27.7GB), val (13.7GB), test (13.6GB)] and should be decompressed and placed under your dataset directory (referred to as feature_root in the configuration files).

Similarly for YouCook2, the annotation files are available here and should be placed under data. The feature files are [train (9.6GB), val (3.2GB), test (1.5GB)].
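After decompressing the feature archives, it can be worth checking that every video has both feature files. The following is a minimal sketch, not part of the repo: it assumes each file is named with a video ID followed by one of the two suffixes mentioned above, and that the files of one split sit in a single directory. The function name find_incomplete_videos is hypothetical.

```python
import os
from collections import defaultdict

APPEARANCE_SUFFIX = "_resnet.npy"  # frame-wise appearance features
MOTION_SUFFIX = "_bn.npy"          # frame-wise motion features


def find_incomplete_videos(feature_dir):
    """Return sorted video IDs that are missing one of the two feature files.

    Assumption: files are named <video_id><suffix> and live directly
    inside feature_dir. Files with other names are ignored.
    """
    seen = defaultdict(set)
    for name in os.listdir(feature_dir):
        for suffix in (APPEARANCE_SUFFIX, MOTION_SUFFIX):
            if name.endswith(suffix):
                video_id = name[:-len(suffix)]
                seen[video_id].add(suffix)
    return sorted(vid for vid, suffixes in seen.items() if len(suffixes) < 2)
```

Calling find_incomplete_videos on a split directory under your feature_root returns an empty list when every video has both an appearance and a motion file.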

You can also extract the features on your own with this code. Note that ActivityNet was processed with an older version of the repo, while YouCook2 was processed with the latest code, which has a minor change in the sampling approach. This accounts for the difference in the formulation of the frame_to_second conversion.

Evaluation scripts

Download the dense video captioning evaluation scripts and place them under the tools directory. Make sure you clone the repo recursively. Our code is equivalent to the official evaluation code from the ActivityNet 2017 Challenge, but faster. Note that the current evaluation scripts include fixes for a few major bugs, made for the ActivityNet 2018 Challenge.

The evaluation script for event proposals can be found under tools.

Training and Validation

First, set the paths in configuration files (under cfgs) to your own data and feature directories. Create new directories log and results under the root directory to save log and result files.
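The directory setup described above can be scripted. This is a small sketch using only the standard library. The helper name make_output_dirs is hypothetical, and root is assumed to be the root directory of your checkout.

```python
import os


def make_output_dirs(root=".", names=("log", "results")):
    """Create the output directories under root and return their paths.

    Directories that already exist are left untouched.
    """
    created = []
    for name in names:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Running make_output_dirs() from the repo root creates ./log and ./results. Running it again is harmless.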

Example commands for running a 4-GPU distributed data parallel job (for ActivityNet):

For Masked Transformer:

CUDA_VISIBLE_DEVICES=0 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight | tee log/$id-0 &
CUDA_VISIBLE_DEVICES=1 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight | tee log/$id-1 &
CUDA_VISIBLE_DEVICES=2 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight | tee log/$id-2 &
CUDA_VISIBLE_DEVICES=3 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight | tee log/$id-3

For End-to-End Masked Transformer:

CUDA_VISIBLE_DEVICES=0 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight --mask_weight $mask_weight --gated_mask | tee log/$id-0 &
CUDA_VISIBLE_DEVICES=1 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight --mask_weight $mask_weight --gated_mask | tee log/$id-1 &
CUDA_VISIBLE_DEVICES=2 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight --mask_weight $mask_weight --gated_mask | tee log/$id-2 &
CUDA_VISIBLE_DEVICES=3 python3 scripts/train.py --dist_url $dist_url --cfgs_file $cfgs_file \
    --checkpoint_path ./checkpoint/$id --batch_size $batch_size --world_size 4 \
    --cuda --sent_weight $sent_weight --mask_weight $mask_weight --gated_mask | tee log/$id-3

Arguments: batch_size=14, mask_weight=1.0, sent_weight=0.25, cfgs_file='cfgs/anet.yml', dist_url='file:///home/luozhou/nonexistent_file' (replace with a path in your own directory); id is the model name.
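The four per-GPU commands differ only in the GPU index and the log file name, so they can be generated rather than typed out. The sketch below is one possible way to do that and is not part of the repo: build_train_cmd and launch are hypothetical helpers, and only flags that appear in the commands above are used. Unlike tee, launch writes the output to the log file only and does not echo it to the terminal.

```python
import os
import subprocess


def build_train_cmd(cfgs_file, dist_url, model_id, batch_size, sent_weight,
                    world_size=4, mask_weight=None):
    """Build the argument list for one scripts/train.py process.

    Passing mask_weight selects the End-to-End Masked Transformer variant,
    which adds --mask_weight and --gated_mask.
    """
    cmd = [
        "python3", "scripts/train.py",
        "--dist_url", dist_url,
        "--cfgs_file", cfgs_file,
        "--checkpoint_path", "./checkpoint/" + model_id,
        "--batch_size", str(batch_size),
        "--world_size", str(world_size),
        "--cuda",
        "--sent_weight", str(sent_weight),
    ]
    if mask_weight is not None:
        cmd += ["--mask_weight", str(mask_weight), "--gated_mask"]
    return cmd


def launch(cmd, model_id, world_size=4):
    """Start one process per GPU, wait for all of them, return exit codes.

    Each process sees a single GPU through CUDA_VISIBLE_DEVICES and writes
    its standard output to log/<model_id>-<gpu>.
    """
    procs, logs = [], []
    for gpu in range(world_size):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        log = open("log/{}-{}".format(model_id, gpu), "w")
        logs.append(log)
        procs.append(subprocess.Popen(cmd, env=env, stdout=log))
    codes = [p.wait() for p in procs]
    for log in logs:
        log.close()
    return codes
```

For example, build_train_cmd("cfgs/anet.yml", dist_url, model_id, 14, 0.25, mask_weight=1.0) yields the end-to-end command, which can then be passed to launch with the same world_size.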

For the YouCook2 dataset, simply replace cfgs/anet.yml with cfgs/yc2.yml. To monitor training (e.g., training and validation losses), start the visdom server by running visdom in the background (e.g., in a tmux session). Then add --enable_visdom as a command argument.

Note that at least 15 GB of free RAM is required for training. The nonexistent_file is normally cleaned up automatically, but may need to be deleted manually if it is not. For more about distributed data parallel, see here (0.4.0). You can also run the code on a single GPU by setting world_size=1.
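If a previous run left the rendezvous file behind, it can be removed before the next launch. The following is a minimal sketch, assuming dist_url is a file:// URL pointing at a plain POSIX path with no percent-encoded characters. The helper name remove_stale_dist_file is hypothetical. Only call it when no training job is using that file.

```python
import os
from urllib.parse import urlparse


def remove_stale_dist_file(dist_url):
    """Delete the file behind a file:// dist_url if it exists.

    Returns True if a file was removed and False if there was nothing to
    remove. Raises ValueError for URLs that do not use the file scheme.
    """
    parsed = urlparse(dist_url)
    if parsed.scheme != "file":
        raise ValueError("expected a file:// URL, got: " + dist_url)
    path = parsed.path
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```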

For legacy reasons, we store the feature files as individual .npy files, which causes latency in data loading and hence instability during distributed model training. By default, num_workers is set to 1. It can be raised to as much as 6 for faster data loading. However, if you encounter any data parallel issue, try setting it to 0.

Pre-trained Models

The pre-trained models can be downloaded from here (1GB). Make sure you decompress the file under the checkpoint directory (create one under the root directory if it does not exist).

Testing

For Masked Transformer (id=anet-2L-gt-mask):

python3 scripts/test.py --cfgs_file $cfgs_file --densecap_eval_file ./tools/densevid_eval/evaluate.py \
    --batch_size 1 --start_from ./checkpoint/$id/model_epoch_$epoch.t7 --id $id-$epoch \
    --val_data_folder $split --cuda | tee log/eval-$id-epoch$epoch

For End-to-End Masked Transformer (id=anet-2L-e2e-mask):

python3 scripts/test.py --cfgs_file $cfgs_file --densecap_eval_file ./tools/densevid_eval/evaluate.py \
    --batch_size 1 --start_from ./checkpoint/$id/model_epoch_$epoch.t7 --id $id-$epoch \
    --val_data_folder $split --learn_mask --gated_mask --cuda | tee log/eval-$id-epoch$epoch

Arguments: epoch=19, split='validation', cfgs_file='cfgs/anet.yml'
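To make it explicit how id, epoch and split expand into the checkpoint path, the run ID and the log file name, the two test commands can be assembled as below. This is an illustrative sketch only: build_test_cmd is a hypothetical helper and uses just the flags shown in the commands above.

```python
def build_test_cmd(model_id, epoch, split, cfgs_file, end_to_end=False):
    """Return (argument list, log path) for one scripts/test.py run.

    Set end_to_end=True for the End-to-End Masked Transformer, which adds
    --learn_mask and --gated_mask.
    """
    start_from = "./checkpoint/{}/model_epoch_{}.t7".format(model_id, epoch)
    cmd = [
        "python3", "scripts/test.py",
        "--cfgs_file", cfgs_file,
        "--densecap_eval_file", "./tools/densevid_eval/evaluate.py",
        "--batch_size", "1",  # the evaluation script only supports 1
        "--start_from", start_from,
        "--id", "{}-{}".format(model_id, epoch),
        "--val_data_folder", split,
    ]
    if end_to_end:
        cmd += ["--learn_mask", "--gated_mask"]
    cmd.append("--cuda")
    log_path = "log/eval-{}-epoch{}".format(model_id, epoch)
    return cmd, log_path
```

With model_id="anet-2L-e2e-mask", epoch=19 and split="validation", the checkpoint path becomes ./checkpoint/anet-2L-e2e-mask/model_epoch_19.t7 and the log path becomes log/eval-anet-2L-e2e-mask-epoch19.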

This gives you the language evaluation results on the validation set. You need at least 8GB of free GPU memory for the evaluation. The current evaluation script only supports batch_size=1 and is slow (1 hr for YouCook2 and 4 hr for ActivityNet). We actively welcome pull requests.

Leaderboard (for the test set)

The official evaluation servers are available under ActivityNet and YouCook2. Note that the NEW evaluation scripts from ActivityNet 2018 Challenge are used in both cases.

Notes

We use a different code base for captioning-only models (dense captioning on GT segments). Please contact the authors for details. Note that this code base can potentially be used for that purpose if you feed GT segments, rather than the generated segments, into the captioning module. However, there is no guarantee that this reproduces the results from the paper. You can also refer to this implementation, where you need to set --att_model to 'transformer'.

Citation

@inproceedings{zhou2018end,
  title={End-to-End Dense Video Captioning with Masked Transformer},
  author={Zhou, Luowei and Zhou, Yingbo and Corso, Jason J and Socher, Richard and Xiong, Caiming},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={8739--8748},
  year={2018}
}