JunweiLiang / Object_detection_tracking

Licence: mit
Out-of-the-box code and models for CMU's object detection and tracking system for surveillance videos. Speed optimized Faster-RCNN model. Tensorflow based. Also supports EfficientDet. WACVW'20

CMU Object Detection & Tracking for Surveillance Video Activity Detection

This repository contains the code and models for object detection and tracking from the CMU DIVA system. Our system (INF & MUDSML) achieves the best performance on the ActEV leaderboard (Cached).

If you find this code useful in your research, please cite:

@inproceedings{chen2019minding,
  title={Minding the Gaps in a Video Action Analysis Pipeline},
  author={Chen, Jia and Liu, Jiang and Liang, Junwei and Hu, Ting-Yao and Ke, Wei and Barrios, Wayner and Huang, Dong and Hauptmann, Alexander G},
  booktitle={2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)},
  pages={41--46},
  year={2019},
  organization={IEEE}
}
@inproceedings{liu2020wacv,
  author = {Liu, Wenhe and Kang, Guoliang and Huang, Po-Yao and Chang, Xiaojun and Qian, Yijun and Liang, Junwei and Gui, Liangke and Wen, Jing and Chen, Peng},
  title = {Argus: Efficient Activity Detection System for Extended Video Analysis},
  booktitle = {The IEEE Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month = {March},
  year = {2020}
}

Introduction

We apply state-of-the-art object detection and tracking algorithms to surveillance videos. Our best object detection model uses Faster RCNN with a Resnet-101 backbone with dilated CNN and FPN. The tracking algorithm (Deep SORT) uses ROI features from the object detection model. The ActEV trained models are good for small object detection in outdoor scenes. For indoor cameras, COCO trained models are better.

Updates

  • [12/2020] Added multi-thread inferencing, another ~25% speed up.

  • [12/2020] Added multiple-image batch inferencing, ~30% speed up.

  • [10/2020] Added experiments comparing EfficientDet and MaskRCNN on VIRAT and AVA-Kinetics here.

  • [05/2020] Added EfficientDet (CVPR 2020) for inferencing. The D7 model is reported to be more than 12 mAP better than the Resnet-50 FPN model we used. Modified to be more efficient and tested with Python 2 & 3 and TF 1.15. See example commands and notes here.

  • [02/2020] We used Resnet-50 FPN model trained on MS-COCO for MEVA activity detection and got a competitive pAUDC of 0.49 on the leaderboard with a total processing speed of 0.64x real-time on a 4-GPU machine. The object detection module's processing speed is about 0.125x real-time. [Frozen Model] [Example Command]

  • [01/2020] We discovered a problem with using OpenCV to extract frames from avi videos. Some avi videos contain duplicate frames that are not physically present in the file; they exist only as instructions to repeat the previous frame. OpenCV skips these frames without warning, according to this bug report and here. As a result, OpenCV may return fewer frames, which makes the frame indices of the detection results incorrect. Solutions: 1. convert the avi videos to mp4 format; 2. use the MoviePy or PyAV loader, which are 10% ~ 30% slower than OpenCV frame extraction. See obj_detect_tracking.py for the implementation.
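
One way to check whether a video is affected by the OpenCV issue above is to count the frames OpenCV actually decodes and compare that with a count from another loader. The sketch below is illustrative only: both helper names are hypothetical and not part of this repo, and the reference count is assumed to come from a decoder such as PyAV or MoviePy.

```python
def count_opencv_frames(video_path):
    """Count the frames that OpenCV actually decodes from a video file."""
    import cv2  # imported lazily; requires the OpenCV Python bindings

    cap = cv2.VideoCapture(video_path)
    num_frames = 0
    while True:
        ok, _ = cap.read()
        if not ok:  # end of stream (or a read failure)
            break
        num_frames += 1
    cap.release()
    return num_frames


def num_skipped_frames(opencv_count, reference_count):
    """Frames seen by the reference decoder but missing from OpenCV's decode."""
    return max(reference_count - opencv_count, 0)
```

A non-zero result from num_skipped_frames suggests converting that video to mp4 or switching loaders before trusting the frame indices.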

Dependencies

The latest inferencing code is tested with Tensorflow-GPU==1.15 and Python 2/3.

Other dependencies: numpy; scipy; sklearn; cv2; matplotlib; pycocotools

Code Overview

  • obj_detect_tracking.py: Inference code for object detection & tracking.
  • models.py: Main model definition.
  • nn.py: Some layer definitions.
  • main.py: Code I used for training and testing experiments.
  • eval.py: Code I used for getting mAP/mAR.
  • vis_json.py: Visualizes the json outputs.
  • get_frames_resize.py: Code for extracting frames from videos.
  • utils.py: Helper classes, e.g. for moving averages of losses and for GPU usage.

Inferencing

  1. First download some test videos and the v3 model (v4-v6 models are un-verified models as we don't have a test set with ground truth):
$ wget https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/v1-val_testvideos.tgz
$ tar -zxvf v1-val_testvideos.tgz
$ ls v1-val_testvideos > v1-val_testvideos.lst
$ wget https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/obj_v3_model.tgz
$ tar -zxvf obj_v3_model.tgz
  2. Run object detection & tracking on the test videos:
$ python obj_detect_tracking.py --model_path obj_v3_model --version 3 --video_dir v1-val_testvideos \
--video_lst_file v1-val_testvideos.lst --frame_gap 1 --get_tracking \
--tracking_dir test_track_out

To get the object detection output in COCO json format, add --out_dir test_json_out. To get the bounding box visualization, add --visualize --vis_path test_vis_out. To speed things up, try --frame_gap 8; the tracks between detection frames will then be linearly interpolated. The tracking results are written to test_track_out/ in MOTChallenge format.
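
The linear interpolation between detection frames can be sketched as follows. This is a minimal illustration, not the code in obj_detect_tracking.py: the function names and the [x, y, w, h] box layout are assumptions made for the example.

```python
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate an [x, y, w, h] box between two detection frames."""
    if frame_b <= frame_a:
        raise ValueError("frame_b must be later than frame_a")
    t = (frame - frame_a) / float(frame_b - frame_a)
    return [a + t * (b - a) for a, b in zip(box_a, box_b)]


def fill_track(keyframes):
    """Fill the gaps of one track.

    keyframes maps a detection frame index to its [x, y, w, h] box.
    Returns a dict with a box for every frame from the first to the last keyframe.
    """
    frames = sorted(keyframes)
    dense = {}
    for frame_a, frame_b in zip(frames, frames[1:]):
        for frame in range(frame_a, frame_b):
            dense[frame] = interpolate_box(
                keyframes[frame_a], keyframes[frame_b], frame_a, frame_b, frame)
    dense[frames[-1]] = list(keyframes[frames[-1]])
    return dense


# A track detected at frames 0 and 8 (as with --frame_gap 8):
track = fill_track({0: [0, 0, 10, 10], 8: [80, 40, 10, 10]})
print(track[4])  # -> [40.0, 20.0, 10.0, 10.0]
```

A larger frame gap trades localization accuracy on the skipped frames for speed, since objects rarely move in a perfectly straight line between detections.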

To run with EfficientDet models, download a checkpoint from the official repo or my-d0-snapshot. Then run with --is_efficientdet and --efficientdet_modelname efficientdet-d0.

  3. You can also run inferencing with a frozen graph (see this for instructions on how to pack the model). Change to --model_path obj_v3.pb and add --is_load_from_pb. It is about 30% faster. For running on the MEVA dataset (avi videos & indoor scenes) or with EfficientDet models, see examples here.

  4. You can also run object detection on a list of images. Suppose you have a file list imgs.lst with absolute paths to images. Run with the COCO trained MaskRCNN model:

# get model from Tensorpack
$ wget http://models.tensorpack.com/FasterRCNN/COCO-MaskRCNN-R101FPN1x.npz
$ python obj_detect_imgs.py --model_path COCO-MaskRCNN-R101FPN1x.npz --version 2 \
--img_lst imgs.lst --out_dir detection_out_maskrcnn --max_size 480 \
--short_edge_size 320 --is_coco_model --visualize --vis_path detection_vis_maskrcnn

Adjust the image input size as you wish. Run with COCO trained EfficientDet model:

$ wget https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco/efficientdet-d0.tar.gz; tar -zxvf efficientdet-d0.tar.gz
$ python obj_detect_imgs.py --model_path efficientdet-d0/ --version 2 \
--is_efficientdet --efficientdet_modelname efficientdet-d0 --img_lst imgs.lst \
 --out_dir detection_out_d0 --max_size 480 --short_edge_size 320 --is_coco_model \
 --visualize --vis_path detection_vis_d0
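
When choosing --short_edge_size and --max_size, it helps to know the resulting input resolution. The sketch below assumes the two flags follow the common Faster RCNN convention (as in Tensorpack's example): scale the image so its shorter edge equals short_edge_size, then shrink further if the longer edge would exceed max_size. The function name is hypothetical, and the exact behavior of obj_detect_imgs.py has not been verified here.

```python
def resized_shape(height, width, short_edge_size, max_size):
    """Return (new_height, new_width), keeping the aspect ratio.

    Assumed convention: the shorter edge is scaled to short_edge_size,
    unless that would make the longer edge exceed max_size.
    """
    scale = short_edge_size / float(min(height, width))
    new_h, new_w = height * scale, width * scale
    if max(new_h, new_w) > max_size:
        shrink = max_size / float(max(new_h, new_w))
        new_h, new_w = new_h * shrink, new_w * shrink
    return int(new_h + 0.5), int(new_w + 0.5)


# A 1920x1080 frame with --max_size 480 --short_edge_size 320:
print(resized_shape(1080, 1920, short_edge_size=320, max_size=480))  # -> (270, 480)
```

Under this assumption, max_size wins for wide images, so a 16:9 frame ends up with a short edge of 270 pixels rather than 320.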

Visualization

To visualize the tracking results:

# Put "Person/Vehicle" tracks visualization into the same video
$ ls $PWD/v1-val_testvideos/* > v1-val_testvideos.abs.lst
$ python get_frames_resize.py v1-val_testvideos.abs.lst v1-val_testvideos_frames/ --use_2level
$ python tracks_to_json.py test_track_out/ v1-val_testvideos.abs.lst test_track_out_json
$ python vis_json.py v1-val_testvideos.abs.lst v1-val_testvideos_frames/ test_track_out_json/ test_track_out_vis
# then use ffmpeg to make videos
$ ffmpeg -framerate 30 -i test_track_out_vis/VIRAT_S_000205_05_001092_001124/VIRAT_S_000205_05_001092_001124_F_%08d.jpg vis_video.mp4

Now you have the tracking visualization videos for both the "Person" and "Vehicle" classes.
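
To inspect the raw tracking output without rendering videos, the MOTChallenge-style text files can be read directly. The sketch below assumes the standard MOTChallenge column order (frame, track id, left, top, width, height, followed by further columns such as confidence); the function name is hypothetical and the exact files written under test_track_out/ may differ.

```python
import csv
from collections import defaultdict


def read_mot_tracks(lines):
    """Group MOTChallenge-style rows by track id.

    lines is any iterable of comma-separated text lines, e.g. an open file.
    Returns {track_id: [(frame, left, top, width, height), ...]} sorted by frame.
    """
    tracks = defaultdict(list)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        frame, track_id = int(float(row[0])), int(float(row[1]))
        left, top, width, height = (float(value) for value in row[2:6])
        tracks[track_id].append((frame, left, top, width, height))
    for boxes in tracks.values():
        boxes.sort()
    return dict(tracks)


sample = [
    "9,1,18,20,30,40,0.9,-1,-1,-1",
    "1,1,10,20,30,40,0.9,-1,-1,-1",
    "1,2,5,5,10,10,0.8,-1,-1,-1",
]
for track_id, boxes in sorted(read_mot_tracks(sample).items()):
    print(track_id, len(boxes), "boxes, first frame", boxes[0][0])
```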

Multiple-Image Batch Inferencing

  1. First download some test videos:
$ wget https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/meva_outdoor_test.tgz
$ tar -zxvf meva_outdoor_test.tgz
$ ls meva_outdoor_test > meva_outdoor_test.lst
  2. Get the COCO-trained MaskRCNN model from Tensorpack:
$ wget http://models.tensorpack.com/FasterRCNN/COCO-MaskRCNN-R50FPN2x.npz
  3. Run object detection & tracking on the test videos with batch_size=8 code:
$ python obj_detect_tracking_multi.py --model_path COCO-MaskRCNN-R50FPN2x.npz --version 2 \
--video_dir meva_outdoor_test --video_lst_file meva_outdoor_test.lst --frame_gap 8 \
--get_tracking --tracking_dir fpnr50_multib4_trackout_1280x720 --gpuid_start 0 --max_size \
1280 --short_edge_size 720 --is_coco --use_lijun --im_batch_size 8 --log

This should be ~30% faster than the original batch_size=1 code:

$ python obj_detect_tracking.py --model_path COCO-MaskRCNN-R50FPN2x.npz --version 2 \
--video_dir meva_outdoor_test --video_lst_file meva_outdoor_test.lst --frame_gap 8 \
--get_tracking --tracking_dir fpnr50_b1_trackout_1280x720 --gpuid_start 0 --max_size 1280 \
--short_edge_size 720 --is_coco --use_lijun --im_batch_size 1 --log

You can visualize the results according to these instructions. Speed experiments are recorded here.
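
The idea behind --im_batch_size is to group several frames into a single forward pass. A minimal sketch of the grouping step is shown below; the generator name is hypothetical, and the actual script may treat the final, smaller batch differently (for example by padding it).

```python
def batch_frames(frames, im_batch_size):
    """Yield lists of up to im_batch_size consecutive frames."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == im_batch_size:
            yield batch
            batch = []
    if batch:  # the last batch may be smaller than im_batch_size
        yield batch


# 20 frames with a batch size of 8 give batches of 8, 8 and 4 frames.
print([len(b) for b in batch_frames(range(20), 8)])  # -> [8, 8, 4]
```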

Multi-Thread Inferencing

Added a queue and multi-threading to run CPU and GPU work in parallel. Run object detection & tracking with multi-thread processing on videos:

$ python obj_detect_tracking_multi_queuer.py --model_path COCO-MaskRCNN-R50FPN2x.npz --version 2 \
--video_dir meva_outdoor_test --video_lst_file meva_outdoor_test.lst --frame_gap 8 \
--get_tracking --tracking_dir fpnr50_multib8thread_trackout_1280x720 --gpuid_start 0 --max_size \
1280 --short_edge_size 720 --is_coco --use_lijun --im_batch_size 8 --log --prefetch 10

This should be 20-30% faster than single-thread. Speed experiments are recorded here.

For object detection on a list of images, we can have a lot more threads, similar to PyTorch's DataLoader:

$ python obj_detect_imgs_multi_queuer.py --model_path COCO-MaskRCNN-R50FPN2x.npz --version 2 \
--resnet50 --img_lst imgs.lst --out_dir obj_jsons/ --max_size 1920 --short_edge_size 1080 \
--is_coco_model --im_batch_size 8 --log --prefetch 10 --num_cpu_worker 4
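
The queue-based overlap of CPU and GPU work can be illustrated with a small producer/consumer sketch. It is written for Python 3 and uses only the standard library; the names prefetching and load_fn are hypothetical, and this is not the implementation used by the *_multi_queuer.py scripts. A background thread prepares items (standing in for frame decoding and resizing) while the main thread consumes them (standing in for the GPU forward pass), with at most prefetch items buffered.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetching(items, load_fn, prefetch=10):
    """Yield load_fn(item) for each item, computed in a background thread."""
    buffer = queue.Queue(maxsize=prefetch)

    def producer():
        try:
            for item in items:
                buffer.put(load_fn(item))  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end, even after an error

    worker = threading.Thread(target=producer)
    worker.daemon = True  # do not keep the process alive if the consumer stops
    worker.start()

    while True:
        loaded = buffer.get()
        if loaded is _DONE:
            break
        yield loaded
    worker.join()


# With a single producer thread the original order is preserved.
print(list(prefetching(range(5), lambda i: i * i, prefetch=2)))  # -> [0, 1, 4, 9, 16]
```

A bounded queue keeps memory use predictable: if the consumer is the bottleneck, the producer simply waits instead of decoding the whole video ahead of time.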

Models

These are the models you can use for inferencing. The original ActEV annotations can be downloaded from here. I will add instructions for training and testing if requested. Click to download each model.

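Object v2 : Trained on v1-train

| Eval on v1-val | Person | Prop  | Push_Pulled_Object | Vehicle | Mean  |
|----------------|--------|-------|--------------------|---------|-------|
| AP             | 0.831  | 0.405 | 0.682              | 0.982   | 0.725 |
| AR             | 0.906  | 0.915 | 0.899              | 0.983   | 0.926 |

Object v3 (Frozen Graph for tf v1.13) : Trained on v1-train, Dilated CNN

| Eval on v1-val | Person | Prop  | Push_Pulled_Object | Vehicle | Mean  |
|----------------|--------|-------|--------------------|---------|-------|
| AP             | 0.836  | 0.448 | 0.702              | 0.984   | 0.742 |
| AR             | 0.911  | 0.910 | 0.895              | 0.985   | 0.925 |

Object v4 : Trained on v1-train & v1-val, Dilated CNN, Class-agnostic

| Eval on v1-val | Person | Prop  | Push_Pulled_Object | Vehicle | Mean  |
|----------------|--------|-------|--------------------|---------|-------|
| AP             | 0.961  | 0.960 | 0.971              | 0.985   | 0.969 |
| AR             | 0.979  | 0.984 | 0.989              | 0.985   | 0.984 |

Object v5 : Trained on v1-train & v1-val, Dilated CNN, Class-agnostic

| Eval on v1-val | Person | Prop  | Push_Pulled_Object | Vehicle | Mean  |
|----------------|--------|-------|--------------------|---------|-------|
| AP             | 0.969  | 0.981 | 0.985              | 0.988   | 0.981 |
| AR             | 0.983  | 0.994 | 0.995              | 0.989   | 0.990 |

Object v6 : Trained on v1-train & v1-val, Squeeze-Excitation CNN, Class-agnostic

| Eval on v1-val | Person | Prop  | Push_Pulled_Object | Vehicle | Mean  |
|----------------|--------|-------|--------------------|---------|-------|
| AP             | 0.973  | 0.986 | 0.990              | 0.987   | 0.984 |
| AR             | 0.984  | 0.994 | 0.996              | 0.988   | 0.990 |

Object COCO : COCO trained Resnet-101 FPN model. Better for indoor scenes.

| Eval on v1-val | Person | Bike  | Push_Pulled_Object | Vehicle | Mean |
|----------------|--------|-------|--------------------|---------|------|
| AP             | 0.378  | 0.398 | N/A                | 0.947   | N/A  |
| AR             | 0.585  | 0.572 | N/A                | 0.965   | N/A  |

Object COCO partial : Same model as above with only Person/Vehicle/Bike classes. Saves time on NMS. Use it with `--use_partial_classes`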

Activity Box Experiments:

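BUPT-MCPRL at the ActivityNet Workshop, CVPR 2019: 3D Faster-RCNN (Numbers taken from their slides)

| Evaluation | Person-Vehicle | Pull | Riding | Talking | Transport_HeavyCarry | Vehicle-Turning | activity_carrying |
|------------|----------------|------|--------|---------|----------------------|-----------------|-------------------|
| AP         | 0.232          | 0.38 | 0.468  | 0.258   | 0.183                | 0.278           | 0.235             |

Our Actbox v1: Trained on v1-train, Dilated CNN, Class-agnostic

| Eval on v1-val | Person-Vehicle | Pull  | Riding | Talking | Transport_HeavyCarry | Vehicle-Turning | activity_carrying |
|----------------|----------------|-------|--------|---------|----------------------|-----------------|-------------------|
| AP             | 0.378          | 0.582 | 0.435  | 0.497   | 0.438                | 0.403           | 0.425             |
| AR             | 0.780          | 0.973 | 0.942  | 0.876   | 0.901                | 0.899           | 0.899             |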

Training & Testing

Instructions for training a new object detection model are here.

Training & Testing (Activity Box)

Instructions for training a new frame-level activity detection model are here.

Speed Optimization

TL;DR:

  • TF v1.10 -> v1.13 (CUDA 9 & cuDNN v7.1 -> CUDA 10 & cuDNN v7.4) ~ +9% faster
  • Use frozen graph ~ +30% faster
  • Use TensorRT (FP32/FP16) optimized graph ~ +0% faster
  • Use TensorRT (INT8) optimized graph ?

Experiments are recorded here.

Other things I have tried

These are my experiences with working on this surveillance dataset:

  1. FPN provides a significant improvement over a non-FPN backbone.
  2. Dilated CNN in the backbone also helps, but the benefit of the Squeeze-Excitation block is unclear (see model obj_v6).
  3. Deformable CNN in the backbone seems to achieve the same improvement as dilated CNN, but my implementation is way too slow.
  4. Cascade RCNN doesn't help (IOU=0.5). I'm using IOU=0.5 in my evaluation since the original annotations are not "tight" bounding boxes.
  5. Decoupled RCNN (using a separate Resnet-101 for box classification) slightly improves AP (Person: 0.836 -> 0.837) but takes 7x more time.
  6. SoftNMS shows mixed results and adds 5% more computation time to the system (since I used the CPU version), so I don't use it.
  7. Tried Mix-up by randomly mixing ground truth bounding boxes from different frames. Doesn't improve performance.
  8. Focal loss doesn't help.
  9. Relation Network does not improve results, and the model is huge (in my implementation).
  10. ResNeXt does not bring a significant improvement on this dataset.
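
Several of the notes above refer to evaluation at IOU=0.5. For reference, intersection-over-union between two boxes can be computed as in the sketch below. It assumes boxes given as [x1, y1, x2, y2] corner coordinates (note that COCO json stores [x, y, w, h] instead); the function names are hypothetical and this is not the code in eval.py.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(detection, ground_truth, iou_threshold=0.5):
    """A detection counts as correct if it overlaps the ground truth enough."""
    return box_iou(detection, ground_truth) >= iou_threshold


# A detection covering 60% of a loosely annotated box still matches at IOU=0.5.
print(is_match([0, 0, 10, 6], [0, 0, 10, 10]))  # -> True
```

With loose annotations, a stricter threshold such as 0.75 would reject detections like the one above even when they localize the object well, which is the motivation given for evaluating at 0.5.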

TODO

  • Use Python Queue and a separate thread for frame extraction (Done!)
  • Make batch_size > 1 for inferencing (Done!)
  • Make batch_size > 1 for training

Acknowledgements

I made this code by studying the nice example in Tensorpack. The EfficientDet part is modified from the official repo.
