
two-stream-action-recognition

We use a spatial stream CNN and a motion stream CNN, both based on ResNet-101, to model video information in the UCF101 dataset.

Reference Paper

1. Data

1.1 Spatial input data -> rgb frames

  • We extract RGB frames from each video in the UCF101 dataset with a sampling rate of 10 and save them as .jpg images on disk, which takes about 5.9 GB (a sketch of this step follows).
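
A minimal sketch of this extraction step, assuming OpenCV and interpreting "sampling rate 10" as keeping every 10th frame; the paths and file-naming scheme are placeholders, not the repository's actual layout.

```python
import os
import cv2

def extract_rgb_frames(video_path, out_dir, sampling_rate=10):
    """Read a UCF101 video and save every `sampling_rate`-th frame as a .jpg."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sampling_rate == 0:
            saved += 1
            cv2.imwrite(os.path.join(out_dir, f"frame{saved:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths):
# extract_rgb_frames("UCF101/v_ApplyEyeMakeup_g01_c01.avi",
#                    "jpegs_256/v_ApplyEyeMakeup_g01_c01")
```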

1.2 Motion input data -> stacked optical flow images

For the motion stream, we obtain optical flow data in one of two ways:

  1. Download the preprocessed TVL1 optical flow dataset directly from https://github.com/feichtenhofer/twostreamfusion.
  2. Use FlowNet 2.0 to generate 2-channel optical flow images and save the x and y channels as separate .jpg images on disk, which takes about 56 GB (see the sketch below).
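
A sketch of the save step for the FlowNet 2.0 route, assuming the flow field is already computed as an (H, W, 2) array; the clipping bound of 20 pixels is an assumption, not taken from the original code.

```python
import numpy as np
import cv2

def save_flow_as_jpg(flow, x_path, y_path, bound=20.0):
    """Rescale a 2-channel flow field to [0, 255] and save x / y as separate .jpgs."""
    flow = np.clip(flow, -bound, bound)                      # clip displacements
    flow = ((flow + bound) / (2 * bound) * 255.0).astype(np.uint8)
    cv2.imwrite(x_path, flow[:, :, 0])                       # x-channel image
    cv2.imwrite(y_path, flow[:, :, 1])                       # y-channel image
```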

1.3 (Alternative) Download the preprocessed data directly from feichtenhofer/twostreamfusion

  • RGB images
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.003

cat ucf101_jpegs_256.zip* > ucf101_jpegs_256.zip
unzip ucf101_jpegs_256.zip
  • Optical Flow
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.003

cat ucf101_tvl1_flow.zip* > ucf101_tvl1_flow.zip
unzip ucf101_tvl1_flow.zip

2. Model

2.1 Spatial cnn

  • As mentioned before, we use ResNet-101, first pre-trained on ImageNet and then fine-tuned on our UCF101 spatial RGB frame dataset (a minimal sketch follows).
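
A minimal sketch of the spatial stream under that description, using torchvision's ImageNet-pretrained ResNet-101 with the final fully connected layer replaced for the 101 UCF101 classes (not necessarily the author's exact construction).

```python
import torch.nn as nn
from torchvision import models

def build_spatial_cnn(num_classes=101):
    model = models.resnet101(pretrained=True)                 # ImageNet weights
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # UCF101 classification head
    return model
```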

2.2 Motion cnn

  • The input to the motion CNN is a stack of optical flow images containing 10 x-channel and 10 y-channel images, so its input shape is (20, 224, 224), which can be treated as a 20-channel image.
  • In order to utilize ImageNet pre-trained weights in our model, we have to transform the weights of the first convolution layer from (64, 3, 7, 7) to (64, 20, 7, 7).
  • In [2], Wang et al. propose a method called **cross modality pre-training** for this weight transformation: the pre-trained weights are first averaged across the RGB channels, and this average is then replicated to match the channel number of the motion stream input (20 in this case). A sketch of the transform follows.
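
A sketch of cross modality pre-training as described above (an illustration, not the author's exact code): average the ImageNet conv1 weights over the 3 RGB channels and replicate the average 20 times so the first layer accepts a 20-channel flow stack.

```python
import torch.nn as nn
from torchvision import models

def build_motion_cnn(num_classes=101, in_channels=20):
    model = models.resnet101(pretrained=True)
    rgb_weight = model.conv1.weight.data                # (64, 3, 7, 7) ImageNet weights
    avg_weight = rgb_weight.mean(dim=1, keepdim=True)   # (64, 1, 7, 7) average over RGB
    new_conv = nn.Conv2d(in_channels, 64, kernel_size=7,
                         stride=2, padding=3, bias=False)
    new_conv.weight.data = avg_weight.repeat(1, in_channels, 1, 1)  # (64, 20, 7, 7)
    model.conv1 = new_conv
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```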

3. Training strategies

3.1 Spatial cnn

  • Here we utilize the technique from Temporal Segment Networks (TSN). For every video in a mini-batch, we randomly select 3 frames. A consensus among the frames is then derived as the video-level prediction for computing the loss (see the sketch below).
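
A sketch of that segment consensus, assuming the consensus is an average of the frame-level logits (the exact consensus function is an assumption):

```python
import torch.nn.functional as F

def video_level_loss(model, frames, labels):
    """frames: (batch, 3, C, H, W) -- 3 frames sampled per video."""
    b, n, c, h, w = frames.shape
    logits = model(frames.view(b * n, c, h, w))       # frame-level logits
    consensus = logits.view(b, n, -1).mean(dim=1)     # average over the 3 frames
    return F.cross_entropy(consensus, labels)         # video-level loss
```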

3.2 Motion cnn

  • In every mini-batch, we randomly select 64 (the batch size) videos from the 9,537 training videos and further randomly select 1 stacked optical flow sample from each video (a sketch of the stacking follows).
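
A sketch of how one stacked optical flow sample could be assembled; the directory layout, file naming, and channel ordering (all x-channels followed by all y-channels) are assumptions for illustration.

```python
import random
import numpy as np
from PIL import Image

def sample_flow_stack(flow_dir_u, flow_dir_v, nb_frames, stack_size=10):
    # Pick a random starting frame so that a full 10-frame stack still fits.
    start = random.randint(1, nb_frames - stack_size + 1)
    xs = [np.array(Image.open(f"{flow_dir_u}/frame{i:06d}.jpg"))
          for i in range(start, start + stack_size)]
    ys = [np.array(Image.open(f"{flow_dir_v}/frame{i:06d}.jpg"))
          for i in range(start, start + stack_size)]
    return np.stack(xs + ys, axis=0)   # (20, H, W): 10 x-channels then 10 y-channels
```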

3.3 Data augmentation

  • Both streams apply the same data augmentation techniques, such as random cropping.
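
A minimal sketch of such an augmentation pipeline (the 224x224 crop size is an assumption, and the input is assumed to be already resized to at least that size):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(224),   # random spatial crop shared by both streams
    transforms.ToTensor(),
])
```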

4. Testing method

  • For each of the 3,783 testing videos, we uniformly sample 19 frames per video, and the video-level prediction is the vote (average) of all 19 frame-level predictions, as sketched below.
  • We choose 19 because the minimum number of frames of a video in UCF101 is 28, and we have to make sure there are enough frames left for the 10-frame stacks in the motion stream.
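
A sketch of this testing procedure; the exact index spacing and the use of averaged softmax scores as the "vote" are assumptions.

```python
import numpy as np

def uniform_indices(nb_frames, num_samples=19):
    """Uniformly spaced 1-based frame indices across the video."""
    return np.linspace(1, nb_frames, num_samples).astype(int)

def video_prediction(frame_scores):
    """frame_scores: (19, num_classes) softmax outputs for the sampled frames."""
    return np.asarray(frame_scores).mean(axis=0).argmax()
```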

5. Performance

| network        | top-1 |
|----------------|-------|
| Spatial CNN    | 82.1% |
| Motion CNN     | 79.4% |
| Average fusion | 88.5% |

6. Pre-trained Model

7. Testing on Your Device

Spatial stream

  • Please modify this path and this function to fit the UCF101 dataset on your device.
  • Training and testing
python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL
  • Only testing
python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate

Motion stream

  • Please modify this path and this function to fit the UCF101 dataset on your device.
  • Training and testing
python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL
  • Only testing
python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate