All Projects → OValery16 → Tutorial About 3d Convolutional Network

OValery16 / Tutorial About 3d Convolutional Network

Licence: mpl-2.0
Tutorial about 3D convolutional network

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Tutorial About 3d Convolutional Network

C Crashcourse
C语言教程+博客+代码演示+课程设计。 帮助初学者更好的理解 C 难点,提升代码量! For beginners:C tuition/self-learning
Stars: ✭ 167 (-5.65%)
Mutual labels:  tutorial
Web Crawler Tutorial
Python 網頁爬蟲入門實戰
Stars: ✭ 170 (-3.95%)
Mutual labels:  tutorial
Spring Cloud Tutorial
Spring Cloud Tutorial.《Spring Cloud 教程》
Stars: ✭ 173 (-2.26%)
Mutual labels:  tutorial
Shape Detection
🟣 Object detection of abstract shapes with neural networks
Stars: ✭ 170 (-3.95%)
Mutual labels:  tutorial
Chrome Extension Book
📚《Chrome Extension 入门指南》
Stars: ✭ 171 (-3.39%)
Mutual labels:  tutorial
Learn Elm Architecture In Javascript
🦄 Learn how to build web apps using the Elm Architecture in "vanilla" JavaScript (step-by-step TDD tutorial)!
Stars: ✭ 173 (-2.26%)
Mutual labels:  tutorial
Glsltuto
GLSL shaders tutorial
Stars: ✭ 168 (-5.08%)
Mutual labels:  tutorial
Invisibility cloak
This is a fun application of image processing which enables you to experience the magic of an invisibility cloak. Let's make our childhood fantasy of using an invisibility cloak come true.
Stars: ✭ 176 (-0.56%)
Mutual labels:  tutorial
Rest Api Basics
This is a basic guide on how to build a REST API with Django & Python. For much deeper depth, check out our new course on REST API: (https://kirr.co/90kxtx)
Stars: ✭ 171 (-3.39%)
Mutual labels:  tutorial
Awesome Flutter
Curated list of bookmarks, packages, tutorials, videos and other cool resources from Google's Flutter
Stars: ✭ 174 (-1.69%)
Mutual labels:  tutorial
Nakama Godot Demo
A demo project with Godot engine and Nakama server.
Stars: ✭ 171 (-3.39%)
Mutual labels:  tutorial
Myapp
🖥️ How to build a Dockerized RESTful API application using Go.
Stars: ✭ 171 (-3.39%)
Mutual labels:  tutorial
Learn Vim
Learn the (arguably) best text-editor on the planet.
Stars: ✭ 173 (-2.26%)
Mutual labels:  tutorial
Cehv10 Notes
📕 Both personal and public notes for EC-Council's CEHv10 312-50, because its thousands of pages/slides of boredom, and a braindump to many
Stars: ✭ 170 (-3.95%)
Mutual labels:  tutorial
Es6 Tutorial
Essentials in JavaScript ES6 - A Fun and Clear Introduction
Stars: ✭ 175 (-1.13%)
Mutual labels:  tutorial
Typescript Definitive Guide
TypeScript: Definitive Guide (book and docs in one)
Stars: ✭ 169 (-4.52%)
Mutual labels:  tutorial
Psi4numpy
Combining Psi4 and Numpy for education and development.
Stars: ✭ 170 (-3.95%)
Mutual labels:  tutorial
Epicsurvivalgameseries
Third-person Survival Game for Unreal Engine 4 (Sample Project)
Stars: ✭ 2,389 (+1249.72%)
Mutual labels:  tutorial
Astropy Tutorials
Tutorials for the Astropy Project
Stars: ✭ 174 (-1.69%)
Mutual labels:  tutorial
Vim Go Tutorial
Tutorial for vim-go
Stars: ✭ 2,058 (+1062.71%)
Mutual labels:  tutorial

Tutorial about 3D convolutional network

For me, Artificial Intelligence is like a passion and I am trying to use it to solve some daily life problems. In this tutorial/project, I want to give some intuitions to the readers about how 3D convolutional neural networks are actually working. There are not a lot of tutorial about 3D convolutional neural networks, and not of a lot of them investigate the logic behind these networks.

"Standard" convolutional network

Before to dive into 3D CNN, let's summarize together what we know about ConvNets. ConvNets consists mainly in 2 parts:

  • The feature extractor: this part of the network takes as input the image and extract the features that are meaningful for its classification. It amplifies aspects of the input that is important for discrimination and suppresses irrelevant variations. Usually, the feature extractor consists of several layers. For instance, an image which could be seen as an array of pixel values. The first layer often learns representations that represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. Finally the third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts.

  • The classifier: this part of the network takes as input the previously computed features and use them to predict the correct label.

CNN

In order to extract such features, ConvNets use 2D convolution operations.

conv2D

Why do we need a 3D CNN?

Traditionally, ConvNets are targeting RGB images (3 channels). The goal of 3D CNN is to take as input a video and extract features from it. When ConvNets extract the graphical characteristics of a single image and put them in a vector (a low-level representation), 3D CNNs extract the graphical characteristics of a set of images. 3D CNNs takes in to account a temporal dimension (the order of the images in the video). From a set of images, 3D CNNs find a low-level representation of a set of images, and this representation is useful to find the right label of the video (a given action is performed)

In order to extract such features, 3D convolution uses 3Dconvolution operations.

conv3D

Is 3D CNN the only solution to video classification?

There are several existing approaches to tackle the video classification. This is a nonexaustive list of existing approaches:

  • ConvNets + LSTM cell: Extract features from each frame with a ConvNet, passing the sequence to an RNN, in a separate network paper, Tutorial on Keras

  • Temporal Relation Networks: Extract features from each frame with a ConvNet and pass the sequence to an MLP paper Github

  • Two-Stream Convolutional Networks: Use 2 CNN, 1 spatial stream ConvNet which process one single frame at a time, and 1 Temporal stream ConvNet which process multi-frame optical flow paper Github

What is the main drawback of 3D CNN?

From my point of view, there are two main drawbacks compared to the other methods:

  • We don't take advantage of previously trained 2D networks (Inception Net, Resnet ...) to extract the video features. This problem is not as important as before recently because several strong 3D CNNs have been build using similar design pattern as the 2D version (paper).

  • The memory: 3D CNNs often use more GPU memory as the traditional 2D CNN. This problem is not as important as before because GPUs tend to have more and more memory.

The dataset

For this tutorial, we will use the 20BN-jester Dataset V1. The 20BN-JESTER dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. You can download it on their website.

After downloading all parts, extract using:

cat 20bn-jester-v1-?? | tar zx

Before to start implementing any network, I advise you to take a quick look of the video (set of images). It is essential that you understand what you need to predict.

The network

As for traditional 2D ConvNet, we net use a set of convolution, max pooling operations to reduce layer after layer the size the of our input data. In this tutorial, the shape of our input is [number of training example in a batch, number of channels, number of images in the video, the height, the width]

The goal is to reduce slowly the graphical dimension (the height and the width) and the temporal dimension (number of images in the video) until you get a vector of dimension [number of training example in a batch, number of channel, 1, 1, 1]. After that, you send this vector to a multi-layers classifier.

Overfitting

In order to reduce the overfitting, I used 3 techniques:

  • Data augmentation: we augment the amount of training data via random affine transformation (RandomAffine(degrees=[-10, 10], translate=[0.15, 0.15]))
  • Dropout: the idea of this technique, basically we want to trade training performance for more generalization. Basically, during training, some fractions of the neurons on a particular layer will be randomly deactivated. This improves generalization because force your layer to learn with different neurons the same "concept".
  • Batch normalization: this technique performs feature scaling and improves gradient flow. It often plays a small in the regularization.

Optimizer

In this tutorial, we use the Adam optimizer with the AMSGRAD optimization described in this paper. Empirically, it tends to improve the convergence of my model. However, the reader should keep in mind that other experiences (from other users) have shown that the AMSGRAD optimization could also lead to worse performance than ADAM.

Vizualization tool

For visualizing the result, I like to use tensorboard. Since tensorboard is not available by default on Pytorch, I used tensorboardX which enable Pytorch to use tensorboard.

How to run this code?

First, you have to set up the config file accordingly (define the right path, the name of your model ...). After you just need to run:

python train.py --config configs/config.json -g 0

For the prediction, run:

python train.py --eval_only True --resume True --config configs/config.json -g 0

-g: the ID of the GPU you are using. If you are using multiple GPU, you can write:

python train.py --config configs/config.json -g 0,1

Note: this code works well for python 3.6 and with pytorch 0.4.

Results

For this network, I got an accuracy of 85% for the validation set. Since there is an accuracy gap between the training set and the validation set, there is plenty of room for improvements.

Below you can find the training curves obtained via tensorboard.

Training

training_loss training_top1 training_top5

Validation

training_loss training_top1 training_top5

Better accuracy

An accuracy of 85% on the validation is nice. However, can we do better?

First, we can observe that the accuracy on the training set is closed to 100%, but the accuracy on the validation set plateau at 85%. There is a difference of 15%, which can be explained by the fact that our model tends to overfit our training dataset. What can you do against that?

Several things can be done:

  • Go deeper: however when you go deeper, you meet the Vanishing gradient problem. If you went to go deeper, I strongly advise you a more advanced 3D CNN architecture such as this work
  • Add more data: Adding more data is always good for the generalization. In our situation, we work with a fixed size dataset. Augmenting the size of our dataset requires data augmentation technique.
  • Combining several predictors to improve the overall prediction. This approach is called Ensemble learning.

Ensemble learning

Let's take a simple example. We have 3 binary classifiers (A,B,C) with a 70% accuracy. A,B,C can be based on 3D convolution architecture as mentioned previously.

For a majority vote with 3 members we can expect 4 outcomes:

All three are correct 0.7 * 0.7 * 0.7 = 0.3429

Two are correct 0.7 * 0.7 * 0.3

  • 0.7 * 0.3 * 0.7
  • 0.3 * 0.7 * 0.7 = 0.4409

Two are wrong 0.3 * 0.3 * 0.7

  • 0.3 * 0.7 * 0.3
  • 0.7 * 0.3 * 0.3 = 0.189

All three are wrong 0.3 * 0.3 * 0.3 = 0.027

We see that most of the times (~44%) the majority vote corrects an error. This majority vote ensemble will be correct an average of ~78% (0.3429 + 0.4409 = 0.7838).

Stacking and 3D CNN

In this implementation, we train first 5 3D CNN separately. Each of them has an accuracy of ~84-85% on the validation set. Then, we combine the predictions in order to increase the overall prediction accuracy.

stacking

And we obtain an accuracy of 88.16% on the validation set.

Important remark

Code based off this GulpIO-benchmarks implementation.

If you like you like this project, feel free to leave a star. (it is my only reward ^^)

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].