Here are my personal deep learning notes. I've written this cheatsheet for keep track my knowledge but you can use it as a guide for learning deep learning aswell.

🗂 Data	🧠 Layers	📉 Loss	🔥 Training	✅ Production
Pytorch dataset	Weight init	Cross entropy	Optimizers	Ensemble
Pytorch dataloader	Activations	Weight Decay	Transfer learning	TTA
Split	Self Attention	Label Smoothing	Clean mem	Pseudolabeling
Normalization	Trained CNN	Mixup	Half precision	Webserver (Flask)
Data augmentation	CoordConv	SoftF1	Multiple GPUs	Distillation
Deal imbalance			Precomputation	Pruning
			Set seed	Quantization (int8)
				TorchScript
				ONNX

🗂 Data

Balance the data

If you can not get more data of the underrepresented classes, you can fix the imbalance with code:

Fix it on the dataloader sampler:
- Weighted Random Sampler
  - torch.utils.data.WeightedRandomSampler(weights=[…])
- Subsample majority class. But you can lose important data.
  - catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="downsampling")
- Oversample minority class. But you can overfit.
  - catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="upsampling")
Fix it on the loss function:
- CrossEntropyLoss(weight=[…])

Custom BalanceClassSampler

class BalanceClassSampler(torch.utils.data.Sampler):
    """
    Allows you to create stratified sample on unbalanced classes.
    Inspired from Catalyst's BalanceClassSampler:
    https://catalyst-team.github.io/catalyst/_modules/catalyst/data/sampler.html#BalanceClassSampler

    Args:
        labels: list of class label for each elem in the dataset
        mode: Strategy to balance classes. Must be one of [downsampling, upsampling]
    """

    def __init__(self, labels:list[int], mode:str = "upsampling"):

        labels = np.array(labels)
        self.unique_labels = set(labels)

        ########## STEP 1:
        # Compute the final_num_samples_per_label
        # An Integer
        num_samples_per_label = {label: (labels == label).sum() for label in self.unique_labels}

        if   mode == "upsampling":   self.final_num_samples_per_label = max(num_samples_per_label.values())
        elif mode == "downsampling": self.final_num_samples_per_label = min(num_samples_per_label.values())
        else:                        raise Exception("mode should be: \"downsampling\" or \"upsampling\"")

        ########## STEP 2:
        # Compute actual indices of every label.
        # A Diccionary of lists
        self.indices_per_label = {label: np.arange(len(labels))[labels==label].tolist() for label in self.unique_labels}


    def __iter__(self): #-> Iterator[int]:

        indices = []
        for label in self.unique_labels:

            label_indices = self.indices_per_label[label]

            repeat_all_elementes  = self.final_num_samples_per_label // len(label_indices)
            pick_random_elementes = self.final_num_samples_per_label %  len(label_indices)

            indices += label_indices * repeat_all_elementes # repeat the list several times
            indices += random.sample(label_indices, k=pick_random_elementes)  # pick random idxs without repetition

        assert len(indices) == self.__len__()
        np.random.shuffle(indices) # Inplace shuffle the list

        return iter(indices)
    

    def __len__(self) -> int:
        return self.final_num_samples_per_label * len(self.unique_labels)

Split in train and validation

Training set: used for learning the parameters of the model.
Validation set: used for evaluating model while training. Don’t create a random validation set! Manually create one so that it matches the distribution of your data. Usaully a 10% or 20% of your train set.
- N-fold cross-validation. Usually 10
Test set: used to get a final estimate of how well the network works.

Normalization

Scale the inputs to have mean 0 and a variance of 1. Also linear decorrelation/whitening/pca helps a lot. Normalization parameters are obtained only from train set, and then applied to both train and valid sets.

Option 1: Standarization x = x-x.mean() / x.std() Most used
1. Mean subtraction: Center the data to zero. x = x - x.mean() fights vanishing and exploding gradients
2. Standardize: Put the data on the same scale. x = x / x.std() improves convergence speed and accuracy
Option 2: PCA Whitening
1. Mean subtraction: Center the data in zero. x = x - x.mean()
2. Decorrelation or PCA: Rotate the data until there is no correlation anymore.
3. Whitening: Put the data on the same scale. whitened = decorrelated / np.sqrt(eigVals + 1e-5)
Option 3: ZCA whitening Zero component analysis (ZCA).
Other options not used:
- (x-x.min()) / (x.max()-x.min()): Values from 0 to 1
- 2*(x-x.min()) / (x.max()-x.min()) - 1: Values from -1 to 1

In case of images, the scale is from 0 to 255, so it is not strictly necessary normalize.

neural networks data preparation

Data augmentation

Cutout: Remove parts
- Parámetro: Elegir el tamaño correto de cuadrado: 16px por ejemplo.
Mixup: Mix 2 samples (both x & y) x = λxᵢ + (1−λ)xⱼ & y = λyᵢ + (1−λ)yⱼ. Fast.ai doc
- Parámetro: Elegir λ sampleando la distribución beta α=β=0.4 ó 0.2 (Así pocas veces la imgs se mezclarán)
CutMix: Mix 2 samples in some parts. Fast.ai doc
AugMix: No loos info.
RandAugment
AutoAugment

WandB post with TF2 code

Image data aug

Augmentation	Description	Pillow
Rotate	Rotate some degrees	pil_img.rotate()
Translate		pil_img.transform()
Shear	Affine transform	pil_img.transform()
Autocontrast	Equalize the histogram (linear)	PIL.ImageOps.autocontrast()
Equalize	Equalize the histogram (non-linear)	PIL.ImageOps.equalize()
Posterize	Reducing pixel bits	PIL.ImageOps.posterize()
Solarize	Inverting colors above a threshold	PIL.ImageOps.solarize()
Color		PIL.ImageEnhance.Color()
Contrast		PIL.ImageEnhance.Contrast()
Brightness		PIL.ImageEnhance.Brightness()
Sharpness	Sharpen or blurs the image	PIL.ImageEnhance.Sharpness()

Interpolations when rotate, translate or affine:

Image.BILINEAR
etc

🧠 Model

Weight init

Depends on the models architecture. Try to avoid vanishing or exploding outputs. blog1, blog2.

Constant value: Very bad
Random:
- Uniform: From 0 to 1. Or from -1 to 1. Bad
- Normal: Mean 0, std=1. Better
Xavier initialization: Good for MLPs with tanh activation func. paper
- Uniform:
- Normal:
Kaiming initialization: Good for MLPs with ReLU activation func. (a.k.a. He initialization) paper
- Uniform
- Normal
- When you use Kaiming, you ha to fix ReLU(x) equals to min(x,0) - 0.5 for a correct mean (0)
Delta-Orthogonal initialization: Good for vanilla CNNs (10000 layers). Read this paper

def weight_init(m):

	# LINEAR
	if type(m) == nn.Linear:
		torch.nn.init.xavier_uniform(m.weight)
		m.bias.data.fill_(0.01)

	# CONVS
	classname = m.__class__.__name__
	if classname.find('Conv') != -1:
		nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('relu'))
		nn.init.zeros_(m.bias)

model.apply(weight_init)

Activations

reference

Softmax: Sigle-label classification (last layer)
Sigmoid: Multi-label classification (last layer)
Hyperbolic tangent:
ReLU: Non-linearity compontent of the net (hidden layers) check this paper
ELU: Exponential Linear Unit. paper
SELU: Scaled Exponential Linear Unit. paper
PReLU or Leaky ReLU:
GLU: Gated Linear Unit. (from TabNet paper) blog linear1(x) * sigmoid(linear2(x))
SERLU:
Smoother ReLU. Differienzable. BEST
- GeLU: Gaussian Error Linear Units. Used in transformers. paper. (2016)
- Swish: x * sigmoid(x) paper (2017)
- Elish: xxxx paper (2018)
- Mish: x * tanh( ln(1 + e^x) ) paper (2019)
- myActFunc 1 = 0.5 * x * ( tanh(x) + 1 )
- myActFunc 2 = 0.5 * x * ( tanh (x+1) + 1)
- myActFunc 3 = x * ((x+x+1)/(abs(x+1) + abs(x)) * 0.5 + 0.5)

CoordConv

class AddCoord2D(torch.nn.Module):
    def __init__(self, len):
        super(AddCoord2D, self).__init__()
        
        i_coord = torch.linspace(start=1/len, end=1, steps=len).view(len, -1).expand(-1, len)
        j_coord = torch.linspace(start=1/len, end=1, steps=len).view(-1, len).expand(len, -1)
        self.coords = torch.stack([i_coord, j_coord])

        print(self.coords.shape)

    def forward(self, x): # X shape: [BS, C, X, Y]
        BS = x.shape[0]
        return torch.cat((x, self.coords.expand(BS,-1,-1,-1)), dim=1)

🧐 Regularization

Dropout

During training, some neurons will be deactivated randomly. Hinton, 2012, Srivasta, 2014

Weight regularization

Weight penalty: Regularization in loss function (penalice high weights). Weight decay hyper-parameter usually 0.0005.

Visually, the weights only can take a value inside the blue region, and the red circles represent the minimum. Here, there are 2 weight variables.

L1 (LASSO)	L2 (Ridge)	Elastic Net

Shrinks coefficients to 0. Good for variable selection	Most used. Makes coefficients smaller	Tradeoff between variable selection and small coefficients
Penalizes the sum of absolute weights	Penalizes the sum of squared weights	Combination of 2 before
`loss + wd * weights.abs().sum()`	`loss + wd * weights.pow(2).sum()`

DropConnect

At training and inference, some connections (weights) will be deactivated permanently. LeCun, 2013. This is very useful at the firsts layers.

Distillation

Knowledge Distillation (teacher-student) A teacher model teach a student model.

Smaller student model → faster model.
- Model compresion: Less memory and computation
- To generalize and avoid outliers.
- Used in NLP transformers.
- paper
Bigger student model is → more accurate model.
- Useful when you have extra unlabeled data (kaggle competitions)
- 1. Train the teacher model with labeled dataset.
- 2. With the extra on unlabeled dataset, generate pseudo labels (soft or hard labels)
- 3. Train a student model on both labeled and pseudo-labeled datasets.
- 4. Student becomes teacher and repeat -> 2.
- Paper: When Does Label Smoothing Help?
- Paper: Noisy Student
- Video: Noisy Student

📉 Loss

Loss function

Regression
- MBE: Mean Bias Error: mean(GT - pred) It could determine if the model has positive bias or negative bias.
- MAE: Mean Absolute Error (L1 loss): mean(|GT - pred|) The most simple.
- MSE: Mean Squared Error (L2 loss): mean((GT-pred)²) Penalice large errors more than MAE. Most used
- RMSE: Root Mean Squared Error: sqrt(MSE) Proportional to MSE. Value closer to MAE.
- Percentage errors:
  - MAPE: Mean Absolute Percentage Error
  - MSPE: Mean Squared Percentage Error
  - RMSPE: Root Mean Squared Percentage Error
Classification
- Cross Entropy: Sigle-label classification. Usually with softmax. nn.CrossEntropyLoss.
  - NLL: Negative Log Likelihood is the one-hot encoded target simplified version, see this nn.NLLLoss()
- Binary Cross Entropy: Multi-label classification. Usually with sigmoid. nn.BCELoss
- Hinge: Multi class SVM Loss nn.HingeEmbeddingLoss()
- Focal loss: Similar to BCE but scaled down, so the network focuses more on incorrect and low confidence labels than on increasing its confidence in the already correct labels. -(1-p)^gamma * log(p) paper
Segmentation
- Pixel-wise cross entropy
- IoU (F0): (Pred ∩ GT)/(Pred ∪ GT) = TP / TP + FP * FN
- Dice (F1): 2 * (Pred ∩ GT)/(Pred + GT) = 2·TP / 2·TP + FP * FN
  - Range from 0 (worst) to 1 (best)
  - In order to formulate a loss function which can be minimized, we'll simply use 1 − Dice

Label Smoothing

Smooth the one-hot target label.

LabelSmoothingCrossEntropy(eps:float=0.1, reduction='mean')

Referennce

Blog: When Does Label Smoothing Help?

Paper: When Does Label Smoothing Help?

📈 Metrics

Classification Metrics

Dataset with 5 disease images and 20 normal images. If the model predicts all images to be normal, its accuracy is 80%, and F1-score of such a model is 0.88

Accuracy: TP + TN / TP + TN + FP + FN
F1 Score: 2 * (Prec*Rec)/(Prec+Rec)
- Precision: TP / TP + FP = TP / predicted possitives
- Recall: TP / TP + FN = TP / actual possitives
Dice Score: 2 * (Pred ∩ GT)/(Pred + GT)
ROC, AUC:
Log loss:

🔥 Train

Learning Rate

How big the steps are during training.

Max LR: Compute it with LR Finder (lr_find())
LR schedule:
- Constant: Never use.
- Reduce it gradually: By steps, by a decay factor, with LR annealing, etc.
  - Flat + Cosine annealing: Flat start, and then at 50%-75%, start dropping the lr based on a cosine anneal.
- Warm restarts (SGDWR, AdamWR):
- OneCycle: Use LRFinder to know your maximum lr. Good for Adam.

Batch size

Number of samples to learn simultaneously.

Batch size = 1: Train each sample individually. (Online gradient descent) ❌
Batch size = length(dataset): Train the whole dataset at once, as a batch. (Batch gradient descent) ❌
Batch size = number: Train disjoint groups of samples (Mini-batch gradient descent). ✅
- Usually a power of 2. 32 or 64 are good values.
- Too low: like 4: Lot of updates. Very noisy random updates in the net (bad).
- Too high: like 512 Few updates. Very general common updates (bad).
  - Faster computation. Takes advantage of GPU mem. But sometimes it can no be fitted (CUDA Out Of Memory)

Some people are tring to make a batch size finder according to this paper.

Number of epochs

Times to learn the whole dataset.

Train until start overffiting (validation loss becomes to increase) (early stopping)

Optimizers

	Description	Paper	Fast.ai 2	Score
SGD	Basic method. `new_w = w - lr * grad_w`		SGD(lr=0.1)
SGD with Momentum	Speed it up with momentum, usually `mom=0.9`		SGD(lr=0.1, mom=0.9)
AdaGrad	Adaptative lr	2011	-
RMSProp	Similar to momentum but with the gradient squared.	2012	RMSProp(lr=0.1)
Adam	Momentum + RMSProp.	2014	Adam(lr=0.1, wd=0)	⭐
LARS	Compute lr for each layer with a certain trust.	2017	Larc(lr=0.1, clip=False)
LARC	Original LARS clipped to be always less than lr		Larc(lr=0.1, clip=True)
AdamW	Adam + decoupled weight decay	2017
AMSGrad	Worse than Adam in practice. (AdamX: new verion)	2018
QHAdam	Quasi-Hyperbolic Adam	2018	QHAdam(lr=0.1)
LAMB	LARC with Adam	2019	Lamb(lr=0.1)
NovoGrad	.	2019
Lookahead	Stabilizes training at the rest of training.	2019	Lookahead(SGD(lr=0.1))
RAdam	Rectified Adam. Stabilizes training at the start.	2019	RAdam(lr=0.1)
Ranger	RAdam + Lookahead.	2019	ranger()	⭐⭐⭐
RangerLars	RAdam + Lookahead + LARS. (aka Over9000)	2019		⭐⭐⭐
Ralamb	RAdam + LARS.	2019
Selective-Backprop	Faster training by focusing on the biggest losers.	2019
DiffGrad	Solves Adam’s "overshoot" issue	2019
AdaMod	Optimizer with memory	2019
DeepMemory	DiffGrad + AdaMod

SGD: new_w = w - lr[gradient_w]
SGD with Momentum: Usually mom=0.9.
- mom=0.9, means a 10% is the normal derivative and a 90% is the same direction I went last time.
- new_w = w - lr[(0.1 * gradient_w) + (0.9 * w)]
- Other common values are 0.5, 0.7 and 0.99.
RMSProp (Adaptative lr) From 2012. Similar to momentum but with the gradient squared.
- new_w = w - lr * gradient_w / [(0.1 * gradient_w²) + (0.9 * w)]
- If the gradient in not so volatile, take grater steps. Otherwise, take smaller steps.
DiffGrad
AdaMod

Optimizers in Fast.ai

You can build every optimizer by doing 2 things:

Stats: keep track of whats is going on on the parameters
Steppers: Figure out how to update the parameters

TODO: Read:

Efficient BackProp (1998, Yann LeCun)

LR finder

blog

paper

Superconvergence

A disciplined approach to neural network hyper-parameters (2018, Leslie Smith)

The 1cycle policy

Set seed

def seed_everything(seed):
	os.environ['PYTHONHASHSEED'] = str(seed)
	random.seed(seed)         # Random
	np.random.seed(seed)      # Numpy
	torch.manual_seed(seed)   # Pytorch
	torch.cuda.manual_seed(seed)
	torch.backends.cudnn.deterministic = True
	torch.backends.cudnn.benchmark     = False
	#tf.random.set_seed(seed) # Tensorflow

Clean mem

Read this

def clean_mem():
	gc.collect()
	torch.cuda.empty_cache()

Multiple GPUs

learn.to_parallel()

Reference

https://dev.fast.ai/distributed

Half precision

learn.to_fp16()
learn.to_fp32()

Reference

http://dev.fast.ai/callback.fp16

✅ Production

Webserver

SERVER (Flask)

import numpy as np
import torch
from torchvision import models
import torchvision.transforms as transforms
from PIL import Image
from flask import Flask, jsonify, request
import json


app = Flask(__name__)
app.config['JSON_SORT_KEYS'] = False

classes = json.load(open('imagenet_classes.json'))
model = models.densenet121(pretrained=True)
model.eval()

def pre_process(image_file):
    my_transforms = transforms.Compose([transforms.Resize(255),
                                        transforms.CenterCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    image = Image.open(image_file)
    return my_transforms(image).unsqueeze(0) # unsqueeze is for the BS dim

def post_process(logits):
    vals, idxs = logits.softmax(1).topk(5)
    vals = vals[0].numpy()
    idxs = idxs[0].numpy()
    result = {}
    for idx, val in zip(idxs, vals):
        result[classes[idx]] = round(float(val), 4)
    return result

def get_prediction(image_file):
    with torch.no_grad():
        image_tensor  = pre_process(image_file)
        output = model.forward(image_tensor)
        return post_process(output)

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        image_file = request.files['my_img_file']
        result_dict = get_prediction(image_file)
        #return jsonify(result_dict)
        #return json.dumps(result_dict)
        return result_dict

if __name__ == '__main__':
    app.run()

Run server

FLASK_ENV=development FLASK_APP=app.py flask run

CLIENT (command line)

curl -X POST -F [email protected] http://localhost:5000/predict

CLIENT (python)

import requests
resp = requests.post("http://localhost:5000/predict",
                     files={"my_img_file": open('cardigan.jpg','rb')})
print(resp.json())

Example server response

{
  "cardigan": 0.7083, 
  "wool": 0.0837, 
  "suit": 0.0431, 
  "Windsor_tie": 0.031, 
  "trench_coat": 0.0307
}

Quantization

3 options

	What	Accuracy	Pytorch API
Dynamic Quantization	Weights only	Good	`qmodel = torch.quantization.quantize_dynamic(model, dtype=torch.qint8)`
Post Training Quantization	Weights and activations	Good	`model.qconfig = torch.quantization.default_qconfig` `torch.quantization.prepare(model, inplace=True)` `torch.quantization.convert(model, inplace=True)`
Quantization-Aware Training	Weights and activations	Best	`torch.quantization.prepare_qat` -> `torch.quantization.convert`

Reference

Official Video (10 mins)

Official Docs

Official Blog

Pruning

import torch.nn.utils.prune as prune

parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
    (model.fc2, 'weight'),
    (model.fc3, 'weight'),
)

Percentage Pruning

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

Threshold Pruning

class ThresholdPruning(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"
    def __init__(self, threshold): self.threshold = threshold
    def compute_mask(self, tensor, default_mask): return torch.abs(tensor) > self.threshold
    
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=ThresholdPruning,
    threshold=0.01
)

See pruning results

def pruned_info(model):
    print("Weights pruned:")
    print("==============")
    total_pruned, total_weights = 0,0
    for name, chil in model.named_children():
        layer_pruned  = torch.sum(chil.weight == 0)
        layer_weights = chil.weight.nelement()
        total_pruned += layer_pruned
        total_weights  += layer_weights

        print(name, "\t{:.2f}%".format(100 * float(layer_pruned)/ float(layer_weights)))
    print("==============")
    print("Total\t{:.2f}%".format(100 * float(total_pruned)/ float(total_weights)))
    
# Weights pruned:
# ==============
# conv1  1.85%
# conv2  8.10%
# fc1    19.76%
# fc2    10.66%
# fc3    9.40%
# ==============
# Total  17.90%

Iterative magnitude pruning is iterative process of removing connections (Prune/Train/Repeat):

Train a big model
Do early stopping
Compress model
- Prune: Find the 15% of weights with the smallest magnitude and set them to zero.
- Train: Then finetune the model until it reaches within 99.5% of its original validation accuracy.
- Repeat: Then prune another 15% of the smallest magnitude weights and finetune.

At the end you can have pruned the 15%, 30%, 45%, 60%, 75%, and 90% of your original model.

Reference

Code:

Pytorch pruning tutorial

Papers:

Deep Compression (2015)

Train Large, Then Compress (2020)

Neural Networks are Surprisingly Modular (2020)

TorchScript

An intermediate representation of a PyTorch model

torch_script = torch.jit.script(MyModel())
torch_script.save("my_model_script.pt")

Reference

Pytorch TorchScript tutorial

Video: Research to Production: PyTorch JIT/TorchScript (10 mins)

Video: From Research to Production with PyTorch (46 mins)

ONNX

torch.onnx.export(model, img, f, verbose=False, opset_version=11)  # Export to onnx

# Check onnx model
import onnx

model = onnx.load(f)  # load onnx model
onnx.checker.check_model(model)  # check onnx model
print(onnx.helper.printable_graph(model.graph))  # print a human readable representation of the graph
print('Export complete. ONNX model saved to %s\nView with https://github.com/lutzroeder/netron' % f)

Reference

Pytorch ONNX docs

Pytorch ONNX tutorial

ONNX_export.py for YOLOv5

🧐 Improve generalization
and avoid overfitting

(try in that order)

Get more data
- Similar datasets: Get a similar dataset for your problem.
- Create your own dataset
  - Segmentation annotation with Polygon-RNN++
- Synthetic data: Virtual objects and scenes instead of real images. Infinite possibilities of lighting, colors, angles...
Data augmentation: Augment your current data. (albumentations for faster aug. using the GPU)
- Test time augmentation (TTA): The same augmentations will also be applied when we are predicting (inference). It can improve our results if we run inference multiple times for each sample and average out the predictions.
- AutoAugment: RL for data augmentation. Trasfer learning NOT THE WEIGHTS but the policies of how to do data augmentation.
Regularization
- Dropout. Usually 0.5
- Weight penalty: Regularization in loss function (penalice high weights). Usually 0.0005
  - L1 regularization: penalizes the sum of absolute weights.
  - L2 regularization: penalizes the sum of squared weights by a factor, usually 0.01 or 0.1.
  - Weight decay: wd * w. Sometimes mathematically identical to L2 reg.
Reduce model complexity: Limit the number of hidden layers and the number of units per layer.
- Generalizable architectures?: Add more bachnorm layers, more densenets...
Ensambles: Gather a bunch of models to give a final prediction. kaggle ensembling guide
- Combination methods:
  - Ensembling: Merge final output (average, weighted average, majority vote, weighted majority vote).
  - Meta ensembling: Same but use a new model to produce the final output. (also called stacking or blending)
- Models generation techniques:
  - Stacking: Just use different classifiers algorithms.
  - Bagging (Bootstrap aggregating): Each model trained with a subset of the training data. Used in random forests. Prob of sample being selected: 0.632 Prob of sample in Out Of Bag 0.368
  - Boosting: The predictors are not made independently, but sequentially. Used in gradient boosting.
  - Snapshot Ensembling: Only for neural nets. M models for the cost of 1. Thanks to SGD with restarts you have several local minimum that you can average. paper.

Other tricks:

Label Smoothing: Smooth the one-hot target label

Knowledge Distillation: A bigger trained net (teacher) helps the network paper

🕓 Train faster

Transfer learning: Use a pretrainded model and retrain with your data.
1. Replace last layer
2. Fine-tune new layers
3. Fine-tune more layers (optional)
Batch Normalization Add BachNorm layers after your convolutions and linear layers for make things easier to your net and train faster.
Precomputation
1. Freeze the layers you don’t want to modify
2. Calculate the activations the last layer from the frozen layers(for your entire dataset)
3. Save those activations to disk
4. Use those activations as the input of your trainable layers
Half precision (fp16)
Multiple GPUs
2nd order optimization

Normalization inside network:

Batch Normalization paper

Layer Normalization paper

Instance Normalization paper

Group Normalization paper

Supervised DL

Structured
- Tabular
  - xDeepFM
  - Andres solution to ieee-fraud-detection
  - NODE: Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data paper
  - Continuous variables: Feed them directly to the network
  - Categorical variable: Use embeddings
- Collaborative filtering: When you have users and items. Useful for recommendation systems.
  - Singular Value Decomposition (SVD)
  - Metrics: Mean Average Precision (MAP)
- Time series
  - Arimax
  - IoT sensors
- Geospatial: Do Kaggle course
Unstructured
- Vision: Image, Video. Check my vision repo
- Audio: Sound, music, speech. Check my audio repo. Audio overview
- NLP: Text, Genomics. Check my NLP repo
- Knoledge Graph (KG): Graph Neural Networks (GNN)
  - Molecules
- Trees
  - math expresions
  - syntax
  - Models: Tree-LSTM, RNNGrammar (RNNG).
  - Tree2seq by Polish notation. Duda: only for binary trees?

Autoencoder

Standard autoencoders: Made for reconstruct the input. No continuous latant space.
- Simple Autoencoder: Same input and output net with a smaller middle hidden layer (botleneck layer, latent vector).
- Denoising Autoencoder (DAE): Adds noise to the input to learn how to remove noise.
- Only have a recontruction loss (pixel mean squared error for example)
Variational Autoencoder (VAE): Initially trained as a reconstruction problem, but later we can play with the latent vector to generate new outputs. Latant space need to be continuous.
- Latent vector: Is modified by adding gaussian noise (normal distribution, mean and std vectors) during training.
- Loss: loss = recontruction loss + latent loss
  - Recontruction loss: Keeps the output similar to the input (mean squared error)
  - Latent loss: Keeps the latent space continuous (KL divergence)
- Disentangled Variational Autoencoder (β-VAE): Improved version. Each parameter of the latent vector is devotod to tweak 1 characteristic. paper.
  - β to small: Overfitting. Learn to reconstruct your training data, but i won't generalize
  - β to big: Loose high definition details. Worse performance.
Hierarchial VAE (HVAE):
- Can be thought of as a series of VAEs stacked on top of each other
NVAE: Hierarchial VAE to the extreme

Neural Representations

2D: [x,y]->[R,G,B]

3D: [x,y,z]->[R,G,B,alpha]

Input coordinates with sine & cos (positional encoding) NeRF
Replacing the ReLU activations with sine functions SIREN
Input coordinates into a Fourier feature space Fourier

Mejoras sobre el NeRF

Description	Website	Video	Paper
NeRF in the Wild	web	3:41	Aug 2020
NeRF++			Oct 2020
Deformable NeRF (nerfies)	web	7:26	Nov 2020
NeRF with time dimension	web	2:21	Nov 2020
NeRF with better weight init	web	3:54	Dec 2020

Graph Neural Networks

Type of graph data
- Graph Databases
- Knowledge Graphs (KG): Describes real-world entities and their interrelations
- Social Networks
- Transport Graphs
- Molecules (including proteins): Make predictions about their properties and reactions.
Models
- GNN Graph Neural Network, 2009
- DeepWalk: Online Learning of Social Representations, 2014
- GraphSage, 2017
- Relational inductive biases, DL, and graph networks, 2018
- KGCN: Knowledge Graph Convolutional Network, 2019
Survey papers
- A Gentle Introduction to GNN Medium, Feb 2019
- GNN: A Review of Methods and Applications: Dic 2018, last revised Jul 2019
- A Comprehensive Survey on GNN: Jan 2019, last revised Aug 2019
Application examples:
- Smell molecules
- Newton vs the machine: Solving the 3-body problem using DL (Not using graphs)

Semi-supervised DL

Check this kaggle discussion

Ladder Networks
GANs
Clustering like KMeans
Variational Autoencoder (VAE)
Pseudolabeling: Retrain with predicted test data as new labels.
label propagation and label spreading tutorial

Reinforcement Learning

Best resources:
- Openai spinning up: Probably the best one.
- Udacity repo: Good free repo for the paid course.
- theschool.ai move 37
- Reinforcement Learning: An Introduction: Best book
Q-learning
- DQN
Policy gradients
- A3C
C51
Rainbow
Implicit Quantile
Evolutionary Strategy
Genetic Algorithms

Reinforcement learning reference

RL’s foundational flaw

How to fix reinforcement learning

AlphaGoZero

Trust Region Policy Optimization

Introduction to RL Algorithms. Part I

Introduction to RL Algorithms. Part II

pytorch tutorial

RL Adventure

RL Adventure 2

DeepRL

Resources

Antor TODO

Automatic featuring engeniring

Fast.ai tabular: Not really works well
Problems:
- DL can not see frequency of an item
- Items that does not appear in the train set
Manually align 2 distributions:
- Microsoft Malware Prediction
- CPMP Solution: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84069

How start a competition/ML project

Data exploaration , haw is the data that we are going to work with
Think about input representation
- Is redundant?
- Need to be converted to somthing else?
- The most entropy that you can reconstruct the raw data
Look at the metric
- Makes sense?
- Is it differentiable
- Can i buid good enough metric equivalent
Build a toy model an overfit it with 1 or few samples
- To make sure that nothing is really broken

JPEG: 2 levels of comprehension:

Entropy
Choram

LIDAR

Projections (BAD REPRESENTATION) (complicated things with voxels) Dense matrix (antor) - Its a depth map i think - Not projections - NAtive output of the sensor but condensed in a dense matrix

Unordered set (point cloud, molecules)

Point net
transformer without positional encoding
- AtomTransformer (by antor)
- MoleculeTransformer (by antor)

TODO

Multi-Task Learning: Train a model on a variety of learning tasks

Meta-learning: Learn new tasks with minimal data using prior knowledge.

N-Shot Learning

Zero-shot: 0 trainning examples of that class.

One-shot: 1 trainning example of that class.

Few-shot: 2...5 trainning examples of that class.

Models

Naive approach: re-training the model on the new data, would severely overfit.

Siamese Networks (2015) Knows if to inputs are the same or not. (2 Feature extraction shares wights)

Matching Networks (2016) Weighted nearest-neighbor classifier applied within an embedding space.

Model-Agnostic Meta-Learning (MAML) (2017)

Prototypical Networks (2017): Better nearest-neighbor classifier of embeddings.

Meta-Learning for Semi-Supervised classification (2018) Extensions of Prototypical Networks. SotA.

Meta-Transfer Learning (MTL) (2018)

Online Meta-Learning (2019)

Neural Turing machine. paper, code

Neural Arithmetic Logic Units (NALU) paper

Remember the math:

Matrix calculus

Einsum: link 1, link 2

nvidia-smi daemon: Check that sm% is near to 100% for a good GPU usage.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

javiabellan / deep-learning

Programming Languages

Labels

Projects that are alternatives of or similar to deep-learning

🗂 Data

Balance the data

Split in train and validation

Normalization

Data augmentation

Image data aug

🧠 Model

Weight init

Activations

CoordConv

🧐 Regularization

Dropout

Weight regularization

DropConnect

Distillation

📉 Loss

Loss function

Label Smoothing

Referennce

📈 Metrics

Classification Metrics

🔥 Train

Learning Rate

Batch size

Number of epochs

Optimizers

Optimizers in Fast.ai

Set seed

Clean mem

Multiple GPUs

Reference

Half precision

Reference

✅ Production

Webserver

SERVER (Flask)

Run server

CLIENT (command line)

CLIENT (python)

Example server response

Quantization

3 options

Reference

Pruning

Percentage Pruning

Threshold Pruning

See pruning results

Reference

TorchScript

An intermediate representation of a PyTorch model

Reference

ONNX

Reference

🧐 Improve generalizationand avoid overfitting

(try in that order)

Other tricks:

🕓 Train faster

Supervised DL

Autoencoder

Neural Representations

Mejoras sobre el NeRF

Graph Neural Networks

Semi-supervised DL

Reinforcement Learning

Resources

Antor TODO

Automatic featuring engeniring

How start a competition/ML project

JPEG: 2 levels of comprehension:

LIDAR

Unordered set (point cloud, molecules)

TODO

🧐 Improve generalization
and avoid overfitting