catalyst-team / classification

Licence: Apache-2.0 license
Catalyst.Classification

Accelerated DL & RL!

PyTorch framework for Deep Learning research and development. It was developed with a focus on reproducibility, fast experimentation, and reuse of code and ideas, so that you can research and develop something new rather than write yet another training loop.
Break the cycle - use the Catalyst!

Project manifest. Part of PyTorch Ecosystem. Part of Catalyst Ecosystem:

  • Alchemy - Experiments logging & visualization
  • Catalyst - Accelerated Deep Learning Research and Development
  • Reaction - Convenient Deep Learning models serving

Catalyst at AI Landscape.


Catalyst.Classification

Note: this repo uses the advanced Catalyst Config API and may be somewhat out of date. Please use the minimal examples section of Catalyst as a starting point and for up-to-date use cases.

You will learn how to build an image classification pipeline with transfer learning, using the Catalyst framework to get reproducible results.

Goals

  1. Install requirements
  2. Prepare data
  3. Run: raw data → production-ready model
  4. Get results
  5. Customize own pipeline

1. Install requirements

Using local environment:

pip install -r requirements/requirements.txt

Using docker:

This builds the catalyst-classification image with the necessary libraries:

make docker-build

2. Get Dataset

Try on open datasets

You can use one of the open datasets:

export DATASET="artworks"

rm -rf data/
mkdir -p data

if [[ "$DATASET" == "ants_bees" ]]; then
    # https://www.kaggle.com/ajayrana/hymenoptera-data
    download-gdrive 1czneYKcE2sT8dAMHz3FL12hOU7m1ZkE7 ants_bees_cleared_190806.tar.gz
    tar -xf ants_bees_cleared_190806.tar.gz &>/dev/null
    mv ants_bees_cleared_190806 ./data/origin
elif [[ "$DATASET" == "flowers" ]]; then
    # https://www.kaggle.com/alxmamaev/flowers-recognition
    download-gdrive 1rvZGAkdLlbR_MEd4aDvXW11KnLaVRGFM flowers.tar.gz
    tar -xf flowers.tar.gz &>/dev/null
    mv flowers ./data/origin
elif [[ "$DATASET" == "artworks" ]]; then
    # https://www.kaggle.com/ikarus777/best-artworks-of-all-time
    download-gdrive 1eAk36MEMjKPKL5j9VWLvNTVKk4ube9Ml artworks.tar.gz
    tar -xf artworks.tar.gz &>/dev/null
    mv artworks ./data/origin
fi

Use your own dataset

Prepare your dataset

Data structure

Make sure that the final data folder has the required structure:

/path/to/your_dataset/
        class_name_1/
            images
        class_name_2/
            images
        ...
        class_name_100500/
            ...
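
Before starting the pipeline, it can help to check that your folder really follows this class-per-folder layout. The snippet below is a minimal sketch, not part of the repository: the helper names and the list of image extensions are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumption: these extensions count as images; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_dir):
    """Return {class_name: number_of_images} for a class-per-folder dataset."""
    root = Path(dataset_dir)
    counts = {}
    # Every direct subdirectory is treated as one class.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def find_empty_classes(counts):
    """Names of class folders that contain no recognised image files."""
    return [name for name, n in counts.items() if n == 0]
```

For example, `find_empty_classes(count_images_per_class("/path/to/your_dataset"))` lists class folders that would contribute no training images.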

Data location

  • The easiest way is to move your data:

    mv /path/to/your_dataset/* /catalyst.classification/data/origin

    That way you can run the pipeline with the default settings.

  • If you prefer to leave the data in /path/to/your_dataset/:

    • In local environment:

      • Link directory
        ln -s /path/to/your_dataset $(pwd)/data/origin
      • Or just set the path to your dataset with DATADIR=/path/to/your_dataset when you start the pipeline.
    • Using docker

      You need to set:

         -v /path/to/your_dataset:/data \  # instead of the default $(pwd)/data/origin:/data

      in the script below to start the pipeline.

3. Classification pipeline

Fast&Furious: raw data → production-ready model

The pipeline will automatically guide you from raw data to the production-ready model.

We initialize a ResNet-18 model from pre-trained weights. In this pipeline the model is trained sequentially in two stages; in the first stage, several heads are trained simultaneously.

Run in local environment:

# --max-image-size: 224 or 448 works well
# --balance-strategy: images per class per epoch; 1024 works well
# --criterion: one of CrossEntropyLoss, BCEWithLogits, FocalLossMultiClass
# (comments are kept up here because a comment after "\" breaks line continuation)
CUDA_VISIBLE_DEVICES=0 \
CUDNN_BENCHMARK="True" \
CUDNN_DETERMINISTIC="True" \
bash ./bin/catalyst-classification-pipeline.sh \
  --workdir ./logs \
  --datadir ./data/origin \
  --max-image-size 224 \
  --balance-strategy 256 \
  --config-template ./configs/templates/main.yml \
  --num-workers 4 \
  --batch-size 256 \
  --criterion CrossEntropyLoss
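
The --balance-strategy value is described as the number of images per class in each epoch. The sketch below illustrates one way such balancing can work; it is not the Catalyst implementation. The function name is hypothetical, and it assumes that small classes are oversampled with replacement while large classes are subsampled.

```python
import random
from collections import defaultdict


def balanced_epoch_indices(labels, per_class, seed=0):
    """Pick `per_class` sample indices for every class, for one epoch.

    Assumption: classes with fewer than `per_class` samples are drawn
    with replacement; larger classes are subsampled without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    epoch = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) >= per_class:
            epoch.extend(rng.sample(indices, per_class))
        else:
            epoch.extend(rng.choices(indices, k=per_class))
    rng.shuffle(epoch)  # mix classes so batches are not grouped by class
    return epoch
```

With this scheme every class contributes the same number of samples to an epoch, regardless of how many images it has on disk.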

Run in docker:

# --max-image-size: 224 or 448 works well
# --balance-strategy: images per class per epoch; 1024 works well
# --criterion: one of CrossEntropyLoss, BCEWithLogits, FocalLossMultiClass
# (comments are kept up here because a comment after "\" breaks line continuation)
docker run -it --rm --shm-size 8G --runtime=nvidia \
  -v $(pwd):/workspace/ \
  -v $(pwd)/logs:/logdir/ \
  -v $(pwd)/data/origin:/data \
  -e "CUDA_VISIBLE_DEVICES=0" \
  -e "CUDNN_BENCHMARK=True" \
  -e "CUDNN_DETERMINISTIC=True" \
  catalyst-classification ./bin/catalyst-classification-pipeline.sh \
    --workdir /logdir \
    --datadir /data \
    --max-image-size 224 \
    --balance-strategy 256 \
    --config-template ./configs/templates/main.yml \
    --num-workers 4 \
    --batch-size 256 \
    --criterion CrossEntropyLoss

The pipeline is now running and you don't have to do anything else; just wait for the best model!

Visualizations

You can use a W&B account for visualisation right after running pip install wandb:

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results

TensorBoard can also be used for visualisation:

tensorboard --logdir=/catalyst.classification/logs

Confusion matrix

4. Results

The results of all experiments are stored locally in WORKDIR, which defaults to catalyst.classification/logs. The results of a single experiment, for instance catalyst.classification/logs/logdir-191010-141450-c30c8b84, contain:

checkpoints

  • The directory contains all checkpoints: best, last, and those of every stage.
  • best.pth and last.pth can also be found in the corresponding experiment in your W&B account.

configs

  • The directory contains the experiment's configs for reproducibility.

logs

  • The directory contains all logs of the experiment.
  • Metrics and logs can also be viewed in the corresponding experiment in your W&B account.

code

  • The directory contains the code that was used to run the experiment. This is necessary for complete reproducibility.
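
To locate the newest experiment and its checkpoints programmatically, something like the following sketch could be used. It is not part of the repository; it only assumes the layout described above (one directory per experiment inside WORKDIR, each with a checkpoints folder holding .pth files). The helper names are hypothetical.

```python
from pathlib import Path


def latest_experiment(workdir):
    """Return the most recently modified experiment directory, or None."""
    runs = [p for p in Path(workdir).iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def list_checkpoints(experiment_dir):
    """Sorted names of the .pth files in the experiment's checkpoints folder."""
    checkpoints_dir = Path(experiment_dir) / "checkpoints"
    if not checkpoints_dir.is_dir():
        return []
    return sorted(p.name for p in checkpoints_dir.glob("*.pth"))
```

For example, `list_checkpoints(latest_experiment("./logs"))` returns the checkpoint file names of the latest run, assuming ./logs contains at least one experiment.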

5. Customize own pipeline

For your future experiments, the framework provides powerful configs that let you tune the whole classification pipeline in a controlled and reproducible way.

Configure your experiments

  • Common settings of stages of training and model parameters can be found in catalyst.classification/configs/_common.yml.

    • model_params: detailed configuration of models, including:
      • model, for instance MultiHeadNet
      • detailed architecture description
      • using pretrained model
    • stages: you can configure training or inference in several stages with different hyperparameters. In our example:
      • optimizer params
      • first learn the head(s), then train the whole network
  • The CONFIG_TEMPLATE with the experiment's other hyperparameters, such as data_params and callbacks_params, is here: catalyst.classification/configs/templates/main.yml. The config allows you to define:

    • data_params: path, batch size, num of workers and so on
    • callbacks_params: callbacks are used to execute code during training, for example to compute metrics or save checkpoints. Catalyst provides a wide variety of helpful callbacks, and you can also write your own.

You can find many more options for configuring experiments in the Catalyst documentation.

6. Autolabel

Goals

The classic way to shrink the amount of unlabeled data once you have a trained model is to run the unlabeled dataset through the model and automatically label every image whose prediction confidence is above a threshold. The automatically labeled data is then added to the training set to improve prediction accuracy.

To run the iterative process, specify the number of iterations (n-trials) and the confidence threshold for labeling an image. Each iteration does the following:

  • tune ResNetEncoder
  • train MultiHeadNet for image classification
  • predict unlabelled dataset
  • use most confident predictions as true labels
  • repeat
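
The loop in the list above can be sketched in a few lines. This is an illustration of the general pseudo-labeling idea, not the code behind catalyst-autolabel-pipeline.sh. `train_and_predict` is a hypothetical callable standing in for the train-and-predict steps, and the early stop when no prediction clears the threshold is an assumption.

```python
def pseudo_label_loop(train_and_predict, labeled, unlabeled, n_trials, threshold):
    """Iteratively move confident predictions from `unlabeled` into `labeled`.

    `train_and_predict(labeled, unlabeled)` is a hypothetical callable that
    trains on `labeled` and returns {item: (predicted_label, confidence)}
    for every item in `unlabeled`.
    """
    labeled = dict(labeled)
    unlabeled = set(unlabeled)
    for _ in range(n_trials):
        if not unlabeled:
            break  # nothing more to label
        predictions = train_and_predict(labeled, unlabeled)
        # Keep only predictions at or above the confidence threshold.
        confident = {
            item: label
            for item, (label, confidence) in predictions.items()
            if confidence >= threshold
        }
        if not confident:
            break  # assumption: stop early when no prediction is confident
        labeled.update(confident)
        unlabeled.difference_update(confident)
    return labeled, unlabeled
```

The function returns the enlarged labeled set together with the items that never reached the threshold, which still need manual labeling.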

Preparation

catalyst.classification/data/
    raw/
        all/
            ...
    clean/
        0/
            ...
        1/
            ...

Model training

Run in local environment:

CUDA_VISIBLE_DEVICES=0 \
CUDNN_BENCHMARK="True" \
CUDNN_DETERMINISTIC="True" \
bash ./bin/catalyst-autolabel-pipeline.sh \
  --workdir ./logs \
  --datadir-clean ./data/clean \
  --datadir-raw ./data/raw \
  --n-trials 10 \
  --threshold 0.8 \
  --config-template ./configs/templates/autolabel.yml \
  --max-image-size 224 \
  --num-workers 4 \
  --batch-size 256

Run in docker:

docker run -it --rm --shm-size 8G --runtime=nvidia \
  -v $(pwd):/workspace/ \
  -e "CUDA_VISIBLE_DEVICES=0" \
  -e CUDNN_BENCHMARK="True" \
  -e CUDNN_DETERMINISTIC="True" \
  catalyst-classification bash ./bin/catalyst-autolabel-pipeline.sh \
    --workdir ./logs \
    --datadir-clean ./data/clean \
    --datadir-raw ./data/raw \
    --n-trials 10 \
    --threshold 0.8 \
    --config-template ./configs/templates/autolabel.yml \
    --max-image-size 224 \
    --num-workers 4 \
    --batch-size 256

Results of autolabeling

Out:

Predicted: 23 (100.00%)
...
Pseudo Labeling done. Nothing more to label.

Logs for visualising the training runs can be found here: ./logs/autolabel

Labeled raw data can be found here: /data/data_clean/dataset.csv
