
ucbrise / flor

License: Apache-2.0
FLOR: Fast Low-Overhead Recovery. FLOR lets you log ML training data post-hoc, with hindsight.

Programming Languages

Python, JavaScript, HTML, Jupyter Notebook, CSS, Dockerfile

Projects that are alternatives of or similar to flor

neptune-client
📒 Experiment tracking tool and model registry
Stars: ✭ 348 (+182.93%)
Mutual labels:  logger, ml
Torch-Scope
A Toolkit for Training, Tracking, Saving Models and Syncing Results
Stars: ✭ 62 (-49.59%)
Mutual labels:  logger, tensorboard
plinycompute
A system for developing high-performance, data-intensive, distributed computing applications, tools, and libraries.
Stars: ✭ 27 (-78.05%)
Mutual labels:  ml
webpack-log
A logger for the Webpack ecosystem
Stars: ✭ 18 (-85.37%)
Mutual labels:  logger
guzzle-logger
Automatically log all API calls
Stars: ✭ 42 (-65.85%)
Mutual labels:  logger
deeplearning-paper-notes
Reading notes on deep learning papers
Stars: ✭ 36 (-70.73%)
Mutual labels:  ml
android-sdk
AppSpector is a debugging service for mobile apps
Stars: ✭ 39 (-68.29%)
Mutual labels:  logger
klog
KLog is a free, multiplatform ham radio logger. It runs natively on Linux, macOS, and Windows.
Stars: ✭ 31 (-74.8%)
Mutual labels:  logger
KmLogging
Kotlin multiplatform logging. High performance, composable and simple to use.
Stars: ✭ 21 (-82.93%)
Mutual labels:  logger
zpy
Synthetic data for computer vision. An open source toolkit using Blender and Python.
Stars: ✭ 251 (+104.07%)
Mutual labels:  ml
tfsum
Enable TensorBoard for TensorFlow Go API
Stars: ✭ 32 (-73.98%)
Mutual labels:  tensorboard
flutter-vision
iOS and Android app built with Flutter and Firebase. Includes Firebase ML Vision, Firestore, and Storage
Stars: ✭ 45 (-63.41%)
Mutual labels:  ml
youtube-or-pornhub
Service identification on encrypted traffic.
Stars: ✭ 26 (-78.86%)
Mutual labels:  ml
colabs
This repository holds the Google Colabs for the EdX TinyML Specialization
Stars: ✭ 73 (-40.65%)
Mutual labels:  ml
jovian-py
Collaboration platform for data science projects & Jupyter notebooks
Stars: ✭ 91 (-26.02%)
Mutual labels:  ml
kglib
TypeDB-ML is the Machine Learning integrations library for TypeDB
Stars: ✭ 523 (+325.2%)
Mutual labels:  ml
ZLToolKit
A lightweight C++11-based network framework; its thread-pool design supports high-concurrency network I/O
Stars: ✭ 1,302 (+958.54%)
Mutual labels:  logger
node-perj
A fast, flexible JSON logger.
Stars: ✭ 16 (-86.99%)
Mutual labels:  logger
hana-ml-samples
This project provides educational code examples for SAP HANA Predictive and Machine Learning scenarios, ranging from simple Predictive Analysis Library SQL examples to complete SAP HANA design-time “ML scenario” application content and HANA-ML Python notebook examples.
Stars: ✭ 67 (-45.53%)
Mutual labels:  ml
mildnet
Visual Similarity research at Fynd. Contains code to reproduce 2 of our research papers.
Stars: ✭ 76 (-38.21%)
Mutual labels:  ml

FLOR: Fast Low-Overhead Recovery

You can use FLOR to take checkpoints during model training. These checkpoints allow you to restore arbitrary training data post-hoc, and to do so efficiently, thanks to memoization and parallelism speedups on replay.

FLOR is a suite of machine learning tools for hindsight logging. Hindsight logging is an optimistic logging practice favored by agile model developers: log training metrics such as loss and accuracy by default, and selectively restore additional training data (tensor histograms, images, overlays, and the like) post-hoc, if and when there is evidence of a problem.

FLOR is software developed at UC Berkeley's RISE Lab, and is being released as part of an accompanying VLDB publication.

Installation

pip install pyflor

FLOR expects a recent version of Python (3.7+) and PyTorch (1.0+).
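To verify that your environment meets these requirements, you can check the versions directly (standard commands, nothing FLOR-specific):

python3 --version
python3 -c "import torch; print(torch.__version__)"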

git branch flor.shadow
git checkout flor.shadow
python3 examples/linear.py --flor linear

Run the linear.py script to test your installation. This script will train a small linear model on MNIST. Think of it as the "hello world" of deep learning. We will cover FLOR shadow branches later.

ls ~/.flor/linear

Confirm that FLOR saved checkpoints of the linear.py execution in your home directory. FLOR accesses and interprets the contents of ~/.flor automatically. Do watch out for the storage footprint, though: if you see disk space running out, check ~/.flor. FLOR includes utilities for spooling its checkpoints to S3.
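For example, a quick way to keep an eye on that footprint (plain shell, not a FLOR utility):

du -sh ~/.flor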

Preparing your Training Script

import flor
for epoch in flor.it(range(...)):
    ...

First, wrap the iterator of the main loop with FLOR's generator: flor.it. The generator enables FLOR to parallelize replay of the main loop, and to jump to an arbitrary epoch for data recovery. FLOR also relies on this generator for initialization and clean-up, so don't skip this step.
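To build intuition for why a generator fits here, consider a minimal sketch of the pattern. This is an illustration of the technique only, not FLOR's implementation:

def loop_wrapper(iterable):
    # Illustrative only: a generator can run setup before the first
    # iteration, observe every loop boundary, and guarantee cleanup
    # even if the loop body raises or the loop exits early.
    print("setup: e.g., open checkpoint storage")
    try:
        for i, item in enumerate(iterable):
            print(f"boundary {i}: e.g., decide whether to checkpoint")
            yield item
    finally:
        print("cleanup: e.g., flush and close logs")

for epoch in loop_wrapper(range(3)):
    pass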

import flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in flor.it(range(...)):
    if flor.SkipBlock.step_into('training_loop'):
        for data in trainloader:
            inputs, labels = data
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            print(f"loss: {loss.item()}")
    flor.SkipBlock.end(net, optimizer)
    eval(net, testloader)  # your evaluation routine (not Python's built-in eval)

Then, wrap the nested training loop inside a flor.SkipBlock as shown above. Add the stateful torch objects to flor.SkipBlock.end so FLOR checkpoints them periodically.

You can use SkipBlocks to memoize long-running code. Just make sure you give each SkipBlock a unique name (e.g. training_loop).
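For instance, you might memoize an expensive preprocessing step the same way. This is a sketch following the step_into/end pattern above; the block name 'preprocess' and the features variable are illustrative:

import flor

if flor.SkipBlock.step_into('preprocess'):
    # A stand-in for a slow step whose outputs replay can restore.
    features = [x * x for x in range(10_000_000)]
flor.SkipBlock.end(features)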

That's it! Your code is now ready for record-replay.

Training your model

git checkout flor.shadow
python3 training_script.py --flor NAME [your_script_flags]

Before we train your model, make sure that your model training code is part of a git repository. Model training is exploratory, and it's common to iterate dozens of times before finding the right fit. We'd hate for you to be manually responsible for managing all those versions. Instead, we ask you to create a FLOR shadow branch that we can automatically commit changes to. Think of it as a sandbox: you get the benefits of autosaving without worrying about us polluting your main branch with frequent, automatic commits. Later, you can merge the changes you like.

In FLOR, all experiments need a name. As your training scripts and configurations evolve, keep the same experiment name so FLOR associates the checkpoints as versions of the same experiment. If you want to re-use the name from the previous run, you may leave the field blank.

Hindsight Logging

import flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in flor.it(range(...)):
    if flor.SkipBlock.step_into('training_loop'):
        ...
    flor.SkipBlock.end(net, optimizer)
    eval(net, testloader)
    log_confusion_matrix(net, testloader)

Suppose you want to view a confusion matrix as it changes throughout training. Add the code to generate it at the end of each epoch, as shown above.
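The helper itself is user code, not part of FLOR. Here is one minimal sketch; the num_classes argument and the implementation are assumptions:

import torch

def log_confusion_matrix(net, testloader, num_classes=10):
    # Accumulate an [actual, predicted] count matrix over the test set.
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    net.eval()
    with torch.no_grad():
        for inputs, labels in testloader:
            preds = net(inputs).argmax(dim=1)
            for actual, predicted in zip(labels, preds):
                cm[actual, predicted] += 1
    print(cm)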

git checkout flor.shadow
python3 training_script.py --replay_flor

First switch to the FLOR shadow branch and select the version you wish to replay from the git log list. In our example, we won't check out an older version, because we want to replay the latest one, which is selected by default.

You tell FLOR to replay by setting the flag --replay_flor. Because FLOR performs fast replay, you can generalize this example to recover arbitrary training data post-hoc. In our example, FLOR will compute your confusion matrix and automatically skip the nested training loop by loading its checkpoints.

import flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in flor.it(range(...)):
    if flor.SkipBlock.step_into('training_loop', probed=True):
        ...
        log_tensor_histograms(net.parameters())
    flor.SkipBlock.end(net, optimizer)
    eval(net, testloader)
    log_confusion_matrix(net, testloader)

Now, suppose you also want TensorBoard to plot the tensor histograms. In this case, it is not possible to skip the nested training loop because we are probing intermediate data. We tell FLOR to step into the nested training loop by setting probed=True (an argument to the training loop's SkipBlock).
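The histogram helper is likewise user code. A minimal sketch using PyTorch's bundled TensorBoard writer might look like this; the writer, tag names, and step handling are assumptions:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # event files land under ./runs by default

def log_tensor_histograms(params, step=0):
    # Write one histogram per parameter tensor; pass the epoch as step
    # to align the histograms on TensorBoard's slider.
    for i, param in enumerate(params):
        writer.add_histogram(f"param_{i}", param.detach().cpu(), global_step=step)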

Although we can't skip the nested training loop, we can parallelize replay or re-execute just a fraction of the epochs (e.g. near the epoch where we see a loss anomaly).

git checkout flor.shadow
python3 training_script.py --replay_flor PID/NGPUS [your_flags]

As before, you tell FLOR to run in replay mode by setting --replay_flor. You also tell FLOR how many GPUs from the pool to use for parallelism, and you dispatch this script once per worker, varying the pid to span all the GPUs. To run segment 3 out of 5 segments, you would write: --replay_flor 3/5.
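For example, one way to dispatch five workers at once. This is a plain shell sketch, assuming pids are zero-indexed as in the epoch example below:

git checkout flor.shadow
for i in 0 1 2 3 4; do
    python3 training_script.py --replay_flor $i/5 &
done
wait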

If, instead of replaying all of training, you wish to re-execute only a fraction of the epochs, you can do so by setting pid and ngpus accordingly. Suppose you want to run just the tenth epoch of a training job that ran for 200 epochs: set pid to 9 and ngpus to 200.
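Concretely, that invocation would be:

git checkout flor.shadow
python3 training_script.py --replay_flor 9/200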

We provide additional examples in the examples directory. A good starting point is linear.py.

Publications

To cite this work, please refer to the Hindsight Logging paper (VLDB '21).

FLOR is open source software developed at UC Berkeley. Joe Hellerstein (databases), Joey Gonzalez (machine learning), and Koushik Sen (programming languages) are the primary faculty members leading this work.

This work is released as part of Rolando Garcia's doctoral dissertation at UC Berkeley, and has been the subject of study by Eric Liu and Anusha Dandamudi, both of whom completed their master's theses on FLOR. We thank Vikram Sreekanti, Dan Crankshaw, and Neeraja Yadwadkar for guidance, comments, and advice. Bobby Yan was instrumental in the development of FLOR and its corresponding experimental evaluation.

License

FLOR is licensed under the Apache v2 License.
