Healthcare-Researcher-Connector (HRC) Package:

A Federated Learning repository for simulating `decentralized training` for common biomedical use-cases

Mentors : Anton Kulaga, Ivan Shcheklein, Dmitry Petrov, Vladyslava Tyshchenko, Dmitry Nowicki

About
- Intent
- GSoC Blog Post
Installation and Initialization
Local Execution
Remote Execution
- Server Side
- Client Side
Running DVC stages
Notebooks
Hyperparameter Optimization
Testing
Known Issues
Tutorials / References
Project Status
Acknowledgements

About

High-quality data exists as islands on devices such as mobile phones and personal computers across the globe, and is protected by strict privacy-preserving laws.
Federated Learning provides a clever means of connecting machine learning models to this disjointed data regardless of its location and, more importantly, without breaching privacy laws.
In biomedical research, the sharing and use of human biomedical data is also heavily restricted and regulated by multiple laws. Such data-sharing restrictions protect patients' privacy, but at the same time they impede the pace of biomedical research, slow down the development of treatments for various diseases and often cost human lives.
The COVID-19 pandemic is unfortunately a good illustration of how inaccessibility of clinical training data leads to casualties that could otherwise be avoided.
This repository is devoted to addressing this issue for the most common biomedical use-cases, such as gene expression data.

Intent

This is an introductory project for simulating easy-to-deploy Federated Learning (FL) on decentralized biomedical datasets.
- A user can simulate FL training either locally (using localhost) or remotely (on several machines).
- A user can also compare centralized and decentralized training metrics.
Technology Stack used:
- OpenMined: PySyft, PyGrid
- DVC
- PyTorch
- Docker
Example Dataset used:
- GTEx: The Common Fund's Genotype-Tissue Expression (GTEx) Program established a data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals.

GSoC Blog Post

GSoC Journey 2020

Installation and Initialization

NOTE: All testing has been done on macOS and Linux systems.

Step 1: Install Docker & Docker-Compose, and pull required images from DockerHub
1. To install Docker, follow the Docker documentation.
2. To install Docker-Compose, follow the docker-compose documentation.
3. Start your Docker daemon.
4. Pull the grid-node image: docker pull srijanverma44/grid-node:v028
5. Pull the grid-network image: docker pull srijanverma44/grid-network:v028
- The images are large: grid-node is about 2 GB and grid-network is about 300 MB.
- NOTE: These images have been taken from the OpenMined stack. Refer to the PySyft and PyGrid repositories for more details.
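Before pulling the images, it can help to confirm that the tools from Step 1 are actually on your PATH. The following is only a minimal sketch: the helper name `find_missing_tools` is hypothetical, and the executable names are the usual ones, which may differ on your system.

```python
import shutil

# Command-line tools that Step 1 relies on. These are the usual executable
# names (an assumption); adjust them if your installation differs.
REQUIRED_TOOLS = ("docker", "docker-compose")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

This only checks that the executables exist; it does not verify that the Docker daemon is running.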
Step 2: Install dependencies via conda
1. Install Miniconda, for your operating system, from https://conda.io/miniconda.html
2. git clone https://github.com/vermasrijan/srijan-gsoc-2020.git
3. cd srijan-gsoc-2020
4. conda env create -f environment.yml
5. conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
Step 3: Install GTEx V8 Dataset
- Pull samples and expressions data using the following command:

dvc pull

The above command downloads the GTEx samples and expressions data into the data/gtex directory from a Google Drive remote.

The first time you run it, you may be prompted to enter a verification code, because DVC needs access to your Google Drive API.

To do that, open the URL displayed in your terminal, copy the code, paste it into the terminal and press Enter. (For more info, refer to 1 & 2)
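After the pull finishes, a quick way to confirm that something was downloaded is to list the files under data/gtex. This is a small sketch with a hypothetical helper name (`list_downloaded_files`); the exact file names that `dvc pull` produces are not stated here, so the code only counts whatever it finds.

```python
from pathlib import Path


def list_downloaded_files(data_dir="data/gtex"):
    """Return sorted relative paths of all files under data_dir.

    Returns an empty list if the directory does not exist.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


files = list_downloaded_files()
if files:
    print(f"Found {len(files)} file(s) under data/gtex")
else:
    print("data/gtex is missing or empty; 'dvc pull' may not have completed")
```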

Local execution

Usage

src/initializer.py is a Python script that starts either centralized or decentralized training.
It creates a compose YAML file, initializes the client/network containers, runs FL or centralized training, and finally stops the running network/node containers.

Make sure your Docker daemon is running, then run the following command:
- python src/initializer.py

Usage: initializer.py [OPTIONS]

Options:
  --samples_path TEXT       Input path for samples
  --expressions_path TEXT   Input for expressions
  --train_type TEXT         Either centralized or decentralized fashion
  --dataset_size INTEGER    Size of data for training
  --split_type TEXT         balanced / unbalanced / iid / non_iid
  --split_size FLOAT        Train / Test Split
  --n_epochs INTEGER        No. of Epochs / Rounds
  --metrics_path TEXT       Path to save metrics
  --model_save_path TEXT    Path to save trained models
  --metrics_file_name TEXT  Custom name for metrics file
  --no_of_clients INTEGER   Clients / Nodes for decentralized training
  --swarm TEXT              Option for switching between docker compose vs docker stack
  --no_cuda TEXT            no_cuda = True means not to use CUDA. Default --> use CPU
  --tags TEXT               Give tags for the data, which is to be sent to the nodes
  --node_start_port TEXT    Start port No. for a node
  --grid_address TEXT       grid address for network
  --grid_port TEXT          grid port for network
  --help                    Show this message and exit.
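When running several configurations back to back (for example, to compare centralized and decentralized training), it may be convenient to assemble the command line programmatically. The sketch below uses only option names from the usage text above; the helper `build_initializer_command` is hypothetical and not part of the repository.

```python
import sys


def build_initializer_command(**options):
    """Build an argument list for src/initializer.py from keyword options.

    Keys are option names from the usage text without the leading "--",
    e.g. train_type="centralized". Values are converted to strings.
    """
    command = [sys.executable, "src/initializer.py"]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


centralized = build_initializer_command(
    train_type="centralized", dataset_size=17000, n_epochs=50
)
decentralized = build_initializer_command(
    train_type="decentralized", dataset_size=17000, n_epochs=50, no_of_clients=2
)
print(" ".join(centralized[1:]))
print(" ".join(decentralized[1:]))

# To actually run one of them (requires the Docker daemon and the dataset):
# import subprocess
# subprocess.run(centralized, check=True)
```

Returning a list rather than a single string avoids shell quoting issues when the command is passed to `subprocess.run`.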

Centralized Training

Example command:

python src/initializer.py --train_type centralized --dataset_size 17000 --n_epochs 50

Centralized training example output, using 50 epochs:

============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A centralized FASHION..>----
DATASET SIZE: 17000
Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666
Epoch: 1 Training loss: 0.00010540  | Training Accuracy: 0.1669
.
.
Epoch: 48 Training loss: 9.3619e-05  | Training Accuracy: 0.4356
Epoch: 49 Training loss: 9.3567e-05  | Training Accuracy: 0.4359
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 43.217 seconds
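If you capture this console output to a file, the per-epoch loss and accuracy can be extracted for plotting or comparison. The following is a minimal sketch that assumes the line format shown in the example output above; the function name `parse_centralized_log` is hypothetical, and the format of the saved metrics file is not covered here.

```python
import re

# Matches lines such as:
# "Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666"
LINE_PATTERN = re.compile(
    r"Epoch:\s*(\d+)\s+Training loss:\s*([0-9.eE+-]+)\s*\|\s*"
    r"Training Accuracy:\s*([0-9.]+)"
)


def parse_centralized_log(text):
    """Return a list of (epoch, loss, accuracy) tuples found in the log text."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.search(line)
        if match:
            epoch, loss, accuracy = match.groups()
            rows.append((int(epoch), float(loss), float(accuracy)))
    return rows


sample = """\
Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666
Epoch: 49 Training loss: 9.3567e-05  | Training Accuracy: 0.4359
"""
rows = parse_centralized_log(sample)
first, last = rows[0], rows[-1]
print(f"accuracy went from {first[2]} to {last[2]} over {last[0] + 1} epochs")
```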

DVC Centralized Stage

dvc repro centralized_train

Decentralized Training

Example command:

python src/initializer.py --train_type decentralized --dataset_size 17000 --n_epochs 50 --no_of_clients 2

Decentralized training example output, using 50 epochs:

Distribution information, such as the total number of samples held by each client, is displayed first.

============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A decentralized FASHION..>----
DATASET SIZE: 17000
TOTAL CLIENTS: 2
DATAPOINTS WITH EACH CLIENT:
client_h1: 8499 ; Label Count: {0: 1445, 1: 1438, 2: 1429, 3: 1432, 4: 1394, 5: 1361}
client_h2: 8499 ; Label Count: {0: 1388, 1: 1395, 2: 1404, 3: 1401, 4: 1439, 5: 1472}
---<STARTING DOCKER IMAGE>----
====DOCKER STARTED!=======
Go to the following addresses: ['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']
Press Enter to continue...
-------<USING CPU FOR TRAINING>-------
WORKERS:  ['h1', 'h2']
Train Epoch: 0 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.164
Train Epoch: 0 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.192
Train Epoch: 1 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.172
Train Epoch: 1 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.229
.
.
Train Epoch: 49 | With h2 data |: [8499/16998 (50%)]	Train Loss: 0.000187 | Train Acc: 0.384
Train Epoch: 49 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000187 | Train Acc: 0.389
---<STOPPING DOCKER NODE/NETWORK CONTAINERS>----
381c4f79fb5c
c203c2f6fd62
1d3ccce7f732
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 380.418 seconds
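The "DATAPOINTS WITH EACH CLIENT" lines in the output above report how many samples and labels each client holds. To illustrate what a balanced, IID-style split looks like, here is a minimal sketch on toy data. It is not the repository's splitting code: the function `iid_split`, the choice to drop leftover samples, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import Counter


def iid_split(labels, n_clients, seed=0):
    """Shuffle sample indices and deal them out evenly to n_clients.

    Leftover samples (when the total is not divisible by n_clients) are
    dropped so that every client holds the same number of datapoints.
    """
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    per_client = len(indices) // n_clients
    return {
        f"client_h{i + 1}": indices[i * per_client:(i + 1) * per_client]
        for i in range(n_clients)
    }


# Toy data with 6 classes, mirroring the 6 labels in the example output.
labels = [i % 6 for i in range(1200)]
for client, idx in iid_split(labels, n_clients=2).items():
    counts = Counter(labels[i] for i in idx)
    print(f"{client}: {len(idx)} ; Label Count: {dict(sorted(counts.items()))}")
```

Because the indices are shuffled before being dealt out, each client receives roughly the same label distribution, which is the pattern visible in the label counts above.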

DVC Decentralized Stage

dvc repro decentralized_train

Metrics

NOTE: By default, metrics will be saved in data/metrics directory.
You can pass in the --metrics_path <path> flag to change the default directory.

Localhosts Example Screenshots

Following is what you may see at http://0.0.0.0:5000
Following is what you may see at http://0.0.0.0:5000/connected-nodes
Following is what you may see at http://0.0.0.0:5000/search-available-tags
Following is what you may see at http://0.0.0.0:3000
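The addresses above follow a simple pattern: one network address with two sub-pages, plus one address per node. The sketch below reproduces that list for a given number of clients. The defaults (network port 5000, node ports starting at 3000 and increasing by one) are inferred from the example output, not confirmed against the script, and `expected_addresses` is a hypothetical helper.

```python
def expected_addresses(n_clients=2, grid_port=5000, node_start_port=3000,
                       host="0.0.0.0"):
    """List the network and node URLs for a local run.

    Assumes node i listens on node_start_port + i, as the example output
    suggests; the default ports are inferred from that output.
    """
    network = f"http://{host}:{grid_port}"
    urls = [
        network,
        f"{network}/connected-nodes",
        f"{network}/search-available-tags",
    ]
    urls += [f"http://{host}:{node_start_port + i}" for i in range(n_clients)]
    return urls


for url in expected_addresses():
    print(url)
```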

Remote Execution

Make sure firewalls are disabled on both the client and the server side.

Docker-compose will be required in this section.

Server Side

docker-compose -f gridnetwork-compose.yml up

Client Side

STEP 1: Configure the environment variable called NETWORK by replacing its value with <SERVER_IP_ADDRESS>.
STEP 2: Run docker-compose -f gridnode-compose.yml up. You can edit this compose file to add more clients if you'd like.
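The two client-side steps can also be scripted. This sketch assumes that gridnode-compose.yml reads NETWORK from the shell environment; if the variable is instead defined inside the compose file, edit the file directly. The exact value format expected (bare IP address or full URL) is not stated here, so the value is passed through unchanged. The helper name `client_compose_invocation` is hypothetical.

```python
import os


def client_compose_invocation(server_address, compose_file="gridnode-compose.yml"):
    """Return (command, env) for starting the client-side containers.

    NETWORK is set in a copy of the current environment, so the calling
    process's own environment is not modified.
    """
    env = dict(os.environ, NETWORK=server_address)
    command = ["docker-compose", "-f", compose_file, "up"]
    return command, env


command, env = client_compose_invocation("<SERVER_IP_ADDRESS>")
print("NETWORK=" + env["NETWORK"], " ".join(command))

# To start the containers (requires docker-compose and a reachable server):
# import subprocess
# subprocess.run(command, env=env, check=True)
```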

NOTE: Remote execution has not yet been tested properly.

In Progress...

Running DVC stages

DVC stages are defined in the dvc.yaml file. To run a stage, use dvc repro <stage_name>
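To run several stages in sequence from a script, `dvc repro` can be wrapped with `subprocess`. This is a minimal sketch: the stage names listed are the ones mentioned in this README, dvc.yaml may define others, and the helper names are hypothetical.

```python
import subprocess

# Stage names that appear in this README; dvc.yaml may define more.
KNOWN_STAGES = (
    "centralized_train",
    "decentralized_train",
    "centralized_test",
    "decentralized_test",
)


def dvc_repro_command(stage):
    """Return the `dvc repro` command for a stage, rejecting unknown names."""
    if stage not in KNOWN_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return ["dvc", "repro", stage]


def run_stages(stages):
    """Run the given stages in order, stopping at the first failure."""
    for stage in stages:
        subprocess.run(dvc_repro_command(stage), check=True)


print(" ".join(dvc_repro_command("centralized_train")))

# Example (requires dvc and the project data):
# run_stages(["centralized_train", "decentralized_train"])
```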

Notebooks

Notebooks, given in this repository, simulate decentralized training using 2 clients.

Docker-compose will be required in this section as well!

STEP 1: docker-compose -f notebook-docker-compose.yml up
STEP 2: conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
STEP 3: Go to the following addresses:

['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']

STEP 4: Initialize jupyter lab
STEP 5: Run data owner notebook: notebooks/data-owner_GTEx.ipynb
STEP 6: Run model owner notebook: notebooks/model-owner_GTEx.ipynb
STEP 7: STOP Node/Network running containers:

docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-network:v028 --format="{{.ID}}"))

docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-node:v028 --format="{{.ID}}"))
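The two shell one-liners above fail with a usage error when no matching container exists, because `docker stop` then receives no arguments. A Python equivalent that skips the cleanup in that case might look like the sketch below; the helper names are hypothetical, and only the `docker ps`, `docker stop` and `docker rm` commands already used above are invoked.

```python
import subprocess

IMAGES = ("srijanverma44/grid-network:v028", "srijanverma44/grid-node:v028")


def list_container_ids(image):
    """Return IDs of all containers (running or stopped) created from image."""
    result = subprocess.run(
        ["docker", "ps", "-a", "-q", "--filter", f"ancestor={image}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def cleanup_commands(container_ids):
    """Return the docker commands that stop and then remove the containers."""
    if not container_ids:
        return []  # nothing to do, so no docker call is made
    return [["docker", "stop", *container_ids], ["docker", "rm", *container_ids]]


def stop_and_remove(image):
    """Stop and remove every container created from image."""
    for command in cleanup_commands(list_container_ids(image)):
        subprocess.run(command, check=True)


# Example (requires a running Docker daemon):
# for image in IMAGES:
#     stop_and_remove(image)
```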

NOTE:

Notebooks given in this repository have been taken from this branch and have been modified.

Hyperparameter Optimization

python src/tune.py --help

In Progress...

Testing

Test Centralized training:

dvc repro centralized_test

Test Decentralized training:

dvc repro decentralized_test

Known Issues

While creating an environment:
- On a Linux machine, creating the environment may fail with the following error: No space left on device. (refer here)
- Solution:
  - export TMPDIR=$HOME/tmp (i.e. change /tmp directory location)
  - mkdir -p $TMPDIR
  - source ~/.bashrc , and then run the following command -
  - conda env create -f environment.yml
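To see whether the temporary directory is short on space before retrying, you can check the free space from Python. This is a small sketch with a hypothetical helper (`free_space_gib`); how much space the environment actually needs is not stated here.

```python
import shutil
import tempfile


def free_space_gib(path=None):
    """Return free space in GiB on the filesystem that holds path.

    Defaults to the temporary directory Python resolves, which takes the
    TMPDIR environment variable into account.
    """
    path = path or tempfile.gettempdir()
    return shutil.disk_usage(path).free / 2**30


print(f"Temporary directory: {tempfile.gettempdir()}")
print(f"Free space there: {free_space_gib():.1f} GiB")
```

Running it before and after exporting TMPDIR shows whether the new location is being picked up and has more room.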
While training:
- Some errors while training in a decentralized way:
  - ImportError: sys.meta_path is None, Python is likely shutting down
  - Solution - NOT YET RESOLVED!
Notebooks:
- Data transmission (i.e., sending large tensors to the nodes) may be slow. (refer to this)

Tutorials / References

OpenMined Welcome Page, high level organization and projects
OpenMined full stack, well explained
Understanding PyGrid and the use of data-centric FL
OpenMined RoadMap
What is PyGrid demo
Iterative, DVC: Data Version Control - Git for Data & Models (2020) DOI:10.5281/zenodo.012345.
iterative.ai
DVC Tutorials

Project Status

Under Development: Please note that the project is in an early development stage and not all features have been tested yet.

Acknowledgements

I would like to thank all my mentors for taking the time to mentor me and for their invaluable suggestions throughout. I truly appreciate their constant trust and encouragement!
Open Bioinformatics Foundation admins, helpdesk and the whole community
OpenMined Community, for putting together such a beautiful tech stack and for their constant help throughout!
Systems Biology of Aging Group, for providing me with useful resources, for trusting me throughout and for their constant feedback!
Iterative.ai and DVC, for making all of our lives so much easier now :)
GSoC organizers, managers and Google.
