vermasrijan / srijan-gsoc-2020

Licence: MIT license
Healthcare-Researcher-Connector Package: Federated Learning tool for bridging the gap between Healthcare providers and researchers

Healthcare-Researcher-Connector (HRC) Package:

A Federated Learning repository for simulating decentralized training for common biomedical use-cases

Mentors: Anton Kulaga, Ivan Shcheklein, Dmitry Petrov, Vladyslava Tyshchenko, Dmitry Nowicki

About

  • High-quality data exists as isolated islands on devices such as mobile phones and personal computers across the globe, and is protected by strict privacy-preserving laws.
  • Federated Learning provides a way to connect machine learning models to this disjoint data regardless of its location and, more importantly, without breaching privacy laws.
  • In biomedical research, the sharing and use of human biomedical data is also heavily restricted and regulated by multiple laws. Such data-sharing restrictions protect patient privacy, but they also impede the pace of biomedical research, slow down the development of treatments for various diseases, and often cost human lives.
  • The COVID-19 pandemic is unfortunately a good illustration of how the inaccessibility of clinical training data leads to casualties that could otherwise be avoided.
  • This repository is devoted to addressing this issue for the most common biomedical use-cases, such as gene expression data.

Intent

  • This is an introductory project for simulating easy-to-deploy Federated Learning (FL) on decentralized biomedical datasets.
    • A user can simulate FL training either locally (using localhost) or remotely (on several machines).
    • A user can also compare centralized and decentralized training metrics.
  • Technology Stack used:
  • Example Dataset used:
    • GTEx: The Common Fund's Genotype-Tissue Expression (GTEx) Program established a data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals.
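
To make the idea of decentralized training concrete, here is a minimal sketch of federated averaging, one common way to combine models trained on separate clients. It is an illustration only: this repository builds on the OpenMined stack and may aggregate updates differently, and the function name `federated_average` and the toy weights are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients that hold the same number of samples.
h1 = {"layer": [1.0, 2.0]}
h2 = {"layer": [3.0, 4.0]}
global_weights = federated_average([h1, h2], [8499, 8499])
print(global_weights)
```

Only model parameters are exchanged in this scheme; the raw samples stay with each client, which is the property that makes the approach attractive for restricted biomedical data.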

GSoC Blog Post

Installation and Initialization

  • NOTE: All testing has been done on macOS / Linux-based systems.
  • Step 1: Install Docker & Docker-Compose, and pull required images from DockerHub
    1. To install Docker, just follow the docker documentation.
    2. To install Docker-Compose, just follow the docker-compose documentation.
    3. Start your docker daemon
    4. Pull grid-node image : docker pull srijanverma44/grid-node:v028
    5. Pull grid-network image : docker pull srijanverma44/grid-network:v028
    • The grid-node image is about 2 GB and the grid-network image about 300 MB, so expect large downloads.
    • NOTE: These images have been taken from the OpenMined stack. Refer to the PySyft and PyGrid repositories for more details.
  • Step 2: Install dependencies via conda
    1. Install Miniconda, for your operating system, from https://conda.io/miniconda.html
    2. git clone https://github.com/vermasrijan/srijan-gsoc-2020.git
    3. cd srijan-gsoc-2020
    4. conda env create -f environment.yml
    5. conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
  • Step 3: Install GTEx V8 Dataset
    • Pull samples and expressions data using the following command:
dvc pull
  • The above command downloads the GTEx samples and expressions data into the data/gtex directory from a Google Drive remote.
  • On first use, you may be prompted to enter a verification code, because DVC needs to be granted access to your Google Drive API.
  • To do so, open the URL displayed in your CLI, copy the code, paste it into the CLI, and press Enter. (For more info, refer to 1 & 2)

Local execution

Usage

  • src/initializer.py is a Python script that starts either centralized or decentralized training.
  • The script creates a compose YAML file, initializes the client/network containers, runs FL or centralized training, and finally stops the running network/node containers.
  1. Make sure your Docker daemon is running.
  2. Run the following command:
    • python src/initializer.py
Usage: initializer.py [OPTIONS]

Options:
  --samples_path TEXT       Input path for samples
  --expressions_path TEXT   Input for expressions
  --train_type TEXT         Either centralized or decentralized fashion
  --dataset_size INTEGER    Size of data for training
  --split_type TEXT         balanced / unbalanced / iid / non_iid
  --split_size FLOAT        Train / Test Split
  --n_epochs INTEGER        No. of Epochs / Rounds
  --metrics_path TEXT       Path to save metrics
  --model_save_path TEXT    Path to save trained models
  --metrics_file_name TEXT  Custom name for metrics file
  --no_of_clients INTEGER   Clients / Nodes for decentralized training
  --swarm TEXT              Option for switching between docker compose vs docker stack
  --no_cuda TEXT            no_cuda = True means not to use CUDA. Default --> use CPU
  --tags TEXT               Give tags for the data, which is to be sent to the nodes
  --node_start_port TEXT    Start port No. for a node
  --grid_address TEXT       grid address for network
  --grid_port TEXT          grid port for network
  --help                    Show this message and exit.
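
The flags listed above can also be assembled programmatically, which helps when scripting several runs. The sketch below only builds the argument list; the helper name `build_command` is hypothetical, and it assumes the script is invoked from the repository root.

```python
import sys
from typing import List, Optional


def build_command(train_type: str, dataset_size: int, n_epochs: int,
                  no_of_clients: Optional[int] = None) -> List[str]:
    """Build an argument list for src/initializer.py from a few of its flags."""
    if train_type not in ("centralized", "decentralized"):
        raise ValueError("train_type must be 'centralized' or 'decentralized'")
    cmd = [
        sys.executable, "src/initializer.py",
        "--train_type", train_type,
        "--dataset_size", str(dataset_size),
        "--n_epochs", str(n_epochs),
    ]
    # The client count only applies to decentralized training.
    if train_type == "decentralized" and no_of_clients is not None:
        cmd += ["--no_of_clients", str(no_of_clients)]
    return cmd


cmd = build_command("decentralized", 17000, 50, no_of_clients=2)
print(" ".join(cmd[1:]))
# To launch the run (the Docker daemon must be running):
#   import subprocess; subprocess.run(cmd, check=True)
```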

Centralized Training

  • Example command:
python src/initializer.py --train_type centralized --dataset_size 17000 --n_epochs 50        
  • Centralized training example output, using 50 epochs:
============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A centralized FASHION..>----
DATASET SIZE: 17000
Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666
Epoch: 1 Training loss: 0.00010540  | Training Accuracy: 0.1669
.
.
Epoch: 48 Training loss: 9.3619e-05  | Training Accuracy: 0.4356
Epoch: 49 Training loss: 9.3567e-05  | Training Accuracy: 0.4359
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 43.217 seconds
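
If you capture this console output, the per-epoch values can be pulled out for plotting or comparison. The following is a minimal sketch that assumes the line format shown in the example above; the function name `parse_centralized_log` is hypothetical, and the saved metrics file may be a more convenient source.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# "Epoch: 48 Training loss: 9.3619e-05  | Training Accuracy: 0.4356"
EPOCH_LINE = re.compile(
    r"Epoch:\s*(\d+)\s+Training loss:\s*([0-9.eE+-]+)\s*\|\s*Training Accuracy:\s*([0-9.]+)"
)


def parse_centralized_log(text: str) -> List[Tuple[int, float, float]]:
    """Return (epoch, loss, accuracy) tuples for every matching log line."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_LINE.search(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), float(match.group(3))))
    return rows


sample = """\
Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666
Epoch: 49 Training loss: 9.3567e-05  | Training Accuracy: 0.4359
"""
rows = parse_centralized_log(sample)
print(rows)
```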

DVC Centralized Stage

dvc repro centralized_train

Decentralized Training

  • Example command:
python src/initializer.py --train_type decentralized --dataset_size 17000 --n_epochs 50 --no_of_clients 2     
  • Decentralized training example output, using 50 epochs:
  • Distribution information, such as the number of samples held by each client, is displayed first.
============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A decentralized FASHION..>----
DATASET SIZE: 17000
TOTAL CLIENTS: 2
DATAPOINTS WITH EACH CLIENT:
client_h1: 8499 ; Label Count: {0: 1445, 1: 1438, 2: 1429, 3: 1432, 4: 1394, 5: 1361}
client_h2: 8499 ; Label Count: {0: 1388, 1: 1395, 2: 1404, 3: 1401, 4: 1439, 5: 1472}
---<STARTING DOCKER IMAGE>----
====DOCKER STARTED!=======
Go to the following addresses: ['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']
Press Enter to continue...
-------<USING CPU FOR TRAINING>-------
WORKERS:  ['h1', 'h2']
Train Epoch: 0 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.164
Train Epoch: 0 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.192
Train Epoch: 1 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.172
Train Epoch: 1 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.229
.
.
Train Epoch: 49 | With h2 data |: [8499/16998 (50%)]	Train Loss: 0.000187 | Train Acc: 0.384
Train Epoch: 49 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000187 | Train Acc: 0.389
---<STOPPING DOCKER NODE/NETWORK CONTAINERS>----
381c4f79fb5c
c203c2f6fd62
1d3ccce7f732
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 380.418 seconds
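
The per-client sample and label counts printed at the start of the run can be reproduced in spirit with a simple IID partition. The sketch below is an assumption about how such a split might work, not the repository's actual splitting code; `iid_split` and the toy labels are hypothetical, and the client names only mirror those in the log.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def iid_split(labels: Sequence[int], n_clients: int, seed: int = 0) -> Dict[str, List[int]]:
    """Shuffle sample indices and hand an equal-sized shard to each client."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    shard = len(indices) // n_clients  # leftover samples are dropped
    return {
        f"client_h{k + 1}": indices[k * shard:(k + 1) * shard]
        for k in range(n_clients)
    }


# Toy dataset: 120 samples spread evenly over 6 classes.
labels = [i % 6 for i in range(120)]
shards = iid_split(labels, n_clients=2)
for name, idx in shards.items():
    counts = Counter(labels[i] for i in idx)
    print(f"{name}: {len(idx)} ; Label Count: {dict(sorted(counts.items()))}")
```

Because the indices are shuffled before being divided, each client receives a roughly uniform mix of classes, similar to the near-equal label counts shown in the example output.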

DVC Decentralized Stage

dvc repro decentralized_train

Metrics

  • NOTE: By default, metrics are saved in the data/metrics directory.
  • Pass the --metrics_path <path> flag to change this directory.

Localhosts Example Screenshots

  1. Following is what you may see at http://0.0.0.0:5000
  2. Following is what you may see at http://0.0.0.0:5000/connected-nodes
  3. Following is what you may see at http://0.0.0.0:5000/search-available-tags
  4. Following is what you may see at http://0.0.0.0:3000

Remote Execution

  • Make sure that firewalls are disabled on both the client and the server side.
  • Docker-Compose is required in this section.

Server Side

  • docker-compose -f gridnetwork-compose.yml up

Client Side

  • STEP 1: Configure the environment variable called NETWORK so that its value is <SERVER_IP_ADDRESS>.
  • STEP 2: docker-compose -f gridnode-compose.yml up. You can edit this compose file to add more clients if you'd like.
  • NOTE: Remote execution has not yet been properly tested.
  • In Progress...
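
Before starting the client containers, it can help to confirm that the server is reachable from the client machine. The following is a minimal sketch using only the standard library; the helper name `is_reachable` is hypothetical, and port 5000 is an assumption carried over from the local example, where the network is served at http://0.0.0.0:5000.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Replace the loopback address with <SERVER_IP_ADDRESS> for a real check.
server_ip = "127.0.0.1"
print(is_reachable(server_ip, 5000, timeout=1.0))
```

A result of False usually points to a firewall rule, a wrong address, or a network container that has not started yet.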

Running DVC stages

  • DVC stages are defined in the dvc.yaml file. To run a stage, use dvc repro <stage_name>.

Notebooks

  • Notebooks, given in this repository, simulate decentralized training using 2 clients.
  • Docker-compose will be required in this section as well!
  • STEP 1: docker-compose -f notebook-docker-compose.yml up
  • STEP 2: conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
  • STEP 3: Go to the following addresses:
['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']
  • STEP 4: Start Jupyter Lab
  • STEP 5: Run data owner notebook: notebooks/data-owner_GTEx.ipynb
  • STEP 6: Run model owner notebook: notebooks/model-owner_GTEx.ipynb
  • STEP 7: Stop the running node/network containers:
docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-network:v028 --format="{{.ID}}"))
docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-node:v028 --format="{{.ID}}"))

NOTE:

  • Notebooks given in this repository have been taken from this branch and have been modified.

Hyperparameter Optimization

python src/tune.py --help
  • In Progress...

Testing

  • Test Centralized training:
dvc repro centralized_test
  • Test Decentralized training:
dvc repro decentralized_test

Known Issues

  1. While creating an environment:
    • While creating an environment on a Linux machine, you may get the following error: No space left on device. (refer here)
    • Solution:
      • export TMPDIR=$HOME/tmp (i.e., change the location of the /tmp directory)
      • mkdir -p $TMPDIR
      • source ~/.bashrc, and then run the following command:
      • conda env create -f environment.yml
  2. While training:
    • Some errors while training in a decentralized way:
      • ImportError: sys.meta_path is None, Python is likely shutting down
      • Solution - NOT YET RESOLVED!
  3. Notebooks:
    • Data transmission (i.e., sending large tensors to the nodes) may be slow. (refer this)
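
For the "No space left on device" issue above, a quick check of the free space in the temporary directory can confirm whether relocating TMPDIR is needed. This is a minimal sketch; the helper name `free_gib` is hypothetical, and how much space the environment actually needs is not stated in this README.

```python
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 1024 ** 3


# tempfile.gettempdir() honours the TMPDIR environment variable when it is set.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
```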

Tutorials / References

  1. OpenMined Welcome Page, high level organization and projects
  2. OpenMined full stack, well explained
  3. Understanding PyGrid and the use of data-centric FL
  4. OpenMined RoadMap
  5. What is PyGrid demo
  6. Iterative, DVC: Data Version Control - Git for Data & Models (2020) DOI:10.5281/zenodo.012345.
  7. iterative.ai
  8. DVC Tutorials

Project Status

Under Development: Please note that the project is in an early development stage and not all features have been tested yet.

Acknowledgements

  1. I would like to thank all my mentors for taking the time to mentor me and for their invaluable suggestions throughout. I truly appreciate their constant trust and encouragement!

  2. Open Bioinformatics Foundation admins, helpdesk and the whole community

  3. OpenMined Community, for putting together such a beautiful tech stack and for their constant help throughout!

  4. Systems Biology of Aging Group, for providing me with useful resources, for trusting me throughout and for their constant feedback!

  5. Iterative.ai and DVC, for making all of our lives so much easier now :)

  6. GSoC organizers, managers and Google.