SchlossLab / Topcuoglu_ML_mBio_2020

Licence: MIT license
Best practices for applying machine learning to bacterial 16S rRNA gene sequencing data

A framework for effective application of machine learning to microbiome-based classification problems

Abstract

Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

Overview

project
|- README                            # the top level description of content (this doc)
|- CONTRIBUTING                      # instructions for how to contribute to your project
|- LICENSE                           # the license for this project
|
|- data/                             # raw and primary data, are not changed once created
| |- process/                        # .tsv and .csv files generated with main.R that runs the models
| |- baxter.0.03.subsample.shared    # subsampled mothur generated file with OTUs from Marc Sze's analysis
| |- metadata.tsv                    # metadata with clinical information from Marc Sze's analysis
|
|- code/                             # any programmatic code
| |- learning/                       # generalization performance of model
| |- testing/                        # building final model
|
|- results/                          # all output from workflows and analyses
| |- tables/                         # tables and .Rmd code of the tables to be rendered with kable in R
| |- figures/                        # graphs, likely designated for manuscript figures
|
|- submission/
| |- manuscript.Rmd                  # executable Rmarkdown for this study, if applicable
| |- manuscript.md                   # Markdown (GitHub) version of the *.Rmd file
| |- manuscript.tex                  # TeX version of *.Rmd file
| |- manuscript.pdf                  # PDF version of *.Rmd file
| |- header.tex                      # LaTeX header file to format pdf version of manuscript
| |- references.bib                  # BibTeX formatted references
|
|- Makefile                          # Reproduce the manuscript, figures and tables
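
A checkout can be compared against the layout above before running anything. The following is a minimal shell sketch, not part of the repository: check_layout is a hypothetical helper, and its path list is simply copied from the tree above.

```shell
#!/usr/bin/env bash
# Sketch: report which paths from the Overview tree are missing under a
# project root. check_layout is an illustrative helper, not a repository script.

check_layout() {
  local root="${1:-.}"   # project root; defaults to the current directory
  local missing=0
  local path
  for path in \
    data/process \
    code/learning \
    code/testing \
    results/tables \
    results/figures \
    submission \
    Makefile
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every listed path exists.
  [ "$missing" -eq 0 ]
}
```

Calling check_layout with no argument checks the current directory; it prints one line per missing path and returns a non-zero status if anything is missing.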

How to use the outlined ML pipeline for your own project

How to regenerate this repository in R

Please take a look at the Makefile for more information about the workflow. Please also read submission/manuscript.pdf for a more detailed look at what we achieve with this ML pipeline.
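
Because some of the models take hours to train, it can help to preview the workflow before starting it. The sketch below uses make's dry-run flag (-n), which prints the commands for a target without executing them. preview_workflow is a hypothetical helper, and the sketch assumes the Makefile's default target is the one of interest; the actual target names are defined in the Makefile itself.

```shell
#!/usr/bin/env bash
# Sketch: list the commands make would run for the Makefile's default target
# without executing any of them. preview_workflow is an illustrative helper,
# not part of the repository.

preview_workflow() {
  local root="${1:-.}"   # directory containing the Makefile
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$root" && make -n)
}
```

Run preview_workflow from the project directory, or pass the project directory as its argument.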

  1. Clone the GitHub repository and change directory to the project directory.

      git clone https://github.com/SchlossLab/Topcuoglu_ML_mBio_2020.git
      cd Topcuoglu_ML_mBio_2020

  2. Our dependencies:

    • R version 3.5.0

    • The following R packages need to be installed in your environment: caret, rpart, xgboost, randomForest, kernlab, LiblineaR, pROC, tidyverse, cowplot, ggplot2, vegan, gtools, reshape2.

    • Everything needs to be run from the project directory.

    • Two datasets (OTU abundances and colonoscopy diagnoses of 490 patients) need to be downloaded from Sze MA, Schloss PD. 2018. Leveraging existing 16S rRNA gene surveys to identify reproducible biomarkers in individuals with colorectal tumors. mBio 9:e00630–18.

    • Update the caret package with our modifications by running the following script (first take a look at the script and change the R packages directory to where caret is installed):

      Rscript code/learning/load_caret_models.R

    These modifications are in data/caret_models/svmLinear3.R and data/caret_models/svm_Linear4.R

  3. Follow the Makefile to generate the manuscript.

    • The Makefile uses code/learning/main.R to run the pipeline, which sources four other scripts that are part of the pipeline:

      • To choose the model and model hyperparameters: code/learning/model_selection.R

      • To preprocess the dataset, split it 80-20, and train the model: code/learning/model_pipeline.R

      • To save the results of each model for each data split: code/learning/generateAUCs.R

      • To interpret the models: code/learning/permutation_importance.R
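
The dependency checks from the steps above can be scripted before starting a long run. The following is a minimal shell sketch under stated assumptions, not part of the repository: check_inputs, missing_r_packages, and preflight are hypothetical helpers, and the two input paths assume the downloaded datasets are the files that the Overview lists under data/. The package names are copied from the dependency list.

```shell
#!/usr/bin/env bash
# Pre-flight sketch (illustrative helpers, not repository scripts).

# Check the two input files; paths assumed from the Overview tree.
check_inputs() {
  local root="${1:-.}"
  local status=0
  local file
  for file in data/baxter.0.03.subsample.shared data/metadata.tsv; do
    if [ ! -f "$root/$file" ]; then
      echo "missing input: $file"
      status=1
    fi
  done
  return "$status"
}

# Print each R package from the dependency list that is not installed.
# Prints nothing when all are present. Requires Rscript on the PATH.
missing_r_packages() {
  Rscript -e 'cat(setdiff(commandArgs(trailingOnly = TRUE), rownames(installed.packages())), sep = "\n")' \
    caret rpart xgboost randomForest kernlab LiblineaR pROC tidyverse \
    cowplot ggplot2 vegan gtools reshape2
}

# Run the checks in order and stop at the first failure.
preflight() {
  local root="${1:-.}"
  local missing
  check_inputs "$root" || return 1
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH"
    return 1
  fi
  missing="$(missing_r_packages)"
  if [ -n "$missing" ]; then
    echo "missing R packages:"
    echo "$missing"
    return 1
  fi
  echo "ready: run Rscript code/learning/load_caret_models.R, then follow the Makefile"
}
```

Run preflight from the project directory. It only reports problems and does not install packages or download data, and it does not check the R version.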
