greenelab / Pancancer

Licence: bsd-3-clause
Building classifiers using cancer transcriptomes across 33 different cancer-types

Gene expression machine learning classifiers from TCGA PanCancerAtlas

Gregory Way and Casey Greene

Detecting system-wide changes in whole transcriptomes

A transcriptome can describe the total state of a tumor at a snapshot in time. In this repository, we use cancer transcriptomes from The Cancer Genome Atlas PanCancerAtlas project to interrogate gene expression states induced by deleterious mutations and copy number alterations.

The code in this repository is flexible and can build a Pan-Cancer classifier for any combination of genes and cancer-types using gene expression, mutation, and copy number data. In this repository, we provide examples for building classifiers to detect aberration in TP53 and Ras signalling.

Ras Signalling

The Ras signalling pathway is a major player in cancer development and treatment resistance. We observed that nearly 60% of all tumors in TCGA have mutations or copy number alterations in at least one of 38 core pathway genes (Sanchez-Vega et al. 2018).

We applied our approach to detect Ras pathway activation, using KRAS, HRAS, and NRAS gain-of-function mutations and copy number gains to define our gold standard Ras hyperactivation events. We trained a supervised classifier to detect when a tumor has activated Ras.
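As a rough illustration of the gold standard definition above, the sketch below labels a tumor as Ras-activated when it carries a gain-of-function mutation or a copy number gain in KRAS, HRAS, or NRAS. The sample IDs, the dictionary-of-sets layout, and the `ras_status` helper are hypothetical and chosen only for illustration. They are not taken from the repository's code, and the sketch assumes the mutation calls have already been restricted to gain-of-function events.

```python
# Hypothetical sketch: combine mutation and copy number calls into one label.
# The data layout and names below are illustrative assumptions.
RAS_GENES = ("KRAS", "HRAS", "NRAS")


def ras_status(samples, mutated, cn_gain, genes=RAS_GENES):
    """Return {sample: 1 or 0}, where 1 marks a gold standard Ras event.

    mutated and cn_gain map a sample ID to the set of genes with a
    gain-of-function mutation or a copy number gain, respectively.
    """
    status = {}
    for sample in samples:
        has_mutation = any(g in mutated.get(sample, set()) for g in genes)
        has_gain = any(g in cn_gain.get(sample, set()) for g in genes)
        status[sample] = int(has_mutation or has_gain)
    return status


samples = ["S1", "S2", "S3", "S4"]
mutated = {"S1": {"KRAS"}, "S3": {"TP53"}}   # S3 is mutated, but not in a Ras gene
cn_gain = {"S2": {"NRAS"}, "S3": {"EGFR"}}   # S2 has a Ras copy number gain
labels = ras_status(samples, mutated, cn_gain)
```

The resulting 0/1 vector plays the role of the response variable that a supervised classifier is trained to predict from gene expression.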

For more details about the approach, see our paper published in Cell Reports. The paper should be cited as:

Way, GP, Sanchez-Vega, F, La, K, Armenia, J, Chatila, WK, Luna, A, Sander, A, Cherniack, AD, Mina, M, Ciriello, G, Schultz, N, The Cancer Genome Atlas Research Network, Sanchez, Y, Greene, CS. 2018. Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas. Cell Reports 23(1):172-180.e3 doi:10.1016/j.celrep.2018.03.046

Ras signalling classifier identifies phenocopying NF1 loss of function events

We have previously described the ability of a machine learning classifier to detect an NF1 inactivation signature using glioblastoma data (Way et al. 2016). There, we applied an ensemble of logistic regression classifiers to the problem, but the solutions were unstable and overfit. To address these issues, we posited that we could leverage data from diverse cancer types to build a pan-cancer NF1 classifier. We also hypothesized that a Ras classifier would be able to detect tumors with NF1 inactivation, since NF1 directly inhibits Ras activity.

TP53

We are also interested in building a classifier to detect TP53 inactivation. TP53 is the most frequently mutated gene in cancer and regulates several important oncogenic processes such as apoptosis and DNA damage response (DDR). We include a pipeline to build and evaluate a machine learning TP53 classifier. See tp53_analysis.sh for more details.

The description for this analysis can be viewed in the following publication:

Knijnenburg, TA, Wang, L, Zimmermann, MT, Chambwe, N, Gao, GF, Cherniack, AD, Fan, H, Shen, H, Way, GP, Greene, CS, Liu, Y, Akbani, R, Feng, B, Donehower, LA, Miller, C, Shen, Y, Karimi, M, Chen, H, Kim, P, Jia, P, Shinbrot, E, Zhang, S, Liu, J, Hu, H, Bailey, MH, Yau, C, Wolf, D, Zhao, Z, Weinstein, J, Li, L, Ding, L, Mills, GB, Laird, PW, Wheeler, DA, Shmulevich, I, The Cancer Genome Atlas Research Network, Monnat Jr, RJ, Xiao, Y, Wang, C. 2018. Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas. Cell Reports 23(1):239-254.e3 doi:10.1016/j.celrep.2018.03.076

Open Access Data

All data were released by the TCGA PanCancerAtlas project. The compendium of papers is described here. Supplementary data from these papers can be downloaded from the NCI. The specific data used in the analyses presented here are archived on Zenodo. Gene expression and copy number data can be accessed here.

See scripts/initialize/download_data.sh for more details.

Also note that the definitions for all TCGA cancer-type acronyms are stored in data/tcga_dictionary.tsv.
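A minimal sketch of reading such a tab-separated dictionary into a lookup table follows. The column layout (acronym in the first column, full name in the second, with one header row) is an assumption made for illustration and should be checked against the actual file. The `load_acronyms` helper and the in-memory example rows are hypothetical.

```python
import csv
import io


def load_acronyms(handle, key_col=0, value_col=1, has_header=True):
    """Read a tab-separated file into {acronym: definition}.

    The column positions are assumptions; adjust them to match the file.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header row
    needed = max(key_col, value_col)
    return {row[key_col]: row[value_col] for row in reader if len(row) > needed}


# In-memory stand-in for the real file, used only to show the call.
example = io.StringIO(
    "acronym\tdisease\n"
    "GBM\tglioblastoma multiforme\n"
    "LGG\tbrain lower grade glioma\n"
)
lookup = load_acronyms(example)
```

Against the repository file, the same function would be called as `load_acronyms(open("data/tcga_dictionary.tsv", newline=""))`, assuming the layout described above.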

Usage

Initialization

The repository must be cloned onto a local machine before analyses can proceed.

# Make sure git-lfs (https://git-lfs.github.com/) is installed before cloning
# If not, run `git lfs install`
git clone git@github.com:greenelab/pancancer.git

cd pancancer

Example Scripts

We provide two distinct example pipelines, one for predicting TP53 loss of function and one for predicting Ras signalling hyperactivation.

  1. TP53 loss of function (see tp53_analysis.sh)
  2. Ras signalling hyperactivation (see ras_analysis.sh)

Customization

For custom analyses, use the scripts/pancancer_classifier.py script with command line arguments.

python scripts/pancancer_classifier.py ...

| Flag | Required/Default | Description |
| --- | --- | --- |
| --genes | Required | Build a classifier for the input gene symbols |
| --diseases | Auto | The disease types to use in building the classifier |
| --folds | 5 | Number of cross validation folds |
| --drop | False | Decision to drop input genes from expression matrix |
| --copy_number | False | Integrate copy number data to gene event |
| --filter_count | 15 | Default options to filter diseases if none are specified |
| --filter_prop | 0.05 | Default options to filter diseases if none are specified |
| --num_features | 8000 | Number of MAD genes used to build classifier |
| --alphas | 0.1,0.15,0.2,0.5,0.8,1 | The alpha grid to search over in parameter sweep |
| --l1_ratios | 0,0.1,0.15,0.18,0.2,0.3 | The l1 ratio grid to search over in parameter sweep |
| --alt_genes | None | Alternative genes to test classifier performance |
| --alt_diseases | Auto | Alternative diseases to test classifier performance |
| --alt_filter_count | 15 | Filtering used for alternative disease classification |
| --alt_filter_prop | 0.05 | Filtering used for alternative disease classification |
| --alt_folder | Auto | Location to save all classifier figures |
| --remove_hyper | False | Decision to remove hyper mutated tumors |
| --keep_intermediate | False | Decision to keep intermediate ROC curve metrics |
| --x_matrix | raw | If not "raw", then the filename storing the features |
| --shuffled | False | Shuffle the X matrix for better training |
| --shuffled_before_training | False | Remove correlational structure in the data |
| --no_mutation | True | Decision to remove mutation data from the input matrix |
| --drop_rasopathy | False | Decision to drop all rasopathy genes from the X matrix |
| --drop_covariates | False | Decision to drop all covariate information from the X matrix |
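
To make two of these options more concrete, the sketch below shows one plausible reading of `--num_features` (keep the genes with the largest median absolute deviation, MAD) and of `--filter_count` with `--filter_prop` (keep a disease only if it has enough positive samples, both in absolute number and as a proportion). The function names, the data layout, and the strict `>` thresholds are assumptions made for illustration. This is not the implementation in scripts/pancancer_classifier.py.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])


def top_mad_genes(expression, num_features):
    """Rank genes by MAD (most variable first) and keep the top num_features.

    expression maps a gene name to its expression values across samples.
    """
    ranked = sorted(expression, key=lambda g: mad(expression[g]), reverse=True)
    return ranked[:num_features]


def filter_diseases(status_by_disease, filter_count=15, filter_prop=0.05):
    """Keep diseases with enough positive samples by count and by proportion.

    status_by_disease maps a disease acronym to a list of 0/1 labels.
    The strict '>' comparisons are an assumption.
    """
    kept = []
    for disease, labels in status_by_disease.items():
        positives = sum(labels)
        if labels and positives > filter_count and positives / len(labels) > filter_prop:
            kept.append(disease)
    return kept


# Toy inputs with hypothetical gene and disease names.
expression = {"A": [1, 1, 1, 1], "B": [0, 5, 10, 15], "C": [2, 3, 2, 3]}
selected = top_mad_genes(expression, num_features=2)

status = {
    "X": [1] * 20 + [0] * 80,    # 20 positives, 20% of samples
    "Y": [1] * 5 + [0] * 95,     # too few positives
    "Z": [1] * 16 + [0] * 984,   # enough positives, but only 1.6% of samples
}
kept = filter_diseases(status)
```

In this toy input, gene A is constant across samples and therefore has a MAD of zero, so it is the one dropped when only two features are kept.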