greenelab / RNAseq_titration_results

License: BSD-3-Clause
Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously

The full output of a version of this analysis is available at Figshare under the DOI: 10.6084/m9.figshare.5035997.v2

Summary

We performed a series of supervised and unsupervised machine learning evaluations, as well as differential expression analyses, to assess which normalization methods are best suited for combining data from microarray and RNA-seq platforms.

We evaluated five normalization approaches for all methods:

  1. log-transformation (LOG)
  2. non-paranormal transformation (NPN)
  3. quantile normalization (QN)
  4. Training Distribution Matching (TDM)
  5. standardizing scores (z-scoring; Z).
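
As a rough illustration of three of these approaches (LOG, QN, and Z), here is a minimal base R sketch on a made-up expression matrix. It is not the code used in this repository: the toy data, the function names, the pseudocount of 1, and the choice to z-score each gene across samples are all assumptions made for the example. NPN and TDM are left out because they rely on dedicated packages.

```r
# Toy expression matrix: genes in rows, samples in columns (made-up values).
set.seed(1)
expr <- matrix(rexp(20, rate = 0.1), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# LOG: log2 transformation; the pseudocount of 1 is an assumption.
log_transform <- function(x) log2(x + 1)

# QN: quantile normalization. Each value is replaced by the mean, across
# samples, of the sorted values at the same rank.
# Ties are broken by order of appearance, which is enough for a sketch.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Z: standardize each gene (row) to mean 0 and standard deviation 1.
z_score <- function(x) t(scale(t(x)))

log_expr <- log_transform(expr)
qn_expr <- quantile_normalize(log_expr)
z_expr <- z_score(log_expr)
```

After quantile normalization every sample shares the same set of values, only assigned to different genes, which is what makes distributions from different platforms directly comparable.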

A version of this project is detailed in our pre-print Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously.

We are actively making improvements to this codebase; see #12.

Breast Cancer Data

The Cancer Genome Atlas (TCGA) BRCA data used for these analyses are available on Zenodo.

# To download data, run in top directory:
sh brca_data_download.sh

Analysis

Machine Learning Pipeline

Here's a schematic overview of our machine learning experiments:

Overview of supervised and unsupervised machine learning experiments.

  1. 520 TCGA breast cancer samples run on both microarray and RNA-seq were split into a training set (2/3) and a holdout set (1/3).
  2. RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%), resulting in eleven training sets for each normalization method.
  3. Machine learning applications.
     A. Three supervised multi-class (BRCA PAM50 subtype) classifiers (LASSO, linear SVM, and Random Forest) were trained on each training set and tested on the microarray and RNA-seq holdout sets.
     B. The holdout sets were projected onto and back out of the training set space using two unsupervised techniques, Independent and Principal Components Analysis, to obtain reconstructed holdout sets. The classifiers from step 3A were then used to predict on the reconstructed holdout sets.

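
The split and titration described in steps 1 and 2 can be sketched in base R as follows. This is only an illustration under stated assumptions: the sample identifiers are hypothetical, and RNA-seq samples are drawn independently at random at each titration level, which may differ from how the scripts in this repository select them.

```r
set.seed(42)

# Hypothetical identifiers standing in for the 520 matched samples.
sample_ids <- sprintf("sample_%03d", 1:520)

# Hold out roughly one third of the samples; train on the rest.
n_holdout <- round(length(sample_ids) / 3)
holdout_ids <- sample(sample_ids, n_holdout)
train_ids <- setdiff(sample_ids, holdout_ids)

# Titration levels: 0%, 10%, ..., 100% RNA-seq in the training set.
pct_rnaseq <- seq(0, 100, by = 10)

# For each level, pick which training samples are taken from RNA-seq;
# the remaining training samples are taken from microarray.
titration_sets <- lapply(pct_rnaseq, function(pct) {
  n_seq <- round(length(train_ids) * pct / 100)
  seq_ids <- sample(train_ids, n_seq)
  list(pct_rnaseq = pct,
       rnaseq = seq_ids,
       microarray = setdiff(train_ids, seq_ids))
})
names(titration_sets) <- paste0("seq_", pct_rnaseq, "pct")

# Number of RNA-seq samples at each titration level.
sapply(titration_sets, function(s) length(s$rnaseq))
```

Each of the eleven entries describes one training set; a given normalization method would then be applied to the mixed-platform data before a classifier is trained on it.
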
# To run the machine learning pipeline, run in top directory:
sh run_machine_learning_experiments.sh

# To run one repeat of the subtype classifier pipeline, use:
Rscript run_experiments.R
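
To make the "projected onto and back out of the training set space" step more concrete, here is a minimal base R sketch using PCA only. It assumes random toy data, samples in rows, and an arbitrary number of retained components `k`; none of these come from the repository, and the ICA variant is not shown because it needs an additional package.

```r
set.seed(7)

# Toy data: samples in rows, genes in columns (random values for illustration).
train <- matrix(rnorm(30 * 8), nrow = 30)
holdout <- matrix(rnorm(10 * 8), nrow = 10)

# Fit PCA on the training set only.
pca <- prcomp(train, center = TRUE, scale. = FALSE)

# Number of components to keep (a hypothetical choice).
k <- 3
loadings <- pca$rotation[, 1:k, drop = FALSE]

# Project the holdout set onto the training components...
centered <- sweep(holdout, 2, pca$center, "-")
scores <- centered %*% loadings

# ...and back out into gene space to get the reconstructed holdout set.
reconstructed <- sweep(scores %*% t(loadings), 2, pca$center, "+")

# Mean squared reconstruction error on the holdout set.
mean((holdout - reconstructed)^2)
```

Because the components are learned from the training set alone, the reconstruction error on the holdout set reflects how well the training-set space captures data it has not seen.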

Differential Expression Pipeline

Here's a schematic overview of our main differential expression experiment:

Overview of differential expression experiment.

  1. All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific "silver standards." These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets composed entirely of data from one platform and processed in a standard way: log2-transformed microarray data and "untransformed" RSEM count data (preprocessed using the limma::voom function).
  2. RNA-seq’d samples were "titrated" into the data set, 10% at a time (0-100%), resulting in eleven experimental sets for each normalization method.
  3. Differentially expressed genes (DEGs) were identified using the limma package. We compared the Her2 and LumA subtypes as well as Basal vs. all other samples.
  4. Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity.

# Note: This requires the data to be processed to include matched samples only, 
# and split into training and test sets (0-expression_data_overlap_and_split.R)

# To run the differential expression pipeline, run in top directory:
sh run_differential_expression_experiments.sh
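
The comparison in step 4 can be illustrated with a short base R sketch. Everything here is an assumption made for the example: the gene names and p-values are made up, the helper names `jaccard` and `call_degs` are hypothetical, and DEGs are called with a Benjamini-Hochberg adjustment at an FDR of 0.05 rather than with the thresholds used in this repository.

```r
# Jaccard similarity: size of the intersection over size of the union.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)
  length(intersect(a, b)) / union_size
}

# Call genes differentially expressed at a given FDR from raw p-values,
# using Benjamini-Hochberg adjustment (an assumption for this sketch).
call_degs <- function(p_values, fdr = 0.05) {
  names(p_values)[p.adjust(p_values, method = "BH") < fdr]
}

# Made-up p-values for six hypothetical genes.
silver_p <- c(geneA = 0.0001, geneB = 0.0004, geneC = 0.002,
              geneD = 0.30, geneE = 0.60, geneF = 0.90)
experiment_p <- c(geneA = 0.0002, geneB = 0.40, geneC = 0.001,
                  geneD = 0.003, geneE = 0.70, geneF = 0.80)

silver_degs <- call_degs(silver_p)          # geneA, geneB, geneC
experiment_degs <- call_degs(experiment_p)  # geneA, geneC, geneD

jaccard(silver_degs, experiment_degs)       # 2 shared / 4 total = 0.5
```

A value of 1 means the experimental DEG list matches the silver standard exactly, and a value of 0 means the two lists share no genes.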

Requirements

This analysis was performed in R. It requires the R and Bioconductor packages listed in check_installs.R.

One GitHub package (TDM) is also required. To install it, run:

library(devtools)
devtools::install_github("greenelab/TDM")

This analysis is in the process of being moved to a Docker image.

Funding

This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552] and the National Institutes of Health [T32-AR007442, U01-TR001263].
