plger / pipeComp

Licence: GPL-3.0
An R framework for pipeline benchmarking, with application to single-cell RNAseq

pipeComp

pipeComp is a simple framework to facilitate the comparison of pipelines involving various steps and parameters. It was initially developed to benchmark single-cell RNA sequencing pipelines:

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools
Pierre-Luc Germain, Anthony Sonrel & Mark D Robinson, Genome Biology 2020, doi: 10.1186/s13059-020-02136-7

However, the framework can be applied to any other context (see the pipeComp_dea vignette for an example). This readme provides an overview of the framework and package; for more detail, please refer to the two vignettes.



Introduction

pipeComp is especially suited to the benchmarking of pipelines that include many steps/parameters, enabling the exploration of combinations of parameters and of the robustness of methods to various changes in other parts of a pipeline. It is also particularly suited to benchmarks across multiple datasets. It is entirely based on R/Bioconductor, meaning that methods outside of R need to be called via R wrappers. pipeComp handles multithreading in a way that minimizes re-computation and duplicated memory usage, and computes evaluation metrics on the fly to avoid saving many potentially large intermediate files, making it well-suited for benchmarks involving large datasets.

This readme gives a very brief overview of the package. For more detailed information on the framework, refer to the pipeComp vignette. For information specifically about the scRNAseq pipeline and evaluation metrics (as well as more complex example usages of the plotting functions), see the pipeComp_scRNA vignette. For a completely different example, with a walkthrough of the creation of a new PipelineDefinition, see the pipeComp_dea vignette.

Recent changes

  • In pipeComp 0.99.43, it became possible to continue runs despite errors (see the skipErrors argument of runPipeline, and the 'Handling errors' section of the pipeComp vignette).

  • From pipeComp 0.99.26 on, the plotting functions for the scRNAseq clustering pipeline (scrna_evalPlot_DR and scrna_evalPlot_clust) have been replaced by a more flexible, pipeline-generic function (evalHeatmap) and a silhouette-specific plotting function (scrna_evalPlot_silh). The general heatmap coloring scheme has also been changed to make meaningful changes clearer.

  • In pipeComp 0.99.24, multithreading capabilities were extended (there is now virtually no limit).

  • pipeComp >=0.99.3 made important changes to the format of the output and greatly simplified the evaluation outputs for the scRNA pipeline. As a result, results produced with older versions of the package are no longer compatible with the current version's aggregation and plotting functions.

Installation

Install using:

BiocManager::install("plger/pipeComp", build_vignettes=TRUE)
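
If BiocManager itself is not yet installed, it can first be obtained from CRAN:

if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")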

Due to Bioconductor standards, pipeComp requires R>=4, but it is actually compatible with R>=3.6.1 (users who have not yet moved to R4 can use the R3.6 branch).

Because pipeComp was meant as a general pipeline benchmarking framework, we have tried to restrict the package's dependencies to a minimum. Using the scRNA-seq pipeline and wrappers, however, requires further packages to be installed. To check whether these dependencies are met for a given pipelineDefinition and set of alternatives, see ?checkPipelinePackages.
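
For instance, with a pipeline definition and set of alternatives defined as in the examples below (a minimal sketch; see ?checkPipelinePackages for the exact arguments):

checkPipelinePackages(alternatives, pipDef)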



Using pipeComp

Scheme of the pipeComp framework. A: The `PipelineDefinition` class represents pipelines as, minimally, a set of functions consecutively executed on the output of the previous one, and optionally accompanied by evaluation and aggregation functions. B: Given a `PipelineDefinition`, a set of alternative parameters and benchmark datasets, the `runPipeline` function proceeds through all combinations of arguments, avoiding recomputing the same step twice and compiling evaluations on the fly.

PipelineDefinition

As represented in the figure above, the PipelineDefinition S4 class represents pipelines as, minimally, a set of functions (accepting any number of parameters) consecutively executed on the output of the previous one, and optionally accompanied by evaluation and aggregation functions. A simple pipeline can be constructed as follows:

my_pip <- PipelineDefinition( list( step1=function(x, param1){
                                      # do something with x and param1
                                      x
                                    },
                                    step2=function(x, method1, param2){
                                      get(method1)(x, param2)
                                    },
                                    step3=function(x, param3){
                                      x <- some_fancy_function(x, param3)
                                      # the functions can also output evaluation
                                      # through the `intermediate_return` slot:
                                      e <- my_evaluation_function(x)
                                      list( x=x, intermediate_return=e)
                                    }
                                  ))

The PipelineDefinition can also include descriptions of each step or evaluation and aggregation functions. For example:

my_pip <- PipelineDefinition( list( step1=function(x, meth1){ get(meth1)(x) },
                                    step2=function(x, meth2){ get(meth2)(x) } ),
                              evaluation=list( step2=function(x){ sum(x) }) )

See ?PipelineDefinition for more information, or scrna_pipeline for a more complex example:

pipDef <- scrna_pipeline()
pipDef

A PipelineDefinition object with the following steps:
  - doublet(x, doubletmethod) *
Takes a SCE object with the `phenoid` colData column, passes it through the 
function `doubletmethod`, and outputs a filtered SCE.
  - filtering(x, filt) *
Takes a SCE object, passes it through the function `filt`, and outputs a 
filtered Seurat object.
  - normalization(x, norm)
Passes the object through function `norm` to return the object with the 
normalized and scale data slots filled.
  - selection(x, sel, selnb)
Returns a seurat object with the VariableFeatures filled with `selnb` features 
using the function `sel`.
  - dimreduction(x, dr, maxdim) *
Returns a seurat object with the PCA reduction with up to `maxdim` components 
using the `dr` function.
  - clustering(x, clustmethod, dims, k, steps, resolution, min.size) *
Uses function `clustmethod` to return a named vector of cell clusters.

Manipulating PipelineDefinition objects

A number of generic methods are implemented for PipelineDefinition objects, including show, names, length, [, and as.list. This means that, for instance, a step can be removed from a pipeline in the following way:

pd2 <- pipDef[-1]
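
The other generics behave as one would expect; for example:

names(pipDef)    # step names
length(pipDef)   # number of steps
as.list(pipDef)  # the steps as a plain list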

Steps can also be added (using the addPipelineStep function) and edited; see the pipeComp vignette for more detail:

vignette("pipeComp", package="pipeComp")
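
For instance, a new (initially empty) step could be inserted as follows (a sketch using a hypothetical step name; see ?addPipelineStep for the exact arguments):

my_pip2 <- addPipelineStep(my_pip, name="newstep", after="step1")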



Running pipelines

Preparing the other arguments

runPipeline requires three main arguments: i) the pipelineDefinition, ii) the list of alternative parameter values to try, and iii) the list of benchmark datasets.

The scRNAseq datasets used in the paper can be downloaded from figshare and prepared in the following way:

download.file("https://ndownloader.figshare.com/articles/11787210/versions/1", "datasets.zip")
unzip("datasets.zip", exdir="datasets")
datasets <- list.files("datasets", pattern="SCE\\.rds", full.names=TRUE)
names(datasets) <- sapply(strsplit(basename(datasets),"\\."),FUN=function(x) x[1])
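
Each of these files stores an rds-serialized SingleCellExperiment (SCE) object; an individual dataset can be inspected with, e.g.:

sce <- readRDS(datasets[1])
sce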

Next we prepare the alternative methods and parameters. Functions can be passed as arguments through their name (if they are loaded in the environment):

# load alternative functions
source(system.file("extdata", "scrna_alternatives.R", package="pipeComp"))
# we build the list of alternatives
alternatives <- list(
  doubletmethod=c("none"),
  filt=c("filt.lenient", "filt.stringent"),
  norm=c("norm.seurat", "norm.sctransform", "norm.scran"),
  sel=c("sel.vst"),
  selnb=2000,
  dr=c("seurat.pca"),
  clustmethod=c("clust.seurat"),
  dims=c(10, 15, 20, 30),
  resolution=c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 1.2, 2)
)

Running the analyses

res <- runPipeline( datasets, alternatives, pipDef, nthreads=3,
                    output.prefix="myfolder/" )
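
Because the evaluation results are also written to disk under the given output.prefix, a finished run can later be reloaded without recomputation. A sketch, assuming the default file layout (see ?readPipelineResults):

res <- readPipelineResults("myfolder/")
res <- aggregatePipelineResults(res)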

Exploring the metrics

Data can be explored manually or plotted using generic or pipeline-specific functions. For example:

scrna_evalPlot_silh( res )

evalHeatmap( res, step="dimreduction", what2="meanAbsCorr.covariate2", 
             what=c("log10_total_features","log10_total_counts") )

The functions let you choose the parameters across whose values to aggregate, and also support custom filtering:

evalHeatmap(res, step = "clustering", what=c("MI","ARI"), agg.by=c("filt","norm")) +
  evalHeatmap(res, step = "clustering", what="ARI", agg.by=c("filt", "norm"),
              filter=n_clus==true.nbClusts, title="ARI at\ntrue k")

See the vignettes and the functions' help pages for more details.
