All Projects → debsin → dropClust

debsin / dropClust

Licence: other
Version 2.1.0 released

Programming Languages

r
7636 projects
C++
36643 projects - #6 most used programming language
TeX
3793 projects

Projects that are alternatives of or similar to dropClust

RATTLE
Reference-free reconstruction and error correction of transcriptomes from Nanopore long-read sequencing
Stars: ✭ 35 (+84.21%)
Mutual labels:  rna-seq, genomics, clustering
Ml Course
Starter code of Prof. Andrew Ng's machine learning MOOC in R statistical language
Stars: ✭ 154 (+710.53%)
Mutual labels:  clustering, pca
Hdbscan
A high performance implementation of HDBSCAN clustering.
Stars: ✭ 2,032 (+10594.74%)
Mutual labels:  clustering, cluster-analysis
clustering-python
Different clustering approaches applied on different problemsets
Stars: ✭ 36 (+89.47%)
Mutual labels:  clustering, cluster-analysis
Nanny
A tidyverse suite for (pre-) machine-learning: cluster, PCA, permute, impute, rotate, redundancy, triangular, smart-subset, abundant and variable features.
Stars: ✭ 17 (-10.53%)
Mutual labels:  clustering, pca
Ml code
A repository for recording the machine learning code
Stars: ✭ 75 (+294.74%)
Mutual labels:  clustering, pca
genie
Genie: A Fast and Robust Hierarchical Clustering Algorithm (this R package has now been superseded by genieclust)
Stars: ✭ 21 (+10.53%)
Mutual labels:  clustering, cluster-analysis
northstar
Single cell type annotation guided by cell atlases, with freedom to be queer
Stars: ✭ 23 (+21.05%)
Mutual labels:  clustering, singlecell
scarf
Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
Stars: ✭ 54 (+184.21%)
Mutual labels:  genomics, clustering
AnnA Anki neuronal Appendix
Using machine learning on your anki collection to enhance the scheduling via semantic clustering and semantic similarity
Stars: ✭ 39 (+105.26%)
Mutual labels:  clustering, pca
dee2
Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
Stars: ✭ 32 (+68.42%)
Mutual labels:  rna-seq, genomics
Cilantro
A lean C++ library for working with point cloud data
Stars: ✭ 577 (+2936.84%)
Mutual labels:  clustering, pca
2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (+1373.68%)
Mutual labels:  clustering, pca
Msmbuilder
🏗 Statistical models for biomolecular dynamics 🏗
Stars: ✭ 118 (+521.05%)
Mutual labels:  clustering, pca
genieclust
Genie++ Fast and Robust Hierarchical Clustering with Noise Point Detection - for Python and R
Stars: ✭ 34 (+78.95%)
Mutual labels:  clustering, cluster-analysis
CoronaDash
COVID-19 spread shiny dashboard with a forecasting model, countries' trajectories graphs, and cluster analysis tools
Stars: ✭ 20 (+5.26%)
Mutual labels:  clustering, cluster-analysis
pyclustertend
A python package to assess cluster tendency
Stars: ✭ 38 (+100%)
Mutual labels:  clustering, cluster-analysis
clusters
Cluster analysis library for Golang
Stars: ✭ 68 (+257.89%)
Mutual labels:  clustering, cluster-analysis
ml
经典机器学习算法的极简实现
Stars: ✭ 130 (+584.21%)
Mutual labels:  pca
rare-disease-wf
(WIP) best-practices workflow for rare disease
Stars: ✭ 47 (+147.37%)
Mutual labels:  genomics

The latest version of dropClust is now available in desktop and online versions.

New Additions

Improved Interoperability Integrative Analysis Online web-server
SingleCellExperiment Container

dropClust Online

Visit https://debsinha.shinyapps.io/dropClust/ for the online version.

Desktop Installation

The developer version of the R package can be installed with the following R commands:

library(devtools)
install_github("debsin/dropClust", dependencies = T)

Vignette tutorial

This vignette uses a small data set from the 10X website (3K PBMC dataset here ) to demonstrate a standard pipeline. This vignette can be used as a tutorial as well.

Setting up directories

library(dropClust)
set.seed(0)

Loading data

dropClust loads UMI count expression data from three input files. The files follow the same structure as the datasets available from the 10X website, i.e.:

  • count matrix file in sparse format
  • transcriptome identifiers as a TSV file and
  • gene identifiers as a TSV file
# Load Data, path contains decompressed files 
sce <-readfiles(path = "C:/Projects/dropClust/data/pbmc3k/hg19/")

Pre-processing

dropClust performs pre-processing to remove poor quality cells and genes. dropClust is also equipped to mitigate batch-effects that may be present. The user does not need to provide any information regarding the source of the batch for individual transcriptomes. However, the batch-effect removal step is optional.

Cells are filtered based on the total UMI count in a cell specified by parameter min_count. Poor quality genes are removed based on the minimum number of cells min_count with expressions above a given threshold min_count.

# Filter poor quality cells.  A threshold th corresponds to the total count of a cell.
sce<-FilterCells(sce)
sce<-FilterGenes(sce)

Data normalization and removing poor quality genes

Count normalization is then performed with the good quality genes only. Normalized expression values is computed on the raw count data in a SingleCellExperiment object, using the median normalized total count.

sce<-CountNormalize(sce)

Selecting highly variable genes

Further gene selection is carried out by ranking the genes based on its dispersion index.

# Select Top Dispersed Genes by setting ngenes_keep.
sce<-RankGenes(sce, ngenes_keep = 1000)

Structure Preserving Sampling

Primary clustering is performed in a fast manner to estimate a gross structure of the data. Each of these clusters is then sampled to fine tune the clustering process.

sce<-Sampling(sce)

Gene selection based on PCA

Another gene selection is performed to reduce the number of dimensions. PCA is used to identify genes affecting major components.

# Find PCA top 200 genes. This may take some time.
sce<-RankPCAGenes(sce)

Clustering

Fine tuning the clustering process

By default best-fit, Louvain based clusters are returned. However, the user can tune the parameters to produce the desired number of clusters. The un-sampled transcriptomes are assigned cluster identifiers from among those identifiers produced from fine-tuning clustering. The post-hoc assignment can be controlled by setting the confidence value conf. High conf values will assign cluster identifiers to only those transcriptomes sharing a majority of common nearest neighbours.

# When `method = hclust`
# Adjust Minimum cluster size with argument minClusterSize (default = 20)
# Adjust tree cut with argument level deepSplit (default = 3), higher value produces more clusters.
sce<-Cluster(sce, method = "default", conf = 0.8)

Visualizing clusters

Compute 2D embeddings for samples followed by post-hoc clustering.

sce<-PlotEmbedding(sce, embedding = "umap", spread = 10, min_dist = 0.1)

plot_data = data.frame("Y1" = reducedDim(sce,"umap")[,1], Y2 = reducedDim(sce, "umap")[,2], color = sce$ClusterIDs)

ScatterPlot(plot_data,title = "Clusters")

Find cluster specific Differentially Expressed genes

DE_genes_all = FindMarkers(sce, selected_clusters=NA, lfc_th = 1, q_th =0.001, nDE=30)

write.csv(DE_genes_all$genes, 
          file = file.path(tempdir(),"ct_genes.csv"),
          quote = FALSE)

Plot hand picked marker genes

marker_genes = c("S100A8", "GNLY", "PF4")

p<-PlotMarkers(sce, marker_genes)

Heat map of top DE genes from each cluster

# Draw heatmap
p<-PlotHeatmap(sce, DE_res = DE_genes_all$DE_res,nDE = 10)

print(p)

Integrative analysis

Loading datasets

Each dataset represents one batch and must be a SingleCellExperiment object. The objects are are merged by passing a list in the next step.

library(dropClust)
load(url("https://raw.githubusercontent.com/LuyiTian/CellBench_data/master/data/sincell_with_class.RData"))

objects = list()

objects[[1]] = sce_sc_10x_qc

objects[[2]] = sce_sc_CELseq2_qc

objects[[3]] = sce_sc_Dropseq_qc

Merge datasets using dropClust

Datasets can be merged in two ways: using a set of DE genes from each batch or, using the union of the sets of highly variable genes from each batch.

Perform correction and dimension reduction

set.seed(1)
dc.corr <-  Correction(merged_data,  method="default", close_th = 0.1, cells_th = 0.1,
                       components = 10, n_neighbors = 30,  min_dist = 1)

Perform Clustering on integrated dimensions

dc.corr = Cluster(dc.corr,method = "kmeans",centers = 3)

Visualizing clusters

Compute 2D embeddings for samples followed by post-hoc clustering.

ScatterPlot(dc.corr, title = "Clusters")

Batch corrected dropClust based Clustering.

Optional Batch correction

Users can use fastmnn method for batchcorrection. Specific arguments of fastmnn can also be passed through the Correction module.

merged_data.fastmnn<-Merge(all.objects,use.de.genes = FALSE)
set.seed(1)
mnn.corr <-  Correction(merged_data.fastmnn,  method="fastmnn", d = 10)
mnn.corr = Cluster(mnn.corr,method = "kmeans",centers = 3)
ScatterPlot(mnn.corr, title = "Clusters")

Marker discovery from the merged dataset

de<-FindMarkers(dc.corr,q_th = 0.001, lfc_th = 1.2,nDE = 10)
de$genes.df
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].