All Projects → r3fang → snATAC

r3fang / snATAC

Licence: other
<<------ Use SnapATAC!!

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to snATAC

topometry
A comprehensive dimensional reduction framework to recover the latent topology from high-dimensional data.
Stars: ✭ 64 (+178.26%)
Mutual labels:  clustering, single-cell-genomics, single-cell-analysis
scarf
Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
Stars: ✭ 54 (+134.78%)
Mutual labels:  clustering, single-cell-genomics, single-cell-atac-seq
seurat-wrappers
Community-provided extensions to Seurat
Stars: ✭ 169 (+634.78%)
Mutual labels:  single-cell-genomics, single-cell-analysis
SHARP
SHARP: Single-cell RNA-seq Hyper-fast and Accurate processing via ensemble Random Projection
Stars: ✭ 14 (-39.13%)
Mutual labels:  clustering, single-cell-analysis
cellrank
CellRank for directed single-cell fate mapping
Stars: ✭ 222 (+865.22%)
Mutual labels:  single-cell-genomics
tsp-essay
A fun study of some heuristics for the Travelling Salesman Problem.
Stars: ✭ 15 (-34.78%)
Mutual labels:  clustering
ExpressionMatrix2
Software for exploration of gene expression data from single-cell RNA sequencing.
Stars: ✭ 29 (+26.09%)
Mutual labels:  clustering
DESOM
🌐 Deep Embedded Self-Organizing Map: Joint Representation Learning and Self-Organization
Stars: ✭ 76 (+230.43%)
Mutual labels:  clustering
consul role
Ansible role to install Consul (cluster of) server/agent
Stars: ✭ 14 (-39.13%)
Mutual labels:  clustering
kohonen-maps
Implementation of SOM and GSOM
Stars: ✭ 62 (+169.57%)
Mutual labels:  clustering
rabbitmq-clusterer
This project is ABANDONWARE. Use https://www.rabbitmq.com/cluster-formation.html instead.
Stars: ✭ 72 (+213.04%)
Mutual labels:  clustering
NNet
algorithm for study: multi-layer-perceptron, cluster-graph, cnn, rnn, restricted boltzmann machine, bayesian network
Stars: ✭ 24 (+4.35%)
Mutual labels:  clustering
dyngen
Simulating single-cell data using gene regulatory networks 📠
Stars: ✭ 59 (+156.52%)
Mutual labels:  single-cell-analysis
rabbitmq-peer-discovery-consul
Consul-based peer discovery backend for RabbitMQ 3.7.0+
Stars: ✭ 39 (+69.57%)
Mutual labels:  clustering
EgoSplitting
A NetworkX implementation of "Ego-splitting Framework: from Non-Overlapping to Overlapping Clusters" (KDD 2017).
Stars: ✭ 78 (+239.13%)
Mutual labels:  clustering
clueminer
interactive clustering platform
Stars: ✭ 13 (-43.48%)
Mutual labels:  clustering
FixedEffectjlr
R interface for Fixed Effect Models
Stars: ✭ 20 (-13.04%)
Mutual labels:  clustering
IntroduceToEclicpseVert.x
This repository contains the code of Vert.x examples contained in my articles published on platforms such as kodcu.com, medium, dzone. How to run each example is described in its readme file.
Stars: ✭ 27 (+17.39%)
Mutual labels:  clustering
MAL-Map
Cluster and visualize relationships between anime on MyAnimeList
Stars: ✭ 201 (+773.91%)
Mutual labels:  clustering
scAlign
A deep learning-based tool for alignment and integration of single cell genomic data across multiple datasets, species, conditions, batches
Stars: ✭ 32 (+39.13%)
Mutual labels:  single-cell-genomics

NOTE: This project is no longer under maintenance, please find our latest pipeline SnapATAC here

Introduction

snATAC is a Ren-lab in-house bioinformatics pipeline for single-nucleus ATAC-seq (snATAC-seq). Download the our full data set from here.

Dependency

Quick Install

> git clone --depth=1 https://github.com/r3fang/snATAC.git
> cd snATAC
> chmod u+x bin/*
> gcc -g -O2 src/calculate_jacard_index_matrix.c -o bin/snATAC_jacard
> export PATH=$PATH:./bin/
> ./bin/snATAC

Program:  snATAC (snATAC-seq analysis pipeline)
Version:  12.24.2017
Contact:  Rongxin Fang <[email protected]>
          Sebastian Preissl <[email protected]>
          Bing Ren <[email protected]>

Usage:    snATAC <command> [options]

Command:  pre           preprocessing
          filter        filter reads with unselected barcodes
          bmat          binary accessible matrix
          jacard        jaccard index matrix

Optional (under development):
          decomplex     decomplex the fastq file
          bstat         simple statistics for barcode

Note: To use snATAC pipeline, you need to first decomplex barcode
combination and integrate barcodes to the beginning of the
read name in both R1 and R2 fastq files.

Requirement

Before alignment, you need to first decomplex the FASTQ file by adding the barcode to the beginning of each read. Below is one example of decomplexed R1 and R2 for snATAC-seq. Each read name follows this format: "@" + "barcode" + ":" + "original read_name". Unfortunately, because barcode design may be different between each experiment, we decided not to include this part in the pipeline.

> zcat p56.R1.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 1N00
GTGTTGTTCTAGCTGGACAGGACAACTTCCTATCCTCCCCTTTAGCCCTA
+
DDDDDIIIIIIIIIIIIIIIHIIIIIIIHHIHIIIIIIIIHIIIHIIIHI

> zcat p56.R2.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 2N00
CGGGCTCCTCGGCCGATATGTATGAGTAGGAAGGTGTCCTGTCCAGCTAG
+
<0000/11110000/<///<<11111<111<110<DCH1G<<<11111<1

Pipeline

snATAC has following steps:

  1. Alignment by bwa or bowtie2 (output.bam);
  2. Pre-processing (output.bed/output.qc);
    • Alignment filteration
    • Speration of single cell
    • PCR duplicates removal
    • Mitochondrial reads removal
    • Adjusting position of Tn5 insertion
    • Create a master .bed file
    • Create a .qc file
  3. Identify feature candidates (output.txt);
    • Call peaks using MACS2 from ensemble data
  4. Calculate barcode statistics;
    • Reads per barcode
    • Consecutive promoter coverage
    • Reads in peak ratio
  5. Barcode selection (output.xgi);
    • Reads per barcode >= 1000
    • Consecutive promoter coverage > 5%
    • Reads in peak ratio >= 20%
    • NOTE: above cutoff can vary singificant between different samples
    • Filter potential doublets
  6. Feature selection (output.ygi);
    • Filter top 5% peaks
    • Filter promoters
  7. Binary accesibility matrix (output.mat);
  8. Jaccard index matrix (output.jacard);
  9. Dimentionality reduction (output.tsne);
  10. Density-based clustering (output.cluster);

A Complete Example

Step 0. Download sample data.

> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R1.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R2.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/snATAC/mm10.tar.gz
> tar -xvzf mm10.tar.gz

Step 1. Alignment using bwa (or bowtie2) in pair-end.

> bwa mem -t 1 mm10.fa \
          p56.rep1.R1.decomplex.fastq.gz \
          p56.rep1.R2.decomplex.fastq.gz \
          | samtools view -bS - > p56.rep1.bam
> samtools sort -t 5 -n p56.rep1.bam -o p56.rep1.nsrt.bam

Step 2. Pre-processing.

> ./bin/snATAC pre -t 5 -m 30 -f 2000 -e 75 \
                   -i p56.rep1.nsrt.bam \
                   -o p56.rep1.bed.gz 2> p56.rep1.pre.log

./bin/snATAC pre will output two files p56.pre.bed.gz and p56.rep1.pre.log. p56.pre.bed.gz contains read and corresponding barcode, p56.rep1.pre.log contains basic quality control. Below is one example of p56.rep1.pre.log.

p56.pre.log
number of totol reads 217064143
number of uniquely mapped reads 191813608
number of properly paired reads 189561202
number of chrM reads 5334929
number of usable reads 184310563
number of distinct reads 52036692
estimated duplicate rate 0.71

Step 3. Identify feature candidates (output.txt)

> macs2 callpeak -t p56.rep1.bed.gz \
                 -f BED -n p56.rep1 \
                 -g mm -p 0.05 \
                 --nomodel --shift 150 \
                 --keep-dup all

Step 4. Calculate barcode statistics

##################################################################
# calculate 3 major barcode stats
# 1) number of reads
# 2) consecutive promoter coverage 
# 3) reads in peak ratio
##################################################################
# count number of reads per barcode
> zcat p56.rep1.bed.gz | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.reads_per_cell
# consecutive promoter coverage 
> intersectBed -wa -wb \
               -a p56.rep1.bed.gz \
               -b mm10_consecutive_promoters.bed \
               | awk '{print $4, $8}' \
               | sort \
               | uniq \ 
               | awk '{print $1}' \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.promoter_cov 
# reads in peak ratio
> intersectBed -a p56.rep1.bed.gz -b p56.rep1_peaks.narrowPeak -u \
               | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 - > p56.rep1.reads_in_peak

Step 5. Cell selection (output.xgi)

> R
##################################################################
# 1) reads per barcode >= 1000
# 2) consecutive promoter coverage > 3%
# 3) reads in peak ratio >= 20%
# NOTE: The cutoff can vary singificantly between different samples
# 4) filter potential doublets (OPTIONAL NOT SUGGUESTED, UNSTABLE)
##################################################################

consecutive_promoters <- read.table("mm10/mm10_consecutive_promoters.bed")
num_of_reads = read.table("p56.rep1.reads_per_cell")
promoter_cov = read.table("p56.rep1.promoter_cov")
read_in_peak = read.table("p56.rep1.reads_in_peak")
qc = num_of_reads; 
colnames(qc) <- c("barcode", "num_of_reads")
qc$promoter_cov = 0; 
qc$read_in_peak = 0;
qc$promoter_cov[match(promoter_cov$V1, qc$barcode)] = promoter_cov$V2/nrow(consecutive_promoters)
qc$read_in_peak[match(read_in_peak$V1, qc$barcode)] = read_in_peak$V2
qc$ratio = qc$read_in_peak/qc$num_of_reads
idx <- which(qc$promoter_cov > 0.09 & qc$ratio > 0.2 & qc$num_of_reads > 1000)
qc_sel <- qc[idx,]

# among these cells, further filter PUTATIVE DOUBLETS ((OPTIONAL NOT SUGGUESTED, UNSTABLE))
#pvalues <- sapply(qc_sel$num_of_reads, function(x) 
#           poisson.test(x, mean(qc_sel$num_of_reads), 
#           alternative="greater")$p.value)
#fdrs <- p.adjust(pvalues, "BH")
#idx <- which(fdrs < 1e-2)

write.table(qc_sel[,1], file = "p56.rep1.xgi", append = FALSE, 
			  quote = FALSE, sep = "\t", eol = "\n", 
			  na = "NA", dec = ".", row.names = FALSE,
			  col.names = FALSE, qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 5. Feature selection (output.xgi);

> R
##################################################################
# 1) Filter top 5% peaks
# 2) Filter promoters
# 3) extend and merge
##################################################################
library(GenomicRanges)
peaks.df <- read.table("p56.rep1_peaks.narrowPeak")
# remove top 5% peaks
cutoff <- quantile((peaks.df$V5), probs = 0.95)
peaks.df <- peaks.df[which(peaks.df$V5 < cutoff),]
proms.df <- read.table("mm10/mm10.refSeq_promoter.bed")
peaks.gr <- GRanges(peaks.df[,1], IRanges(peaks.df[,2], peaks.df[,3]))
proms.gr <- GRanges(proms.df[,1], IRanges(proms.df[,2], proms.df[,3]))

peaks.sel.gr <- peaks.gr[-queryHits(findOverlaps(peaks.gr, proms.gr))]
peaks.sel.ex.gr <- resize(reduce(resize(peaks.sel.gr, 1000, 
                          fix="center")), 1000, fix="center")

peaks.sel.ex.df <- as.data.frame(peaks.sel.ex.gr)[,1:3]
write.table(peaks.sel.ex.df, file = "p56.rep1.ygi", 
			  append = FALSE, quote = FALSE, sep = "\t", 
			  eol = "\n", na = "NA", dec = ".", 
			  row.names = FALSE, col.names = FALSE, 
			  qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 6. Generate binary accessibility matrix (NOTE: this may require large RAM)

> ./bin/snATAC bmat -i p56.rep1.bed.gz \
                    -x p56.rep1.xgi \
                    -y p56.rep1.ygi \
                    -o p56.rep1.mat

Step 7. Calculate jaccard index (NOTE: snATAC jacard may require large memory when you have more cells, "Segmentation fault" means you need larger RAM)

# mannually count number of rows (1,464)
> wc -l p56.rep1.xgi
# mannually count number of columns (184,519)
> wc -l p56.rep1.ygi
# calculate jaccard index matrix
> ./bin/snATAC jacard -i p56.rep1.mat -x 1464 -y 184519 -o p56.rep1.jacard

Step 8. Dimentionality reduction (R) (NOTE: tSNE is a heristic method, it is expected that different run can have slightly different result)

> R
##################################################################
# Dimentionality reduction using t-SNE
# Please refer to https://lvdmaaten.github.io/tsne/ for how to
# tune parameters such as perplexity. We find most of time, it 
# takes less than 500 iteration to converge.
##################################################################

library(tsne)

data <- as.matrix(read.table("p56.rep1.jacard"))
diag(data) <- 0
b = tsne(data/sum(data), initial_config = NULL, 
		  k = 2, perplexity = 30, max_iter = 500, 
		  min_cost = 0, epoch_callback = NULL, 
		  whiten = TRUE, epoch=100)

plot(b, cex=0.7, xlab="tsne1", ylab="tsne2")
write.table(b, file = "p56.rep1.tsne", append = FALSE, 
            quote = FALSE, sep = "\t", eol = "\n", 
            na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

plot

Step 9. Density-based clustering

> R
##################################################################
# NOTE: Choosing cluster number is absolutely an art! In our paper, 
# we adopted a method originally published from Habib et al. that
# determines the best cluster number by optimizing Dunn Index.
# However, such approach is extremely slow with time complexity
# O(n^3), without optimization and better implementation, it is
# almost impossible to run on a sample of thousands of cells.
# Therefore, we decided to switch to a heuristic approach as shown
# below. Though, it does not guarantee the best result (in fact, 
# none of the methods do), from our experience, it gives satisfactory
# result. At the end of the day, choosing the optimal cluster number
# is very tricky task and it is hard to be fully automatic. We still
# post our code for Dunn Index approach as described in our paper. 
##################################################################

library(densityClust)
MaxStep <- function(D){
	D_hat <- D
	n = nrow(D)
	for(k in 1:n){
		for(i in 1:(n-1)){
			for(j in (i+1):n){
				D_hat[i,j] <- min(D_hat[i,j], 
				max(D_hat[k,j], D_hat[i,k]))
			}
		}
	}
	return(D_hat)
}
find_center <- function(p, clust){
	centers <- data.frame()
	cols <- c()
	for(i in as.numeric(names(table(clust)))){
		ind <- which(clust== i)
		if(length(ind) > 10){
			centers <- rbind(centers, colMeans(p[ind,1:3]))
			cols <- c(cols, i)
		}
	}
	res <- cbind(centers, cols)
	colnames(res) <- c("x", "y", "z", "col")
	return(res)
}
DB_index <- function(D, CL){
	cl_uniq <- unique(CL)
	n <- length(cl_uniq)
	d_k <- rep(0, n)
	d_ij <- matrix(0, n, n)
	for(i in 1:n){
		ii <- which(CL == i);
		d_k[i] <- max(MaxStep(D[ii,ii]))
	}
	for(i in 1:(n-1)){
		for(j in (i+1):n){
			ii <- which(CL == i | CL == j)
			d_ij <- max(MaxStep(D[ii,ii]))
		}
	}
	return(min(d_ij)/max(d_k))
}

# perform cluster
points <- read.table("p56.rep1.tsne")

set.seed(10)
dis <- dist(points)
irisClust <- densityClust(dis, gaussian=TRUE)
# plot decision graph
plot(irisClust)

rho_cutoff <- irisClust$dc
delta_cutoff <- 20
irisClust <- findClusters(irisClust, 
                          rho= rho_cutoff, 
                          delta= delta_cutoff)
clusters <- irisClust$cluster

plot(points, col=clusters, cex=0.5, pch=19)
dev.off()

write.table(data.frame(cluster), file = "p56.rep1.cluster",
			append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

cluster

Cite us

Preissl S.*, Fang R.*, Zhao Y., Raviram R., Zhang Y., Brandon C.S., Huang H., Gorkin D.U., Afzal V., Dickel D.E., Kuan S., Visel A., Pennacchio L.A., Zhang K., Ren B. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci. 2018 Feb 12. doi: 10.1038/s41593-018-0079-3.(* contributed equally)

File Organization

.
# raw data
|____p56.rep1.R1.decomplex.fastq.gz
|____p56.rep1.R2.decomplex.fastq.gz
# pre-defined features for mm10
|____mm10
| |____mm10.genome.size
| |____mm10_consecutive_promoters.bed
| |____mm10.refSeq_promoter.bed
# output from alignment
|____p56.rep1.bam
# output from pre-proessing
|____p56.rep1.log
|____p56.rep1.bed.gz
# output from MACS2
|____p56.rep1_peaks.xls
|____p56.rep1_peaks.narrowPeak
|____p56.rep1_summits.bed
# output from barcode stats
|____p56.rep1.reads_in_peak
|____p56.rep1.promoter_cov
|____p56.rep1.reads_per_cell
# output from barcode selection
|____p56.rep1.xgi
# output from feature selection
|____p56.rep1.ygi
# output from binary accessibility matrix
|____p56.rep1.mat
# output from jaccard matrix
|____p56.rep1.jacard
# output from dimentionality reduction
|____p56.rep1.tsne
# output from clustering
|____p56.rep1.cluster
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].