single-cell-genetics / cellSNP

Licence: Apache-2.0 License
Pileup biallelic SNPs from single-cell and bulk RNA-seq data

cellSNP

cellSNP piles up the expressed alleles in single-cell or bulk RNA-seq data. Its output can be used directly for donor deconvolution in multiplexed single-cell RNA-seq data, particularly with vireo, which assigns cells to donors and detects doublets even without a genotyping reference.

cellSNP depends heavily on pysam, a Python interface for samtools and bcftools. It should give results very similar to samtools/bcftools mpileup, but differs from bcftools mpileup in two major ways:

  1. cellSNP can pile up either the whole genome or a list of positions, and splits the reads directly by a list of cell barcodes, e.g., for 10x Genomics data. With bcftools, you may need to manipulate the RG tag in the BAM file to divide reads into cell barcode groups.
  2. cellSNP applies only simple filters when outputting SNPs, namely the total UMI or read count and the minor allele fraction. The idea is to keep most of the SNP information so that the downstream statistical model can make full use of it.
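As a rough illustration of the first difference, the sketch below groups the bases observed at one position by cell barcode and counts each UMI once. It is a minimal sketch, not cellSNP's implementation: the input tuples stand in for reads that would normally come from pysam, the function name `count_alleles_by_cell` is hypothetical, and keeping the first base seen for a UMI is a simplifying assumption.

```python
from collections import defaultdict


def count_alleles_by_cell(observations, barcodes):
    """Count alleles per cell at one position, counting each UMI once.

    observations: iterable of (cell_barcode, umi, base) tuples, a stand-in
    for the reads covering the position (hypothetical input format).
    barcodes: the cell barcodes to keep; reads from other cells are ignored.
    """
    keep = set(barcodes)
    seen = set()  # (barcode, umi) pairs that were already counted
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, umi, base in observations:
        if barcode not in keep or (barcode, umi) in seen:
            continue
        # Simplification: the first base seen for a UMI is the one counted.
        seen.add((barcode, umi))
        counts[barcode][base] += 1
    return {cb: dict(alleles) for cb, alleles in counts.items()}
```

With bcftools, the same grouping would have to be encoded in the BAM file itself, which is why the RG tag comes up in the comparison above.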

cellSNP now has a C version named cellsnp-lite, which is faster and uses less memory.

News

We recommend cellsnp-lite instead of cellSNP if you would like to use mode 2. For now, cellSNP mode 2 uses the pileup() function of pysam, which filters duplicates and orphan reads by default and may lead to an unexpected reduction in coverage in some cases. Compared to cellSNP, cellsnp-lite provides more flexible read filtering for mode 2, so you can tune the filtering parameters as needed.

We have turned off PCR duplicate filtering by default (--maxFLAG), because duplicates are not well flagged by CellRanger and filtering on them may result in the loss of a substantial fraction of SNPs. Please use v0.3.1 or set --maxFLAG to a large number. Credits to issue13.

All release notes can be found in doc/release.rst.

Initial notes on computational efficiency can be found in doc/speed.rst.

Citation

If you find cellSNP (the predecessor of cellsnp-lite) useful for your research, please cite:

Xianjie Huang, Yuanhua Huang, Cellsnp-lite: an efficient tool for genotyping single cells, Bioinformatics, 2021, btab358, https://doi.org/10.1093/bioinformatics/btab358

Installation

cellSNP is available through PyPI. To install it, run the following command (-U upgrades an existing installation):

pip install -U cellSNP

Alternatively, you can install the latest (often development) version from this GitHub repository with the following command:

pip install -U git+https://github.com/single-cell-genetics/cellSNP

In either case, if you don't have write permission for your current Python environment, we suggest creating a separate conda environment or adding --user to the command.

Quick usage

Once installed, check all arguments by typing cellSNP -h (see a snapshot). cellSNP has three modes:

  • Mode 1: pileup a list of SNPs for a single BAM/SAM file

Use both -R and -b.

Requires: a single BAM/SAM file (e.g., from cellranger), a list of cell barcodes, and a VCF file of common SNPs. This mode is recommended over mode 2 if a list of common SNPs is known, e.g., for human (see List of candidate SNPs below).

cellSNP -s $BAM -b $BARCODE -O $OUT_DIR -R $REGION_VCF -p 20 --minMAF 0.1 --minCOUNT 20

As shown in the command above, we recommend filtering out SNPs with <20 UMIs or a minor allele fraction <10% for downstream donor deconvolution, by adding --minMAF 0.1 --minCOUNT 20.
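The effect of these two thresholds can be sketched as follows. This is an illustration only, not cellSNP's code: the function name `passes_filter` is hypothetical, and it assumes that the minor allele is the second most frequent base and that its fraction is taken relative to the total count at the site.

```python
def passes_filter(allele_counts, min_count=20, min_maf=0.1):
    """Return True if a site passes the total-count and minor-allele filters.

    allele_counts: dict mapping base -> UMI (or read) count summed over cells.
    Assumption: the minor allele is the second most frequent base, and its
    fraction is computed relative to the total count.
    """
    total = sum(allele_counts.values())
    if total == 0 or total < min_count:
        return False
    ordered = sorted(allele_counts.values(), reverse=True)
    minor = ordered[1] if len(ordered) > 1 else 0
    return minor / total >= min_maf
```

Under these assumptions, a site with 18 A and 4 G UMIs is kept, while a site with 95 A and 5 G UMIs is dropped because the minor allele fraction is only 5%.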

In addition, take special care when filtering PCR duplicates in scRNA-seq data by setting maxFLAG to a small value: the upstream pipeline may mark every extra read sharing the same CB/UMI pair as a PCR duplicate, so filtering them out can discard most of the variant data. For this reason, cellSNP by default uses a large maxFLAG value to include PCR duplicates in scRNA-seq data when UMItag is turned on.
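To see why the size of maxFLAG matters, recall that the SAM specification marks a PCR or optical duplicate with flag bit 0x400 (1024). The sketch below assumes that a read is kept when its numeric flag does not exceed maxFLAG; that comparison rule and the example thresholds 255 and 4096 are assumptions made for illustration, not values taken from this README.

```python
# SAM flag bits defined in the SAM specification.
FLAG_SECONDARY = 0x100      # 256
FLAG_DUPLICATE = 0x400      # 1024, PCR or optical duplicate
FLAG_SUPPLEMENTARY = 0x800  # 2048


def keep_read(flag, max_flag):
    """Assumed rule: keep a read when its numeric flag is <= max_flag."""
    return flag <= max_flag


flags = {
    "primary, forward": 0,
    "primary, reverse": 16,
    "duplicate, reverse": 16 | FLAG_DUPLICATE,
}

# A small threshold drops the duplicate-flagged read; a large one keeps it.
small = {name: keep_read(flag, 255) for name, flag in flags.items()}
large = {name: keep_read(flag, 4096) for name, flag in flags.items()}
```

Under this assumed rule, any threshold below 1024 removes every read flagged as a duplicate, which is the situation the paragraph above warns about for UMI-based data.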

  • Mode 2: pileup whole chromosome(s) for a single BAM/SAM file

Don't use -R; -b is optional.

This mode requires a single BAM file, either with cell barcodes (add -b) or from a bulk sample:

# 10x sample with cell barcodes
cellSNP -s $BAM -b $BARCODE -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100

# a bulk sample without cell barcodes and UMI tag
cellSNP -s $bulkBAM -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --UMItag None

Add --chrom if you only want to genotype specific chromosomes, e.g., 1,2 or chrMT.

When piling up the whole genome, we recommend filtering out SNPs with <100 UMIs or a minor allele fraction <10% to save space and speed up inference: --minMAF 0.1 --minCOUNT 100

Note that this mode may output false positive SNPs, for example somatic variants or false calls caused by RNA editing. Such false SNPs are probably not consistent across all cells within one individual, and hence confound the demultiplexing. Nevertheless, for species without a good list of common SNPs, e.g., zebrafish, this strategy is still worth trying, and it does not take much more time than mode 1.

Update: We recommend cellsnp-lite instead of cellSNP if you would like to use mode 2. For now, cellSNP mode 2 uses the pileup() function of pysam, which filters duplicates and orphan reads by default and may lead to an unexpected reduction in coverage in some cases. Compared to cellSNP, cellsnp-lite provides more flexible read filtering for mode 2, so you can tune the filtering parameters as needed.

  • Mode 3: pileup a list of SNPs for one or multiple BAM/SAM files

Use -R but not -b.

Requires: one or multiple BAM/SAM files (bulk or smart-seq), their corresponding sample IDs (optional), and a VCF file with a list of common SNPs. BAM/SAM files can be given as a comma-separated list (-s) or in a list file (-S).

cellSNP -s $BAM1,$BAM2,$BAM3 -I sample_id1,sample_id2,sample_id3 -o $OUT_FILE -R $REGION_VCF -p 20 --UMItag None

cellSNP -S $BAM_list_file -I sample_list_file -o $OUT_FILE -R $REGION_VCF -p 20 --UMItag None
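When many samples are involved, it can be convenient to assemble the comma-separated form of the command programmatically. The helper below is a sketch that uses only the flags shown in the commands above; the function name `build_mode3_command` and the file names are hypothetical, and nothing is executed.

```python
import shlex


def build_mode3_command(bam_files, sample_ids, out_file, region_vcf,
                        nproc=20, umi_tag="None"):
    """Assemble a mode 3 command line as an argument list (nothing is run)."""
    if len(bam_files) != len(sample_ids):
        raise ValueError("need one sample id per BAM file")
    return [
        "cellSNP",
        "-s", ",".join(bam_files),
        "-I", ",".join(sample_ids),
        "-o", out_file,
        "-R", region_vcf,
        "-p", str(nproc),
        "--UMItag", umi_tag,
    ]


# Hypothetical file names, for illustration only.
cmd = build_mode3_command(
    ["a.bam", "b.bam"], ["donorA", "donorB"], "out.vcf.gz", "common_snps.vcf.gz"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once cellSNP is installed; keeping the arguments as a list avoids shell quoting problems with unusual file names.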

Set the filtering thresholds according to the downstream analysis. Please add --UMItag None if your BAM file does not have UMIs, e.g., smart-seq and bulk RNA-seq.
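The three modes differ only in whether -R and -b are supplied. The function below restates those rules from the mode descriptions above; it is a summary aid with a hypothetical name, not cellSNP's internal dispatch logic.

```python
def infer_mode(has_region_vcf, has_barcodes):
    """Map the presence of -R and -b to the mode described above."""
    if has_region_vcf and has_barcodes:
        return 1  # list of SNPs, single BAM/SAM file with cell barcodes
    if not has_region_vcf:
        return 2  # whole chromosome(s); -b is optional
    return 3      # list of SNPs, one or more BAM/SAM files, no barcodes
```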

List of candidate SNPs

A good-quality list of candidate SNPs (usually common SNPs) is important for mode 1 and mode 3. If a list of genotyped SNPs is available, it can be used for the pileup. Alternatively, for human, common SNPs in the population that have been identified by consortia can also be very good candidates, e.g., gnomAD and the 1000 Genomes Project. For the latter, we have compiled a list of 7.4 million common variants (AF>5%) with this bash script and stored it in this folder.
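For readers who want to build a similar list from another resource, selecting common variants by allele frequency can be sketched as below. This is not the bash script mentioned above: it is a minimal sketch that assumes the allele frequency is stored as `AF=<float>` in the INFO column and keeps only biallelic single-base substitutions. The function name `common_variants` is hypothetical.

```python
def common_variants(vcf_lines, min_af=0.05):
    """Yield biallelic SNP records whose INFO AF exceeds min_af.

    Assumes the allele frequency is stored as AF=<float> in the INFO column.
    Indels and sites with several ALT alleles are skipped for simplicity.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt, info = fields[3], fields[4], fields[7]
        if len(ref) != 1 or len(alt) != 1 or alt not in "ACGT":
            continue
        af = None
        for entry in info.split(";"):
            if entry.startswith("AF="):
                af = float(entry[3:])
        if af is not None and af > min_af:
            yield line.rstrip("\n")
```

Because the function takes any iterable of lines, it works on an open text file or on `gzip.open(path, "rt")` for a compressed VCF.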

If you want to lift over SNP positions in a VCF file from one genome build to another, see our LiftOver_vcf wrapper function.

FAQ and releases

For troubleshooting, please have a look at FAQ.rst. We welcome reports of any issues.

All releases are available on PyPI. Notes for each release are recorded in release.rst.