ay-lab / dcHiC

Licence: MIT license
dcHiC: Differential compartment analysis for Hi-C datasets

dcHiC: Differential Compartment Analysis of Hi-C Datasets

License: MIT

dcHiC is a tool for differential compartment analysis of Hi-C datasets. This latest version marks a substantial update from our first release (under the branch "dcHiC-v1"), and remains the only tool to perform Hi-C compartment analyses between multiple datasets. It features many capabilities, including:

  • Optimized PCA calculations (faster, and capable of analysis at resolutions as fine as 5 kb)
  • Comprehensive identification of significant compartment changes between any number of cell lines (with replicates), including with pseudo-bulk single cell data
  • Beautiful standalone HTML files for visualization of results
  • Identification of differential loops anchored in significant differential compartments (using Fit-Hi-C)
  • Gene Ontology annotation of differential compartments

While we hope that all users try the latest version of dcHiC, all code and documentation for the first version remains and we will continue offering support for it into the future.

Paper

If you want to cite our tool, please cite our preprint. See web-hosted visualization examples of case scenarios in the new version here. To see how to run dcHiC, read our docs and try our demo (below)! Information about data pre-processing and running single-cell data is available in the wiki.

Demo

This README contains the key information you will need to use this application. However, some users may find a demo helpful—ours includes a script to run package installation as well as detailed guides for different options of dcHiC. All of these resources are available in packages/dchic_demo.zip, with relevant instructions inside!

Installation

The latest version of dcHiC runs predominantly from R (3+) and Python (3+). The necessary packages may be installed via Conda (Option 1) or manually (Option 2); those transitioning from an existing environment should have most, if not all, of the packages already installed. The packages needed for the core application are listed under Option 2.

Option 1: Conda

We recommend using Conda to install all dependencies in a virtual environment. The suggested path is using the appropriate Miniconda distribution.

If you face any issues, be sure your "conda" command specifically calls the executable under the Miniconda distribution (e.g., ~/miniconda3/condabin/conda). If the "conda activate" command gives an error the first time you run it, run "conda init bash" once and then restart your shell.

To install, go to the directory of your choice and run:

git clone https://github.com/ay-lab/dcHiC
cd dcHiC
conda env create -f ./packages/dchic.yml
conda activate dchic

Afterward, with the environment active, install the purpose-built processing functions with R CMD INSTALL functionsdchic_1.0.tar.gz (the file is under 'packages').

Option 2: Manual Installation

To install the dependencies manually, ensure that you have the following packages installed:

Packages in R

  • Rcpp
  • optparse
  • bench
  • bigstatsr
  • bigreadr
  • robust
  • data.table
  • networkD3
  • depmixS4
  • rjson
  • limma (bioconductor)
  • IHW (bioconductor)
  • R.utils
  • hashmap (.tar.gz file under 'packages')

Packages in Python

  • igv-reports

Those who wish to perform differential loop analysis should also download the latest Python version of FitHiC, which requires a set of Python libraries: numpy, scipy, scikit-learn, sortedcontainers, and matplotlib. You may also need to install 'cooler' if you wish to use .cool files. See documentation on how to do so.

Afterward, install the purpose-built processing functions with R CMD INSTALL functionsdchic_1.0.tar.gz (the file is under 'packages').

Check which R packages are already installed

Rscript -e 'plist <- c("functionsdchic","hashmap","R.utils","Rcpp","RcppEigen","BH","optparse","bench","bigstatsr","bigreadr","robust","data.table","networkD3","depmixS4","rjson","limma","IHW"); setdiff(plist,basename(find.package(plist)))'

If you get character(0) then you're all set, otherwise install the packages shown in the output.
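
A similar check can be sketched for the Python libraries that FitHiC needs. This is only an illustration, not part of dcHiC: the helper name missing_modules is hypothetical, and the import names are assumptions (for example, scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names are assumptions: the package name used at install time and the
# name used in "import ..." can differ (scikit-learn is imported as "sklearn").
FITHIC_MODULES = ["numpy", "scipy", "sklearn", "sortedcontainers", "matplotlib"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(FITHIC_MODULES)
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("All listed Python modules were found.")
```

An empty result plays the same role as character(0) in the R check: nothing is missing.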

Input File

Create an input file for dcHiC with the format below. The matrix and bed columns are for input data (see next section), whereas the replicate_prefix and experiment_prefix columns describe the hierarchy of data.

Note: Do not use dashes ("-") or dots (".") in the replicate or experiment prefix names.

<mat>         <bed>         <replicate_prefix>      <experiment_prefix>

For instance, consider this sample file which describes two replicates for two Hi-C profiles:

matr1_e1.txt  matr1_e1.bed   exp1_R1_100kb                  exp1
matr2_e1.txt  matr2_e1.bed   exp1_R2_100kb                  exp1
matr1_e2.txt  matr1_e2.bed   exp2_R1_100kb                  exp2
matr2_e2.txt  matr2_e2.bed   exp2_R2_100kb                  exp2
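
Before running dcHiC, it can help to check the input file against the two rules above (four columns, no dashes or dots in the prefixes). The following is a minimal sketch under those assumptions; validate_input_table is a hypothetical helper, not a dcHiC function.

```python
import io


def validate_input_table(handle):
    """Return a list of problems found in a dcHiC-style four-column input table."""
    problems = []
    for lineno, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            problems.append(f"line {lineno}: expected 4 columns, found {len(fields)}")
            continue
        _mat, _bed, replicate_prefix, experiment_prefix = fields
        for label, value in (("replicate", replicate_prefix),
                             ("experiment", experiment_prefix)):
            # The README asks users to avoid "-" and "." in both prefixes.
            if "-" in value or "." in value:
                problems.append(
                    f"line {lineno}: {label} prefix '{value}' contains '-' or '.'"
                )
    return problems


if __name__ == "__main__":
    sample = io.StringIO(
        "matr1_e1.txt  matr1_e1.bed  exp1_R1_100kb  exp1\n"
        "matr1_e2.txt  matr1_e2.bed  exp2-R1-100kb  exp2\n"
    )
    for problem in validate_input_table(sample):
        print(problem)
```

The second sample row is deliberately wrong, so the sketch reports one problem for it.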

Input Data

dcHiC accepts sparse matrices as its input (HiC-Pro style). If you have .cool or .hic files, see how to convert their format here.

To see the full list of options, run Rscript dchicf.r --help or view dchicdoc.txt here.

The matrix file should look like this:

<indexA> <indexB> <count>

1         1       300
1         2       30
1         3       10
2         2       200
2         3       20
3         3       200
 			....

... And the corresponding bed file like this:

<chr>	<start>	<end>	<index>

chr1	0	40000	1
chr1	40000	80000	2
chr1	80000	120000	3
 			....
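
Because the matrix refers to bins only by index, every index in the matrix file should appear in the fourth column of the bed file. The sketch below checks that; the function names are hypothetical, and it assumes whitespace-separated columns as in the examples above.

```python
def read_bed(lines):
    """Map bin index -> (chrom, start, end) from bed lines with at least 4 columns."""
    bins = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        chrom, start, end, index = fields[:4]
        bins[int(index)] = (chrom, int(start), int(end))
    return bins


def read_matrix(lines):
    """Return (indexA, indexB, count) triplets from sparse-matrix lines."""
    triplets = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        triplets.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return triplets


def unknown_indices(bins, triplets):
    """Return the sorted matrix indices that have no matching bed entry."""
    return sorted({i for a, b, _count in triplets for i in (a, b) if i not in bins})


if __name__ == "__main__":
    bed = ["chr1\t0\t40000\t1", "chr1\t40000\t80000\t2", "chr1\t80000\t120000\t3"]
    matrix = ["1 1 300", "1 2 30", "1 3 10", "2 2 200", "2 3 20", "3 3 200"]
    print(unknown_indices(read_bed(bed), read_matrix(matrix)))
```

An empty list means the two files are consistent with each other.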

Blacklisted Regions

Many high-throughput genomics studies "blacklist" problematic mapping regions (see the study here). If you wish to blacklist regions from your data, you may do so by adding a fifth column to your bed file, containing 1 in rows that should be blacklisted and 0 elsewhere:

<chr>	<start>	<end>	<index>	<blacklisted>

chr1	0	40000	1	0
chr1	40000	80000	2	1
 			....
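
One way to fill the fifth column is to flag every bin that overlaps a blacklisted interval. This sketch is an assumption about how a user might prepare the file, not dcHiC code: add_blacklist_column is a hypothetical helper, intervals are treated as half-open, and any overlap flags the whole bin.

```python
def add_blacklist_column(bed_rows, blacklist):
    """Append 1 to rows whose bin overlaps a blacklisted interval, else 0.

    bed_rows:  list of (chrom, start, end, index) tuples
    blacklist: list of (chrom, start, end) tuples
    """
    flagged = []
    for chrom, start, end, index in bed_rows:
        # Half-open intervals overlap when each one starts before the other ends.
        hit = any(
            b_chrom == chrom and b_start < end and b_end > start
            for b_chrom, b_start, b_end in blacklist
        )
        flagged.append((chrom, start, end, index, 1 if hit else 0))
    return flagged


if __name__ == "__main__":
    bins = [("chr1", 0, 40000, 1), ("chr1", 40000, 80000, 2)]
    regions = [("chr1", 50000, 60000)]  # made-up interval for illustration
    for row in add_blacklist_column(bins, regions):
        print("\t".join(str(value) for value in row))
```

With the made-up interval above, the second bin is flagged and the first is not, matching the layout of the example.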

Run Options

To see the full list of run options with examples of run code for each one, run Rscript dchicf.r --help. The most high-level option is --pcatype, which allows users to perform different types of step-wise analysis. Each of these run options will require other input information.

--pcatype option    Meaning
cis                 Find compartments on a cis interaction matrix
trans               Find compartments on a trans interaction matrix
select              Select the best PC for downstream analysis [must be run after the cis or trans step]
analyze             Perform differential analysis on the selected PCs [must be run after the select step]
subcomp             Optional: assign sub-compartments based on PC magnitude values using HMM segmentation
fithic              Run Fit-Hi-C to identify loops before running dloop (optional but recommended)
dloop               Find differential loops anchored in at least one of the differential compartments across the samples (optional but recommended)
viz                 Generate the IGV visualization HTML file. The other steps must have been run in order first (the optional ones are not strictly necessary).
enrich              Perform gene enrichment analysis (GSEA) of genes in differential compartments/loops

Here is a sample full run using the traditional cis matrix for compartment analysis:

Rscript dchicf.r --file input.ES_NPC.txt --pcatype cis --dirovwt T --cthread 2 --pthread 4
Rscript dchicf.r --file input.ES_NPC.txt --pcatype select --dirovwt T --genome mm10
Rscript dchicf.r --file input.ES_NPC.txt --pcatype analyze --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype fithic --dirovwt T --diffdir ES_vs_NPC_100Kb --fithicpath "/path/to/fithic.py" --pythonpath "/path/to/python"
Rscript dchicf.r --file input.ES_NPC.txt --pcatype dloop --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype subcomp --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype viz --diffdir ES_vs_NPC_100Kb --genome mm10 
Rscript dchicf.r --file input.txt --pcatype enrich --genome mm10 --diffdir conditionA_vs_conditionB --exclA F --region both --pcgroup pcQnm --interaction intra --pcscore F --compare F
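
Because the steps must run in a fixed order, some users may prefer to drive them from a small script. The sketch below builds the required steps (cis, select, analyze) plus viz, using only the flags shown in the sample run; build_core_commands and run_in_order are hypothetical helpers, and the optional steps are left out.

```python
import subprocess


def build_core_commands(input_file, diffdir, genome, script="dchicf.r"):
    """Return the cis -> select -> analyze -> viz commands as argument lists."""
    base = ["Rscript", script, "--file", input_file]
    return [
        base + ["--pcatype", "cis", "--dirovwt", "T", "--cthread", "2", "--pthread", "4"],
        base + ["--pcatype", "select", "--dirovwt", "T", "--genome", genome],
        base + ["--pcatype", "analyze", "--dirovwt", "T", "--diffdir", diffdir],
        base + ["--pcatype", "viz", "--diffdir", diffdir, "--genome", genome],
    ]


def run_in_order(commands):
    """Run each command in turn; check=True stops at the first failing step."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for command in build_core_commands("input.ES_NPC.txt", "ES_vs_NPC_100Kb", "mm10"):
        print(" ".join(command))
```

Stopping at the first failure matters here because each later step reads the output of the earlier ones.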

Output

As output, dcHiC creates two types of directories. The first holds raw PCA results, in directories named after the third column of the input file. One of these is created for each input Hi-C profile; inside, there will be an "intra_pca" or "inter_pca" directory, depending on whether the user specified compartment calculations based on intra- or inter-chromosomal interactions, each containing raw PC values for every chromosome.

The second overarching directory is called DifferentialResult, which contains directories for differential results (on any number of parameter settings). These directory names are set in the analyze step (--pcatype analyze, which performs differential calling), where users pass --diffdir to name the directory in which the analysis is stored. Multiple directories, with different analysis parameters, can be stored under the global DifferentialResult directory.

Inside each diffdir, there are raw compartment results ("expXX_data") and two PC output directories, pcOri and pcQnm, with combined and quantile-normalized compartment results. Finally, there will be a directory fdr_result containing differential compartment, loop, and subcompartment results. Inside fdr_result, the sample_combined files contain complete bedGraphs with average PC values across replicates for all XX cell lines, as well as a final adjusted p-value denoting the significance of changes between Hi-C experiments for that compartment bin. The sample_combined.Filtered files contain the same information, filtered by a p-value cutoff.

Other files, for subcompartments and compartmentLoops, may also be present depending on whether the user opted to run those options. The differential loop files list significant loop interactions and their associated differential compartment anchors, whereas the subcompartment files illustrate HMM-segmented subcompartments based on the magnitude of the PC values.

Below is a diagram of the overarching results structure, containing two different runs:

dcHiC_dir
 exp1_rep1_100kb_pca
   intra_pca
      [files]
   inter_pca
      [files]
 exp1_rep2_100kb_pca
 exp2_rep1_100kb_pca
 exp2_rep2_100kb_pca
 DifferentialResult
   inter_100kb_diff
     [files]
   intra_100kb_diff
     exp1_data
     exp2_data
     fdr_result
     fithic_run
     geneEnrichment
     pcOri
     pcQnm
     viz
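
To confirm which steps have produced output for a given run, one can check which of the subdirectories in the diagram exist. A minimal sketch, assuming the layout shown above; present_subdirs is a hypothetical helper, and the demo builds a throwaway directory tree rather than reading real results.

```python
import tempfile
from pathlib import Path

# Subdirectory names copied from the diagram above.
DIFFDIR_SUBDIRS = ["fdr_result", "fithic_run", "geneEnrichment", "pcOri", "pcQnm", "viz"]


def present_subdirs(root, diffdir, names=DIFFDIR_SUBDIRS):
    """Report which expected subdirectories exist under DifferentialResult/<diffdir>."""
    base = Path(root) / "DifferentialResult" / diffdir
    return {name: (base / name).is_dir() for name in names}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Fake a run in which only the analyze step has finished.
        for name in ("fdr_result", "pcOri", "pcQnm"):
            (Path(tmp) / "DifferentialResult" / "intra_100kb_diff" / name).mkdir(parents=True)
        for name, found in present_subdirs(tmp, "intra_100kb_diff").items():
            print(f"{name}: {'present' if found else 'missing'}")
```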

Technical Specifications

There are a few technical implementation items to note:

If you run into issues while running dcHiC, removing chrM, chrY and other non-standard chromosomes may help.

Support for other genomes: dcHiC has been extensively tested only on human and mouse genomes, but it supports most other commonly used genomes available from the UCSC genome page. To use one, create a folder named {genome}_{resolution}_goldenpathData (e.g., hg38_100000_goldenpathData).

Within that folder put three files:

  • {genome}.fa (e.g. hg38.fa)
  • {genome}.tss.bed (e.g. hg38.tss.bed, the TSS file. Please make sure the TSS position is selected based on the strand direction!)
  • {genome}.chrom.sizes (e.g. hg38.chrom.sizes).

These files can be found under the UCSC bigZips page for the specified genome. When running dcHiC use the --gfolder option in the select step to provide the folder path, and dcHiC will create the necessary files.
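
The naming convention above can be checked before the select step. The sketch below only builds the expected folder and file names and reports which files are absent; the function names are hypothetical and nothing here is downloaded or created.

```python
from pathlib import Path


def goldenpath_folder_name(genome, resolution):
    """Folder name in the {genome}_{resolution}_goldenpathData convention."""
    return f"{genome}_{resolution}_goldenpathData"


def expected_files(genome):
    """The three files the README asks users to place in the folder."""
    return [f"{genome}.fa", f"{genome}.tss.bed", f"{genome}.chrom.sizes"]


def missing_files(parent, genome, resolution):
    """Return the expected file names that are not present in the folder."""
    folder = Path(parent) / goldenpath_folder_name(genome, resolution)
    return [name for name in expected_files(genome) if not (folder / name).is_file()]


if __name__ == "__main__":
    print(goldenpath_folder_name("hg38", 100000))
    print(missing_files(".", "hg38", 100000))
```

The path of the folder checked here is what would be passed to --gfolder.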

Compartment clustering: Due to statistical noise, edge cases, and other factors, lone differential compartments occasionally crop up (e.g., one bin is "significant" but none of its neighbors are). These may be meaningful when analyzing at coarse resolution, but can be misleading, especially at very fine resolution. By default, dcHiC does not filter any of these lone compartments; however, two parameters allow it. distclust is the distance threshold for nearby differential regions to count as a "cluster": if it is 0, only adjacent differential compartments form a cluster; if it is 1, differential compartments separated by up to 1 bin form a cluster. numberclust is the minimum number of significant bins a cluster must contain.
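
The effect of the two parameters can be illustrated on a list of significant bin indices from one chromosome. This is a sketch of the idea as described above, not dcHiC's implementation; filter_lone_bins is a hypothetical name, and the default values are chosen here so that nothing is filtered.

```python
def filter_lone_bins(significant_bins, distclust=0, numberclust=1):
    """Group significant bins into clusters and drop clusters that are too small.

    significant_bins: bin indices on a single chromosome
    distclust:   maximum number of non-significant bins allowed between
                 two significant bins of the same cluster
    numberclust: minimum number of significant bins a cluster must contain
    """
    clusters = []
    for idx in sorted(set(significant_bins)):
        # idx - previous - 1 is the number of bins lying between the two.
        if clusters and idx - clusters[-1][-1] - 1 <= distclust:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    return [cluster for cluster in clusters if len(cluster) >= numberclust]


if __name__ == "__main__":
    bins = [3, 4, 6, 10]
    print(filter_lone_bins(bins, distclust=0, numberclust=2))  # [[3, 4]]
    print(filter_lone_bins(bins, distclust=1, numberclust=2))  # [[3, 4, 6]]
```

In the example, bin 10 is dropped in both settings, while bin 6 survives only when a one-bin gap is tolerated.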

Quantile Normalization: Comparing raw Hi-C compartment values can be somewhat risky, as the quantitative nature of compartment profiles can vary between experiments (due to assay biases like crosslinking behavior, restriction enzyme, etc). As such, dcHiC quantile-normalizes PC values before performing differential calling, although raw results are also given.
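
For readers unfamiliar with the technique, quantile normalization replaces each value with the mean of the values holding the same rank across samples, so every sample ends up with the same distribution. The sketch below shows the textbook procedure in plain Python; it is not taken from dcHiC, which may handle ties and missing values differently.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length lists of PC values, one list per sample."""
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of bins")

    # Mean of the k-th smallest value across samples, for each rank k.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(sorted_sample[k] for sorted_sample in sorted_samples) / len(samples)
        for k in range(n)
    ]

    normalized = []
    for sample in samples:
        # Bin positions ordered from smallest to largest value (ties keep input order).
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, position in enumerate(order):
            out[position] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
    # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both samples contain the same set of values (1.5, 3.5, 5.5), while each bin keeps its rank within its own sample.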

Contact

For help with installation, technical issues, interpretation, or other details, feel free to raise an issue or contact us:

Abhijit Chakraborty, Jeffrey Wang, Ferhat Ay
