vpc-ccg / calib

Licence: MIT License
Calib clusters barcode tagged paired-end reads based on their barcode and sequence similarity.

Calib

Calib clusters paired-end reads using their barcodes and sequences. Calib is suitable for amplicon sequencing where a molecule is tagged, then PCR amplified with high depth, also known as Unique Molecular Identifier (UMI) sequencing.

Calib stands for Clustering without alignment using locality-sensitive hashing (LSH) and MinHashing of barcoded reads. Calib comes from the Arabic word قالب /IPA:qaːlib/, which means template and is a reference to Calib's use of LSH templates.

Installation

Calib has two main executables: calib and calib_cons. You can install Calib directly from source, or from conda.

From source

Calib main module has one prerequisite:

  • GCC with version 5.2 or higher

The Calib error correction module depends on SPOA v1.1.3, which in turn depends on CMake v3. The Makefile for Calib error correction assumes that cmake is on your PATH. However, you can also point to a specific CMake by setting the $CMAKE environment variable:

export CMAKE=path-to-cmake-v3

Then, clone this repository:

git clone -b v0.3.4 https://github.com/vpc-ccg/calib.git calib

To install Calib clustering module:

cd calib
make
cd ..

To install Calib error correction module:

cd calib
make -C consensus/
cd ..

From Conda

Just run:

conda install -c bioconda calib

This will install calib and calib_cons to your conda environment bin folder.

Other Calib scripts

The Calib repository includes a simulation module that was used to fine-tune Calib's clustering parameters. The module files are under the simulation directory. The module has some Python 3 prerequisites that can be easily satisfied using the Conda package manager:

Finally, if you want to generate the different plots (check this README), you also need to have:

These can also be easily installed using Conda.

Running Calib

The following assumes you have calib and calib_cons in your environment $PATH variable. This is done automatically by conda.

Clustering

To run Calib clustering, run:

calib -f <reads_1> -r <reads_2> -l <barcode_tag_length> -o <output_file_prefix>

For example:

calib -f R1.fastq -r R2.fastq -l 8 -o R.

Calib will cluster the reads in the <reads_1> and <reads_2> FASTQ files that are tagged with barcode tags of length <barcode_tag_length>. Note that this tag length is the length of the barcode tag on one mate of the paired-end reads. The output filename will be <output_file_prefix>cluster.
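As a minimal sketch (not part of Calib itself), the command above can be assembled from Python using only the flags shown in this section. The helper names calib_command, cluster_file_name, and run_clustering are hypothetical; running the last one assumes calib is on your PATH.

```python
import subprocess
from typing import List


def calib_command(reads_1: str, reads_2: str,
                  barcode_tag_length: int, output_prefix: str) -> List[str]:
    """Build the argument list for a Calib clustering run (hypothetical helper)."""
    return [
        "calib",
        "-f", reads_1,
        "-r", reads_2,
        "-l", str(barcode_tag_length),
        "-o", output_prefix,
    ]


def cluster_file_name(output_prefix: str) -> str:
    """The output filename is the prefix immediately followed by 'cluster'."""
    return output_prefix + "cluster"


def run_clustering(reads_1: str, reads_2: str,
                   barcode_tag_length: int, output_prefix: str) -> str:
    """Run calib and return the name of the cluster file it should write."""
    cmd = calib_command(reads_1, reads_2, barcode_tag_length, output_prefix)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cluster_file_name(output_prefix)
```

With the example above, cluster_file_name("R.") gives R.cluster, which is the file later passed to calib_cons.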

Output format

The output file will contain one line per input read. Each record is tab separated with the following columns:

  1. read_cluster_id: Consecutive integers starting at 0 and ending at number of clusters - 1
  2. read_node_id: Consecutive integers starting at 0 and ending at number of nodes - 1
  3. read_id: 0-based order of the read in the input files
  4. read_f_name: FASTQ name of the read's forward mate
  5. read_f_seq: FASTQ sequence of the read's forward mate
  6. read_f_qual: FASTQ quality sequence of the read's forward mate
  7. read_r_name: FASTQ name of the read's reverse mate
  8. read_r_seq: FASTQ sequence of the read's reverse mate
  9. read_r_qual: FASTQ quality sequence of the read's reverse mate
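The nine columns listed above can be read with a few lines of Python. This is a sketch that assumes exactly the tab-separated layout described in this section; the names ClusterRecord, parse_cluster_lines, and group_by_cluster are hypothetical and not part of Calib.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ClusterRecord(NamedTuple):
    cluster_id: int
    node_id: int
    read_id: int
    f_name: str
    f_seq: str
    f_qual: str
    r_name: str
    r_seq: str
    r_qual: str


def parse_cluster_lines(lines: Iterable[str]) -> List[ClusterRecord]:
    """Parse tab-separated cluster records, one per input read."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}")
        # The first three columns are integers; the rest stay as strings.
        records.append(ClusterRecord(int(fields[0]), int(fields[1]),
                                     int(fields[2]), *fields[3:]))
    return records


def group_by_cluster(records: Iterable[ClusterRecord]) -> Dict[int, List[int]]:
    """Map each read_cluster_id to the read_id values assigned to it."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec.cluster_id].append(rec.read_id)
    return dict(clusters)
```

For a real run, pass an open file handle on the cluster file (for example R.cluster) to parse_cluster_lines. Because the records are only grouped together when --sort is used, group_by_cluster does not rely on any particular line order.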

Clustering parameters

Calib clustering has several parameters that can be changed manually from their default pre-configuration:

  • --error_tolerance or -e: positive integer no larger than l, the barcode tag length
  • --kmer-size or -k: positive integer
  • --minimizer-count or -m: positive integer
  • --minimizer-threshold or -t: nonnegative integer no larger than m

The effect of changing these parameters might not be obvious. We recommend checking our parameter selection experiments before doing so.
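The constraints on the clustering parameters can be checked before launching a run. The sketch below only encodes the ranges stated in the list above; the function name check_clustering_parameters and the values in the usage note are hypothetical and are not Calib's defaults.

```python
from typing import List


def check_clustering_parameters(barcode_tag_length: int,
                                error_tolerance: int,
                                kmer_size: int,
                                minimizer_count: int,
                                minimizer_threshold: int) -> List[str]:
    """Return a list of violated constraints (empty if all are satisfied)."""
    problems = []
    if not 1 <= error_tolerance <= barcode_tag_length:
        problems.append("error tolerance must be a positive integer "
                        "no larger than the barcode tag length")
    if kmer_size < 1:
        problems.append("k-mer size must be a positive integer")
    if minimizer_count < 1:
        problems.append("minimizer count must be a positive integer")
    if not 0 <= minimizer_threshold <= minimizer_count:
        problems.append("minimizer threshold must be a nonnegative integer "
                        "no larger than the minimizer count")
    return problems
```

For instance, check_clustering_parameters(8, 1, 4, 6, 2) returns an empty list, while an error tolerance of 9 with a barcode tag length of 8 would be reported as a problem.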

Clustering multithreading

Calib clustering can run multi-threaded using:

  • --threads or -c: positive integer no larger than 8.

Note that Calib's runtime and memory do not scale well with an increased number of threads. Please check our thread scalability experiments to get an idea of the time vs. memory tradeoff of Calib clustering multithreading.

Other clustering parameters

Finally, Calib clustering has these parameters that are added for convenience:

  • --ignored-sequence-prefix-length or -p: nonnegative integer for the number of bases to ignore in clustering after the barcode tag in the read sequences.
  • --sort: a flag to tell Calib to group the reads of the same cluster together. Do not add this flag if you want a bit of a speed-up and don't care about sorting (the calib_cons module does not care about sorting).
  • --gzip-input or -g: set this flag if the input is gzipped.

Error Correction (consensus module)

To run Calib error correction, run:

calib_cons -c <cluster_file> -q <space_separated_FASTQ_list> -o <space_separated_output_prefix_list>

For example:

calib_cons -c R.cluster -q R1.fastq R2.fastq -o R1. R2.

Output format

Calib error correction will output two files per input FASTQ file. One file will be a FASTQ file containing one record per consensus generated. The second file will contain multiple sequence alignment (MSA) of the cluster sequences.

Error correction parameters

Error correction has two parameters:

  • --min-reads-per-cluster or -m: positive integer for the minimum number of reads required in a cluster to output the cluster consensus. Default is 2.
  • --threads or -t: positive integer for the number of threads to use. Default is 4.
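To illustrate the effect of --min-reads-per-cluster, the sketch below counts the reads per cluster in a cluster file and lists the clusters that meet the threshold. It is an illustration of the threshold as described above, not calib_cons's implementation; the function name clusters_meeting_threshold is hypothetical, and it assumes the first tab-separated column is read_cluster_id.

```python
from collections import Counter
from typing import Iterable, List


def clusters_meeting_threshold(cluster_lines: Iterable[str],
                               min_reads_per_cluster: int = 2) -> List[int]:
    """Return the sorted ids of clusters with enough reads for a consensus."""
    counts = Counter(
        int(line.split("\t", 1)[0])  # first column: read_cluster_id
        for line in cluster_lines
        if line.strip()              # ignore empty lines
    )
    return sorted(cid for cid, n in counts.items()
                  if n >= min_reads_per_cluster)
```

With the default threshold of 2, clusters made of a single read would not be expected to contribute a consensus record.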

Simulation module

Calib has a simulation module that generates paired-end UMI tagged reads. The simulation pipeline is Calib's Makefile itself. It generates the following components:

  • panel: A BED file containing the exon coordinates of a list of genes. Its Make variables are:
    • annotation: GTF annotation file
    • gene_list: Text file containing set of gene names from annotation, one per line.
    • num_genes: Number of genes to sample from gene_list to be selected for making panel.
  • molecules: A FASTA file containing randomly generated molecules that overlap with the regions in panel. Its Make variables are:
    • molecule_size_mu: Average size of generated molecule
    • molecule_size_dev: Standard deviation of the size of the generated molecule.
    • min_molecule_size: Minimum size cutoff for dropping any generated molecule
    • num_molecules: Number of molecules to generate, after any dropouts due to min_molecule_size
  • barcodes: Text file containing a set of barcode tags of the same length, one per line. Its Make variables are:
    • num_barcodes
    • barcode_length
  • barcoded_molecules: A FASTA file with molecules randomly tagged with random barcode tag from barcodes, one barcode tag for either end.
  • amplified_barcoded_molecules: A FASTA file containing PCR amplified barcoded_molecules. Its Make variables are:
    • pcr_cycles: Number of PCR cycles to perform
    • pcr_duplication_rate: Percentage of molecules to be selected for duplication in each PCR cycle from last PCR cycle.
    • pcr_error_rate: PCR substitution error rate per duplicated base in each PCR cycle.
  • simulate: A Make target to generate paired-end read FASTQ files. It has the following Make variables:
    • sequencing_machine: ART Illumina sequencing machine.
    • read_length: Read mate length to be generated

Since the Calib simulation pipeline is basically a Makefile, any target that depends on previous targets inherits their variables. For example:

make simulate num_molecules=1000

This will generate paired-end reads using all the default simulation parameters (check the Makefile header) but with num_molecules set to 1000.
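To make the roles of pcr_cycles, pcr_duplication_rate, and pcr_error_rate in the amplified_barcoded_molecules step more concrete, here is a toy PCR model. It is a sketch and not the code in the simulation directory: the function names amplify and mutate are hypothetical, the duplication rate is taken as a fraction between 0 and 1, and substitutions are drawn uniformly from the other three bases.

```python
import random
from typing import List


def mutate(seq: str, error_rate: float, rng: random.Random) -> str:
    """Copy seq, substituting each base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Assumption: a substitution picks one of the other bases uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def amplify(molecules: List[str], pcr_cycles: int,
            pcr_duplication_rate: float, pcr_error_rate: float,
            seed: int = 0) -> List[str]:
    """Toy PCR: each cycle duplicates a fraction of the previous cycle's pool."""
    rng = random.Random(seed)
    pool = list(molecules)
    for _ in range(pcr_cycles):
        # Select molecules from the last cycle's pool for duplication.
        n_selected = int(round(len(pool) * pcr_duplication_rate))
        selected = rng.sample(pool, n_selected)
        # Each duplicate may pick up substitution errors.
        pool.extend(mutate(m, pcr_error_rate, rng) for m in selected)
    return pool
```

In this model, a duplication rate of 1.0 doubles the pool in every cycle, and an error rate of 0.0 leaves every copy identical to its template.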

Citation

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve, Faraz Hach; Alignment-free clustering of UMI tagged DNA molecules, Bioinformatics, bty888, https://doi.org/10.1093/bioinformatics/bty888

Reporting issues and bugs

If you have any issues, questions, or bug reports, please open an issue and we will try to address it promptly.