PASSIONLab / BELLA

Licence: other
BELLA: a Computationally-Efficient and Highly-Accurate Long-Read to Long-Read Aligner and Overlapper

BELLA - Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

BELLA is a computationally efficient and highly accurate long-read to long-read aligner and overlapper. BELLA uses a k-mer based approach to detect overlaps between noisy, long reads. We demonstrated the feasibility of the k-mer based approach through a mathematical model based on Markov chains.

BELLA provides a novel algorithm for pruning k-mers that are unlikely to be useful in overlap detection and whose presence would only incur unnecessary computational costs. Our reliable k-mers detection algorithm explicitly maximizes the probability of retaining k-mers that belong to unique regions of the genome.

BELLA achieves fast overlapping without sketching using sparse matrix-matrix multiplication (SpGEMM), implemented utilizing high-performance software and libraries developed for this sparse matrix subroutine. Any novel sparse matrix format and multiplication algorithm would be applicable to overlap detection and enable continued performance improvements.

We coupled BELLA's overlap detection with our newly developed vectorized seed-and-extend banded-alignment algorithm. The choice of the optimal k-mer seed occurs through our binning mechanism, where k-mer positions within a read pair are used to estimate the length of the overlap and to "bin" k-mers to form a consensus. We developed and implemented a new method to separate true alignments from false positives depending on the alignment score. This method demonstrates that the probability of false positives decreases exponentially as the overlap length between sequences increases.
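To make the SpGEMM formulation concrete, here is a minimal pure-Python sketch of overlap detection as a sparse product. It assumes a reads-by-k-mers matrix A with A[i][j] = 1 when read i contains k-mer j, so that entry (a, b) of A·Aᵀ is the number of k-mers shared by reads a and b. The function name `shared_kmer_counts` and the toy reads are illustrative only. The sketch ignores reverse complements and k-mer pruning, and it is not BELLA's implementation, which relies on high-performance SpGEMM libraries.

```python
from collections import defaultdict
from itertools import combinations


def shared_kmer_counts(reads, k):
    """Count shared k-mers for every read pair (upper triangle of A * A^T).

    A is the sparse reads-by-k-mers matrix: A[i][j] = 1 if read i contains
    k-mer j. Only pairs with at least one shared k-mer appear in the result.
    """
    # Column-wise view of A: k-mer -> set of read indices that contain it.
    columns = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            columns[seq[i:i + k]].add(read_id)

    # Each column adds 1 to the count of every read pair it connects.
    counts = defaultdict(int)
    for read_ids in columns.values():
        for a, b in combinations(sorted(read_ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Toy reads, far shorter than real long reads.
    reads = ["ACGTACGTGA", "GTACGTGACC", "TTTTTTTTTT"]
    print(shared_kmer_counts(reads, k=5))  # {(0, 1): 4}
```

Only nonzero entries are ever materialized, which is why a sparse representation suits this problem: most read pairs in a data set share no k-mers at all.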

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Dependencies

  • COMPILER: the software requires gcc-6 or higher with OpenMP to be compiled.

  • CUDA to compile and use GPU-accelerated pairwise alignment. You do not need CUDA to use CPU-based pairwise alignment. Our stand-alone GPU-based pairwise alignment, named LOGAN, can be found here.

  • Python3 and simplesam are required to generate the ground truth data. You can install simplesam via pip:

pip install simplesam

Compile

Clone the repository and enter it:

git clone https://github.com/giuliaguidi/bella
cd bella

Build using makefile:

ln -s makefile-nersc Makefile
make bella        # CPU-only
make bella-gpu    # CPU/GPU

Run

To run with default settings:

./bella -f <list-of-fastq> -o <output-name>

BELLA requires a text file containing the path to the input fastq file(s) as the argument for the -f option. Example: input-example.txt
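If you have many FASTQ files, the list file can be generated rather than written by hand. The sketch below assumes the list contains one path per line (check input-example.txt for the exact layout) and that your files end in .fastq or .fq. The helper name `write_fastq_list` is hypothetical and is not part of BELLA.

```python
import tempfile
from pathlib import Path


def write_fastq_list(fastq_dir, list_path, patterns=("*.fastq", "*.fq")):
    """Write the absolute path of each matching FASTQ file, one per line.

    Returns the list of paths written, in sorted order.
    """
    root = Path(fastq_dir)
    paths = sorted({p.resolve() for pattern in patterns for p in root.glob(pattern)})
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


if __name__ == "__main__":
    # Demo in a throwaway directory with two empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("sample_a.fastq", "sample_b.fq"):
            (Path(tmp) / name).touch()
        list_file = Path(tmp) / "reads.txt"
        write_fastq_list(tmp, list_file)
        print(list_file.read_text(), end="")
```

The resulting file is what you would pass to the -f option.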

To show the usage:

./bella -h

Flag descriptions:

  -f, --fastq arg            List of Fastq(s) (required)
  -o, --output arg           Output Filename (required)
  -k, --kmer arg             K-mer Length (default: 17)
  -x, --xdrop arg            SeqAn X-Drop (default: 7)
  -e, --error arg            Error Rate (default: 0.15)
      --estimate             Estimate Error Rate from Data
      --skip-alignment       Overlap Only
  -m, --memory arg           Total RAM of the System in MB (default: 8000)
      --score-deviation arg  Deviation from the Mean Alignment Score [0,1]
                             (default: 0.1)
  -b, --bin-size arg         Bin Size for Binning Algorithm (default: 500)
      --paf                  Output in PAF format
  -g, --gpus arg             GPUs Available (default: 1)
      --split-count arg      K-mer Counting Split Count (default: 1)
      --hopc                 Use HOPC representation
  -w, --window arg           Window Size for Minimizer Selection (default: 0)
  -s, --syncmer              Enable Syncmer Selection
  -u, --upper-freq arg       K-mer Frequency Upper Bound (default: 8)
  -l, --lower-freq arg       K-mer Frequency Lower Bound (default: 2)
  -h, --help                 Usage

The error rate is used to compute the adaptive alignment threshold. For PacBio CCS/HiFi reads, set --error 0.005.
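The -l and -u options bound the k-mer frequencies that BELLA retains. As a rough illustration of what such a frequency filter does, the sketch below counts every k-mer occurrence across all reads and keeps those whose count falls within the bounds. Treating the bounds as inclusive and counting occurrences rather than distinct reads are both assumptions made here. The function name `kmers_within_bounds` is hypothetical, and BELLA's reliable k-mer selection is more involved than this.

```python
from collections import Counter


def kmers_within_bounds(reads, k, lower=2, upper=8):
    """Return the set of k-mers whose total count lies in [lower, upper].

    Assumption: bounds are inclusive and counts are occurrences over all reads.
    """
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if lower <= n <= upper}


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG", "AAAAAA"]
    # ACGT and TACG occur once (too rare); AAAA occurs three times (too frequent).
    print(sorted(kmers_within_bounds(reads, k=4, lower=2, upper=2)))
    # ['CGTA', 'GTAC']
```

K-mers seen only once are likely sequencing errors, and very frequent ones tend to come from repeats, so neither helps much in finding true overlaps.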

Memory Usage

The parallelism during the overlap detection phase depends on the number of available threads and on the available RAM (default: 8000 MB, see the -m option).

Compile with -DOSX or -DLINUX to have BELLA estimate the available RAM from your machine. If your machine has more RAM than the default, this makes the overlap detection phase faster.

Output Format

BELLA outputs alignments in a format similar to BLASR's M4 format. Example output (tab-delimited):

[A ID] [B ID] [# shared k-mers] [alignment score] [overlap length] [n=B fwd, c=B rc] [A start] [A end] [A length] [B start] [B end] [B length]

The positions are zero-based and refer to the forward strand, regardless of which strand the sequence is mapped to. If the --paf option is used, BELLA outputs alignments in PAF format. Example output (tab-delimited):

[A ID] [A length] [A start] [A end] ["+" = B fwd, "-" = B rc] [B ID] [B length] [B start] [B end] [alignment score] [overlap length] [mapping quality]
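For downstream scripting, both column layouts can be read into a common record. The sketch below follows the column orders listed above. The record type `Overlap`, the function names, and the sample lines are made up for illustration, and the alignment score is parsed as a float in case it is not an integer.

```python
from typing import NamedTuple, Optional


class Overlap(NamedTuple):
    a_id: str
    a_start: int
    a_end: int
    a_length: int
    b_id: str
    b_start: int
    b_end: int
    b_length: int
    b_reverse: bool                     # True if B is reverse-complemented
    score: float
    overlap_length: int
    shared_kmers: Optional[int] = None  # only in the default format
    mapq: Optional[int] = None          # only in the PAF format


def _fields(line, expected=12):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    return fields


def parse_default(line):
    """Parse one line of the default (M4-like) output."""
    f = _fields(line)
    return Overlap(
        a_id=f[0], b_id=f[1], shared_kmers=int(f[2]), score=float(f[3]),
        overlap_length=int(f[4]), b_reverse=(f[5] == "c"),
        a_start=int(f[6]), a_end=int(f[7]), a_length=int(f[8]),
        b_start=int(f[9]), b_end=int(f[10]), b_length=int(f[11]),
    )


def parse_paf(line):
    """Parse one line of the PAF output."""
    f = _fields(line)
    return Overlap(
        a_id=f[0], a_length=int(f[1]), a_start=int(f[2]), a_end=int(f[3]),
        b_reverse=(f[4] == "-"), b_id=f[5], b_length=int(f[6]),
        b_start=int(f[7]), b_end=int(f[8]), score=float(f[9]),
        overlap_length=int(f[10]), mapq=int(f[11]),
    )


if __name__ == "__main__":
    # Made-up records describing the same overlap in both layouts.
    default_line = "readA\treadB\t12\t850\t1000\tn\t0\t1000\t5000\t3000\t4000\t4000"
    paf_line = "readA\t5000\t0\t1000\t+\treadB\t4000\t3000\t4000\t850\t1000\t255"
    print(parse_default(default_line))
    print(parse_paf(paf_line))
```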

Performance Evaluation

The repository also contains the code to compute the recall and precision of BELLA and other long-read aligners (Minimap, Minimap2, DALIGNER, MHAP and BLASR).

  • Ground truth generation for real data set: SAMparser.py converts the Minimap2 .sam output file into a simpler format that the evaluation code accepts as input when using a real data set.
minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio subreads
samtools view -h -Sq 10 -F 4 aln.sam > mapped_q10.sam   # keep mapped reads with mapping quality of at least 10
samtools view -h mapped_q10.sam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -S -h > unique_mapped_q10.sam   # remove reads mapped to multiple locations
python3 SAMparser.py <bwamem/minimap2-output>
  • Ground truth generation for synthetic data set: mafconvert.py converts the .MAF file from PBSIM (PacBio read simulator) into a simpler format that the evaluation code accepts as input when using a synthetic data set.
python script/mafconvert.py axt <maf-file> > <ground-truth.txt>
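To clarify what the two samtools filtering steps for real data select, here is a rough pure-Python equivalent operating on SAM text lines. It is meant as an explanation of the filter, not a replacement for samtools. The function name `keep_alignment` and the sample records are made up for illustration.

```python
def keep_alignment(sam_line, min_mapq=10):
    """Return True if a SAM line survives the filtering described above.

    Header lines (starting with '@') are always kept, as with `samtools view -h`.
    """
    if sam_line.startswith("@"):
        return True
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & 0x4:          # -F 4: drop unmapped reads
        return False
    if mapq < min_mapq:     # -q 10: drop low mapping quality
        return False
    # grep -v XA:Z: / SA:Z: drops reads with alternative or supplementary hits.
    # Optional tags start at the 12th column of a SAM record.
    return not any(tag.startswith(("XA:Z:", "SA:Z:")) for tag in fields[11:])


if __name__ == "__main__":
    records = [
        "@SQ\tSN:ref\tLN:1000",
        "r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t0\tref\t1\t5\t4M\t*\t0\t0\tACGT\tIIII",
        "r4\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tSA:Z:ref,100,+,4M,60,0;",
    ]
    for line in filter(keep_alignment, records):
        print(line.split("\t")[0])  # prints @SQ, then r1
```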

To run the evaluation program:

cd benchmark
make result
./result -G <ground-truth-file> [-B <bella-output>] [-m <minimap/minimap2-output>] [-D <daligner-output>] [-L <blasr-output>] [-H <mhap-output>] [-M <mecat-output>] [-i <mecat-idx2read-file>]

If the output of BELLA is in PAF format, pass it to the evaluation program through the -m flag, which is the one used for minimap/minimap2 output.

To show the usage:

./result -h

NOTE: add -z flag if simulated data is used.
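For reference, recall and precision over overlap pairs can be computed as in the sketch below. It treats an overlap as an unordered pair of read IDs and counts a prediction as correct if the same pair appears in the ground truth. This is a simplification chosen for illustration: the criteria the evaluation program actually applies are not described here, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(truth_pairs, predicted_pairs):
    """Return (precision, recall) for predicted overlaps against ground truth.

    Pairs are unordered, so ("r1", "r2") and ("r2", "r1") are the same overlap.
    """
    truth = {frozenset(pair) for pair in truth_pairs}
    predicted = {frozenset(pair) for pair in predicted_pairs}
    true_positives = len(truth & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
    predicted = [("r2", "r1"), ("r1", "r4")]
    precision, recall = precision_recall(truth, predicted)
    print(f"precision={precision:.2f} recall={recall:.2f}")
    # precision=0.50 recall=0.33
```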

Demo

You can download an E. coli 30X dataset here to test BELLA. For this dataset, you can use the following single mapped ground truth to run the evaluation code: ecsample_singlemapped_q10.txt. A detailed description of the procedure we use to generate the ground truth for real data can be found in our preprint.

You can run the evaluation code located in the /bench folder as:

./result -G ecsample_singlemapped_q10.txt -B <bella-output>

I get 0 outputs, what is likely going wrong?

Error rate estimation might have gone wrong. If the estimated error rate is greater than 1, the adaptive alignment threshold becomes so high that no alignment passes it. Please check that your fastq file has proper quality values. If it does not, set the error rate explicitly with the -e/--error option.
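A quick way to sanity-check the quality values in a fastq file is to convert them to error probabilities yourself. The sketch below uses the standard Phred relation p = 10^(-Q/10) and assumes Phred+33 encoding and four-line FASTQ records. The function names are hypothetical, and BELLA's own estimator may work differently, so treat the result only as a plausibility check.

```python
def fastq_qualities(lines):
    """Yield the quality string (4th line) of each 4-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def mean_error_rate(quality_strings, offset=33):
    """Average per-base error probability implied by Phred quality characters."""
    total, bases = 0.0, 0
    for quality in quality_strings:
        for char in quality:
            phred = ord(char) - offset
            total += 10 ** (-phred / 10)
            bases += 1
    if bases == 0:
        raise ValueError("no quality values found")
    return total / bases


if __name__ == "__main__":
    # '+' encodes Q10 (p = 0.1) and '5' encodes Q20 (p = 0.01).
    record = ["@read1\n", "AC\n", "+\n", "+5\n"]
    print(round(mean_error_rate(fastq_qualities(record)), 3))  # 0.055
```

A file whose quality lines consist only of '!' (Q0) yields an average of 1.0, which is a sign that the qualities are placeholders rather than real estimates.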

Citation

To cite our work or to know more about our methods, please refer to:

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper. Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç. bioRxiv 464420; doi: https://doi.org/10.1101/464420.

Authors

Contributors

Copyright Notice

Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper (BELLA), Copyright (c) 2018, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) Giulia Guidi and Marco Santambrogio. All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at [email protected].

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.

Acknowledgments

Funding provided in part by DOE ASCR through the Exascale Computing Project, and computing provided by NERSC. Thanks to Rob Egan and Steven Hofmeyr for valuable discussions. Thanks to NECST Laboratory and Ed Younis for key collaborations.