seqan / raptor

License: BSD-3-Clause
A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.

Programming Languages

C++
CMake

Projects that are alternatives of or similar to raptor

Cfilter
Cuckoo Filter implementation in Go, better than Bloom Filters (unmaintained)
Stars: ✭ 772 (+1986.49%)
Mutual labels:  filter, bloom-filter
blex
Fast Bloom filter with concurrent accessibility, powered by :atomics module.
Stars: ✭ 34 (-8.11%)
Mutual labels:  filter, bloom-filter
ganon
ganon classifies short DNA sequences against large sets of genomic sequences efficiently, with download and update of references (RefSeq/Genbank), taxonomic (NCBI/GTDB) and hierarchical classification, customized reporting and more
Stars: ✭ 57 (+54.05%)
Mutual labels:  bloom-filter, k-mer
Boomfilters
Probabilistic data structures for processing continuous, unbounded streams.
Stars: ✭ 1,333 (+3502.7%)
Mutual labels:  filter, bloom-filter
Iir1
IIR realtime filter library written in C++
Stars: ✭ 224 (+505.41%)
Mutual labels:  filter
Sortfilterproxymodel
A nicely exposed QSortFilterProxyModel for QML
Stars: ✭ 214 (+478.38%)
Mutual labels:  filter
Ffmpegandroid
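Android app built on FFmpeg: audio cutting, concatenation, transcoding, encoding and decoding; video cutting, watermarking, screenshots, transcoding, encoding and decoding, and conversion to animated GIF; audio/video muxing and demuxing, dubbing; audio/video decoding, synchronization and playback; local FFmpeg streaming, real-time H264 and RTMP live streaming; FFmpeg filters: sketch, color balance, hue, lut, blur, 3x3 grid and more; lyrics parsing and display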
Stars: ✭ 2,858 (+7624.32%)
Mutual labels:  filter
Filter Console
Filter out unwanted `console.log()` output
Stars: ✭ 203 (+448.65%)
Mutual labels:  filter
Tablefilter
A Javascript library making HTML tables filterable and a bit more :)
Stars: ✭ 248 (+570.27%)
Mutual labels:  filter
Htmlpurifierbundle
HTML Purifier is a standards-compliant HTML filter library written in PHP.
Stars: ✭ 234 (+532.43%)
Mutual labels:  filter
Php Validate
Lightweight and feature-rich PHP validation and filtering library. Supports scene grouping, pre-filtering, array checking, custom validators, and custom messages.
Stars: ✭ 225 (+508.11%)
Mutual labels:  filter
Fuzzysort
Fast SublimeText-like fuzzy search for JavaScript.
Stars: ✭ 2,569 (+6843.24%)
Mutual labels:  filter
Torchdata
PyTorch dataset extended with map, cache etc. (tensorflow.data like)
Stars: ✭ 226 (+510.81%)
Mutual labels:  filter
Structured Filter
jQuery UI widget for structured queries like "Contacts where Firstname starts with A and Birthday before 1/1/2000 and State in (CA, NY, FL)"...
Stars: ✭ 213 (+475.68%)
Mutual labels:  filter
Rack Reducer
Declaratively filter data via URL params, in any Rack app, with any ORM.
Stars: ✭ 241 (+551.35%)
Mutual labels:  filter
Python Benedict
dict subclass with keylist/keypath support, I/O shortcuts (base64, csv, json, pickle, plist, query-string, toml, xml, yaml) and many utilities. 📘
Stars: ✭ 204 (+451.35%)
Mutual labels:  filter
Caddy Authz
Caddy-authz is a middleware for Caddy that blocks or allows requests based on access control policies.
Stars: ✭ 221 (+497.3%)
Mutual labels:  filter
Magiccamera3
30+Camera different effects with C++ and opengles 3.0
Stars: ✭ 235 (+535.14%)
Mutual labels:  filter
Fabulousfilter
Android library to animate Floating Action Button to Bottom Sheet Dialog and vice-versa
Stars: ✭ 2,477 (+6594.59%)
Mutual labels:  filter
Ios Gpuimage Plus
GPU accelerated image filters for iOS, based on OpenGL.
Stars: ✭ 217 (+486.49%)
Mutual labels:  filter

Raptor

A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Download and Installation

There may be performance benefits when compiling from source as the build can be optimized for the host system.

Install with bioconda (Linux)

conda install -c bioconda -c conda-forge raptor

Install with brew (Linux, macOS)

brew install brewsci/bio/raptor

Compile from source

Prerequisites
  • CMake >= 3.18
  • GCC 10, 11 or 12 (most recent minor version)
  • git

Refer to the SeqAn3 Setup Tutorial for more in-depth information.

Download current main branch

git clone --recurse-submodules https://github.com/seqan/raptor

Download specific version

E.g., for version 1.1.0:

git clone --branch raptor-v1.1.0 --recurse-submodules https://github.com/seqan/raptor

Or, from within an existing clone of the repository:

git checkout raptor-v1.1.0
git submodule update --init

Building

cd raptor
mkdir -p build
cd build
cmake ..
make

The binary is placed in the bin subdirectory of the build directory.

You may want to add the Raptor executable to your PATH:

export PATH="$(pwd)/bin:$PATH"
raptor --version

By default, Raptor will be built with host-specific optimizations (-march=native). This behavior can be disabled by passing -DRAPTOR_NATIVE_BUILD=OFF to CMake.

Example Data and Usage

A toy data set (124 MiB compressed, 983 MiB decompressed) is available for download:

wget https://ftp.imp.fu-berlin.de/pub/seiler/raptor/example_data.tar.gz
tar xfz example_data.tar.gz

After extraction, the example_data directory will look like this:

$ tree -L 2 example_data
example_data
├── 1024
│   ├── bins
│   └── reads
└── 64
    ├── bins
    └── reads

The bins folder contains a FASTA file for each bin, and the reads folder contains a FASTQ file for each bin with reads from the respective bin (with 2 errors). Additionally, mini.fastq (5 reads from each bin), all.fastq (concatenation of all FASTQ files) and all10.fastq (all.fastq repeated 10 times) are provided in the reads folder.

In the following, we will use the 64 data set. To build an index over all bins, we first prepare a file that contains one file path per line (a line corresponds to a bin) and use this file as input:

seq -f "example_data/64/bins/bin_%02g.fasta" 0 1 63 > all_bin_paths.txt
raptor build --kmer 19 --window 23 --size 8m --output raptor.index all_bin_paths.txt
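
Since the input file must contain one existing file path per line, it can help to check the list before starting a build. The following is a minimal sketch of such a check; check_bin_paths is a hypothetical helper name and not part of Raptor.

```shell
# Hypothetical pre-flight check, not part of Raptor.
# Reports every non-empty line of the given list that is not an existing file.
# Returns non-zero if at least one path is missing.
check_bin_paths() {
    missing=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue            # skip empty lines
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$1"
    [ "$missing" -eq 0 ]
}
```

For example, check_bin_paths all_bin_paths.txt could be run before raptor build, so that a typo in the seq format string is caught early.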

You may be prompted to enable or disable automatic update notifications. For questions, please consult the SeqAn documentation.

Afterwards, we can search for some reads:

raptor search --error 2 --index raptor.index --query example_data/64/reads/mini.fastq --output search.output

The output starts with a header section (lines starting with #). The header maps a number to each input file. After the header section, each line of the output consists of a read ID (in the toy example these are numbers) and the bins in which that read was found:

#0	example_data/64/bins/bin_00.fasta
#1	example_data/64/bins/bin_01.fasta
...
#62	example_data/64/bins/bin_62.fasta
#63	example_data/64/bins/bin_63.fasta
#QUERY_NAME	USER_BINS
0	0
1	0
2	0
3	0
4	0
16384	1
...
1015812	62
1032192	63
1032193	63
1032194	63
1032195	63
1032196	63
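
The output layout described above can be summarised with standard tools. The following is a minimal sketch that prints, for each read, the number of bins it was found in. It assumes that the columns are tab-separated and that multiple bins for one read are separated by commas; the excerpt above only shows single-bin results, so the comma separator is an assumption. count_hits is a hypothetical helper name.

```shell
# Hypothetical helper, not part of Raptor.
# Prints "<read ID><TAB><number of bins>" for every result line.
# Assumption: bins are comma-separated in the second tab-separated column.
count_hits() {
    awk -F'\t' '
        /^#/ { next }                               # skip the header section
        NF   { print $1 "\t" split($2, bins, ",") } # split returns the bin count
    ' "$1"
}
```

For example, count_hits search.output would list one line per read; a count of 0 means the read was not found in any bin.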

For a list of options, see the help pages:

raptor --help
raptor build --help
raptor search --help
raptor upgrade --help

Preprocessing the input

We offer the option to precompute the minimisers of the input files. This is useful for building indices of big datasets (in the range of several TiB), and it also allows the required index size to be estimated, since the number of minimisers is known. Following the example above, we would change the build step as follows:

First we precompute the minimisers and store them in a directory:

seq -f "example_data/64/bins/bin_%02g.fasta" 0 1 63 > all_bin_paths.txt
raptor build --kmer 19 --window 23 --compute-minimiser --output precomputed_minimisers all_bin_paths.txt

Then we run the build step again and use the computed minimisers as input:

seq -f "precomputed_minimisers/bin_%02g.minimiser" 0 1 63 > all_minimiser_paths.txt
raptor build --size 8m --output minimiser_raptor.index all_minimiser_paths.txt

The preprocessing applies the same cutoffs as used in Mantis (Pandey et al., 2018). This means that only minimisers that occur more often than the cutoff specifies are included in the output. If you wish to process all minimisers, you can use --disable-cutoffs.
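
As background for the --kmer and --window options, the following sketch illustrates the general idea of a minimiser: within every window of w bases, one representative k-mer is kept. With --kmer 19 --window 23, each window contains 23 - 19 + 1 = 5 k-mers. The sketch is a simplification that picks the lexicographically smallest k-mer; the ordering Raptor actually uses may differ (for example, it may be hash-based), and minimisers is a hypothetical helper name.

```shell
# Illustration only, not how Raptor computes minimisers.
# Usage: minimisers <k> <w> <sequence>
# Assumption: the smallest k-mer is chosen by plain string comparison.
minimisers() {
    awk -v k="$1" -v w="$2" -v seq="$3" 'BEGIN {
        n = length(seq); prev = 0
        for (i = 1; i + w - 1 <= n; i++) {              # slide a window of w bases
            best = ""; pos = 0
            for (j = i; j + k - 1 <= i + w - 1; j++) {  # every k-mer in the window
                kmer = substr(seq, j, k)
                if (pos == 0 || kmer < best) { best = kmer; pos = j }
            }
            if (pos != prev) print best                 # report each position once
            prev = pos
        }
    }'
}
```

Because neighbouring windows often share the same minimiser, far fewer k-mers are stored than the sequence contains, which is why precomputing them gives a handle on the index size.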

Partitioned indices

To reduce the overall memory consumption, the index can be divided into multiple parts; the number of parts must be a power of two. To do so, pass --parts n to raptor build, where n is the number of parts to create. This creates n files, each representing one part of the index. The --size parameter describes the overall size of the index. For example, --size 8g --parts 4 creates four 2 GiB indices. This reduces the memory consumption of raptor build and raptor search by approximately 6 GiB, since only one part is in memory at any given time. raptor search automatically detects the parts and does not need any special parameters.
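
The arithmetic in this example can be sketched as follows. index_part_sizes is a hypothetical helper, not part of Raptor; it takes the total size and the number of parts as plain integers (in GiB), checks the power-of-two requirement, and prints the size of one part followed by the approximate memory saved.

```shell
# Hypothetical helper, not part of Raptor.
# Usage: index_part_sizes <total size in GiB> <number of parts>
# Prints "<size per part> <approximate saving>", both in GiB (integer division).
index_part_sizes() {
    total=$1
    parts=$2
    # A power of two has exactly one bit set, so parts & (parts - 1) is zero.
    if [ "$parts" -lt 1 ] || [ $(( parts & (parts - 1) )) -ne 0 ]; then
        echo "number of parts must be a power of two" >&2
        return 1
    fi
    per_part=$(( total / parts ))
    echo "$per_part $(( total - per_part ))"
}
```

For the values from the text, index_part_sizes 8 4 reports 2 GiB per part and a saving of 6 GiB.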

Upgrading the index (v1.1.0 to v2.0.0)

An old index can be upgraded by running raptor upgrade and providing some information about how the index was constructed.

Authorship and Copyright

Raptor is being developed by Enrico Seiler, but also incorporates much work from other members of SeqAn.

Citation

In your academic work (including comparisons and pipelines), please cite:

  • Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi: https://doi.org/10.1016/j.isci.2021.102782

RECOMB 2021

Raptor was presented at the 25th International Conference on Research in Computational Molecular Biology.

Please see the License section for information on allowed use.

Supplementary

The subdirectory util contains applications and scripts related to the paper.

License

Raptor is open source software. However, certain conditions apply when you (re-)distribute and/or modify Raptor; please see the license.

Sponsorships

Vercel

Vercel is kind enough to build and host our documentation and even provide preview-builds within our pull requests. Check them out!
