bingmann / cobs

Licence: MIT license
COBS - Compact Bit-Sliced Signature Index (for Genomic k-Mer Data or q-Grams)

Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and to process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives; their number decreases exponentially with the query length and depends on the false positive rate of the index, which is fixed at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
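The idea of combining Bloom filters with a coverage threshold can be illustrated with a small toy model. The following is a minimal sketch only, not COBS' actual data structure or hashing: the class `ToySignatureIndex`, the helper names, the SHA-256 based hashing, and the default parameters are all assumptions made for illustration. Each document gets one Bloom filter, the filters are stored row-wise so that one row holds the same bit position of every document, and a query counts for each document how many of the pattern's k-mers appear to be present.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_positions(term, num_bits, num_hashes):
    """Map a term to num_hashes row positions (toy hashing, an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{j}:{term}".encode()).digest()[:8], "big") % num_bits
        for j in range(num_hashes)
    ]

class ToySignatureIndex:
    """One Bloom filter per document, stored row by row ("bit-sliced")."""

    def __init__(self, docs, k, num_bits=1024, num_hashes=3):
        self.k, self.num_bits, self.num_hashes = k, num_bits, num_hashes
        self.names = list(docs)
        # rows[p] is an integer bitmask: bit d is set if document d has bit p set.
        self.rows = [0] * num_bits
        for d, name in enumerate(self.names):
            for term in kmers(docs[name], k):
                for p in hash_positions(term, num_bits, num_hashes):
                    self.rows[p] |= 1 << d

    def query(self, pattern, threshold=0.8):
        """Return {document: fraction of query k-mers found} at or above threshold."""
        terms = kmers(pattern, self.k)
        if not terms:
            return {}
        counts = [0] * len(self.names)
        for term in terms:
            hit = (1 << len(self.names)) - 1
            for p in hash_positions(term, self.num_bits, self.num_hashes):
                hit &= self.rows[p]  # keep documents whose filter has every bit set
            for d in range(len(self.names)):
                counts[d] += (hit >> d) & 1
        return {
            name: count / len(terms)
            for name, count in zip(self.names, counts)
            if count / len(terms) >= threshold
        }
```

A Bloom filter never misses a term that was inserted, so a document containing every k-mer of the pattern always reaches a coverage of 1.0. A false positive needs several k-mers to collide at once, which is why longer queries make spurious matches less likely.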

[Figure: COBS architecture]

COBS has two interfaces: a command line tool (cobs, described below) and a Python interface.

More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE). Pages 285-303. Springer. October 2019. Preprint arXiv:1905.09624.

If you use COBS in an academic context or publication, please cite our paper:

@InProceedings{bingmann2019cobs,
  author =       {Timo Bingmann and Phelim Bradley and Florian Gauger and Zamin Iqbal},
  title =        {{COBS}: a Compact Bit-Sliced Signature Index},
  booktitle =    {26th International Symposium on String Processing and Information Retrieval (SPIRE)},
  year =         2019,
  series =       {LNCS},
  pages =        {285--303},
  month =        oct,
  organization = {Springer},
  note =         {preprint arXiv:1905.09624},
}

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

To download and install COBS, run:

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake ..
make -j4

and optionally run make test to check the build.

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fna, *.ffn, *.faa, *.frn, *.fa.gz, *.fasta.gz, *.fna.gz, *.ffn.gz, *.faa.gz, *.frn.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). See below for details on how they are parsed.

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.

To check the list of documents to be indexed, run for example:

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents, run:

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Alternatively, construct a compact COBS index from a list of documents by running:

src/cobs compact-construct tests/data/fasta_files.list example.cobs_compact

The paths in the file list can be absolute or relative to the file list's path. Note that *.txt files are read as verbatim text files by default. To make COBS read a .txt file as a file list instead, pass --file-type list.

Check --help for the many further options.
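A file list can also be generated by a small script. The sketch below is an illustration under assumptions: the helper `write_file_list` is hypothetical, it assumes the list file holds one path per line, and the default suffixes are only a subset of the extensions named above. It writes the paths relative to the list file's own directory, matching the rule that relative paths are resolved against the file list's path.

```python
import os
from pathlib import Path

def write_file_list(list_path, doc_dir, suffixes=(".fa", ".fasta", ".fa.gz", ".fasta.gz")):
    """Write matching document paths, one per line, relative to the list file."""
    list_path = Path(list_path).resolve()
    doc_dir = Path(doc_dir).resolve()
    # Recursively collect regular files whose name ends in one of the suffixes.
    docs = sorted(
        p for p in doc_dir.rglob("*")
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )
    # Assumption: one path per line, relative to the directory of the list file.
    lines = [os.path.relpath(p, list_path.parent) for p in docs]
    list_path.write_text("".join(line + "\n" for line in lines))
    return lines
```

Giving the output file a .list extension, for example `write_file_list("docs.list", "tests/data/fasta")`, avoids it being treated as a verbatim text document.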

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

or pass a FASTA file of queries with

src/cobs query -i example.cobs_compact -f query.fa

Multiple indices can be queried at once by adding more -i parameters.
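When queries are issued from a script, the command line can be assembled programmatically. The following sketch uses only the `query` subcommand and the `-i` and `-f` parameters shown above; the function names and the default binary path `src/cobs` are assumptions for illustration. Since the output format of the query tool is not described here, `run_query` simply returns the raw text.

```python
import subprocess

def build_query_command(indices, pattern=None, query_file=None, cobs="src/cobs"):
    """Build the argument list for a cobs query over one or more indices."""
    if (pattern is None) == (query_file is None):
        raise ValueError("pass exactly one of pattern or query_file")
    cmd = [cobs, "query"]
    for index in indices:
        cmd += ["-i", str(index)]  # one -i parameter per index
    if query_file is not None:
        cmd += ["-f", str(query_file)]
    else:
        cmd.append(pattern)
    return cmd

def run_query(indices, pattern=None, query_file=None, cobs="src/cobs"):
    """Run the query and return its standard output as unparsed text."""
    cmd = build_query_command(indices, pattern, query_file, cobs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.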

Python Interface

COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. The first diagram shows the main results, scaling by the number of documents in the index; the second diagram shows the results per document.

[Figure: experimental results, scaling by number of documents] [Figure: experimental results, scaling per document]

More Details

File Types and How They Are Parsed

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). Each file type is parsed slightly differently into q-grams or k-mers.

FASTA files are parsed as one document each. If a FASTA file contains multiple sequences or reads, they are combined into one document. Multiple sequences (separated by comment lines) are NOT simply concatenated; instead, the k-mers are extracted separately from each sequence. This means no erroneous k-mers are created that span the end of one sequence and the beginning of the next. All newlines within a sequence are removed.
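The difference between per-sequence extraction and naive concatenation is easy to see in a few lines. This is a simplified sketch, not COBS' parser: the function names are hypothetical, and it assumes every line starting with ">" begins a new sequence.

```python
def read_fasta_sequences(lines):
    """Yield each sequence of a FASTA file as one string, newlines removed."""
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):  # comment line: a new sequence begins
            if parts:
                yield "".join(parts)
            parts = []
        elif line:
            parts.append(line)
    if parts:
        yield "".join(parts)

def document_kmers(lines, k):
    """Collect the k-mers of one FASTA document, sequence by sequence."""
    terms = set()
    for seq in read_fasta_sequences(lines):
        terms.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return terms
```

For the two sequences ACGT and GGCC with k = 3, the result contains ACG, CGT, GGC, and GCC. Concatenating the sequences first would also produce GTG and TGG, which occur in neither sequence.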

The k-mers from DNA sequences are automatically canonicalized: of each k-mer and its reverse complement, the lexicographically smaller one is indexed. Adding the flag --no-canonicalize skips this process. With canonicalization only the letters ACGT are indexed; every other letter is mapped to binary zeros and indexed with the other data. A warning is printed for each FASTA/FASTQ file containing a non-ACGT letter, but processing continues. With the flag --no-canonicalize any letters or text can be indexed.
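Canonicalization can be sketched as follows. The function names are hypothetical, and the sketch only handles the letters ACGT; it does not model how COBS encodes other letters.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))
```

Both strands of the same DNA fragment then map to the same term: `canonical("TTTG")` and `canonical("CAAA")` both give `"CAAA"`.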

FASTQ files are also parsed as one document each. The quality information is dropped, and everything else is effectively parsed identically to FASTA files.

Multi-FASTA or Multi-FASTQ files are parsed as many documents. Each sequence in the FASTA or FASTQ file is considered a separate document in the COBS index. Their names are appended with _### where ### is the index of the subdocument.
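The splitting into subdocuments might look roughly like the sketch below. The function name is hypothetical, and the exact form of the suffix is an assumption: the sketch numbers subdocuments from 0 without padding, which may differ from what COBS actually produces.

```python
def split_multi_fasta(lines, base_name):
    """Map each sequence of a Multi-FASTA file to its own named document."""
    docs = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumption: suffix is the zero-based subdocument index, unpadded.
            name = f"{base_name}_{len(docs)}"
            docs[name] = ""
        elif line and name is not None:
            docs[name] += line
    return docs
```

In contrast to a plain FASTA file, which becomes a single document, each sequence here ends up as a separate entry.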

McCortex files (*.ctx) contain a list of k-mers, and these k-mers are indexed individually. The graph information is ignored. Only k=31 is currently supported.

Text files (*.txt) are parsed as verbatim binary documents. All q-grams are extracted, including newlines and other whitespace.
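For text documents, verbatim parsing means that whitespace is part of the q-grams. A minimal sketch, with a hypothetical function name and operating on raw bytes:

```python
def text_qgrams(data: bytes, q: int):
    """Return all q-grams of a byte string, whitespace and newlines included."""
    return {data[i:i + q] for i in range(len(data) - q + 1)}
```

For the input `b"ab\ncd"` and q = 3, the q-grams are `b"ab\n"`, `b"b\nc"`, and `b"\ncd"`, so a query only matches across a line break if it contains the newline itself.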
