All Projects → dnbaker → Dashing

dnbaker / Dashing

Licence: gpl-3.0
Fast and accurate genomic distances using HyperLogLog

Labels

Projects that are alternatives of or similar to Dashing

Elasticsearch
The missing elasticsearch ORM for Laravel, Lumen and Native php applications
Stars: ✭ 375 (+186.26%)
Mutual labels:  indexing
Filemasta
A search application to explore, discover and share online files
Stars: ✭ 571 (+335.88%)
Mutual labels:  indexing
Algolia Webcrawler
Simple node worker that crawls sitemaps in order to keep an algolia index up-to-date
Stars: ✭ 40 (-69.47%)
Mutual labels:  indexing
Mayan Edms
Repository mirror of GtLab: https://gitlab.com/mayan-edms/mayan-edms Please use the upstream repository for issues and pull requests.
Stars: ✭ 398 (+203.82%)
Mutual labels:  indexing
Pgm Index
🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
Stars: ✭ 499 (+280.92%)
Mutual labels:  indexing
Elki
ELKI Data Mining Toolkit
Stars: ✭ 613 (+367.94%)
Mutual labels:  indexing
Js Worker Search
JavaScript client-side search API with web-worker support
Stars: ✭ 345 (+163.36%)
Mutual labels:  indexing
Completely
Java autocomplete library.
Stars: ✭ 90 (-31.3%)
Mutual labels:  indexing
Ethql
A GraphQL interface to Ethereum 🔥
Stars: ✭ 547 (+317.56%)
Mutual labels:  indexing
Solrbulk
SOLR bulk indexing utility for the command line.
Stars: ✭ 35 (-73.28%)
Mutual labels:  indexing
Opensearchserver
Open-source Enterprise Grade Search Engine Software
Stars: ✭ 408 (+211.45%)
Mutual labels:  indexing
Faster
Fast persistent recoverable log and key-value store + cache, in C# and C++.
Stars: ✭ 4,846 (+3599.24%)
Mutual labels:  indexing
Hugo Elasticsearch
Generate Elasticsearch indexes for Hugo static sites by parsing front matter
Stars: ✭ 19 (-85.5%)
Mutual labels:  indexing
Seqan
SeqAn's official repository.
Stars: ✭ 386 (+194.66%)
Mutual labels:  indexing
Tinspin Indexes
Spatial index library with R*Tree, STR-Tree, Quadtree, CritBit, KD-Tree, CoverTree
Stars: ✭ 64 (-51.15%)
Mutual labels:  indexing
Xapiand
Xapiand: A RESTful Search Engine
Stars: ✭ 347 (+164.89%)
Mutual labels:  indexing
Hypopg
Hypothetical Indexes for PostgreSQL
Stars: ✭ 594 (+353.44%)
Mutual labels:  indexing
Btree4j
Disk-based B+-tree written in Pure Java
Stars: ✭ 122 (-6.87%)
Mutual labels:  indexing
Pumpkindb
Immutable Ordered Key-Value Database Engine
Stars: ✭ 1,219 (+830.53%)
Mutual labels:  indexing
Deepdatabase
A relational database engine using B+ tree indexing
Stars: ✭ 32 (-75.57%)
Mutual labels:  indexing

Dashing 🕺 Build Status install with bioconda

dashing sketches and computes distances between fasta and fastq data.

Our paper is available here as a preprint and here at Genome Biology.

Use

The easiest way to use dashing is to grab a binary release. These are located in dashing/release/{osx,linux}/dashing_s{128,256,512}, where dashing_s128, dashing_s256, and dashing_s512 work, respectively, on systems supporting SSE2, AVX2, and AVX512BW. If these don't work, you'll need to build from source.

Build

Clone this repository recursively, and use make.

git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing

If you clone without submodules, this can be corrected by git submodule update --init --recursive.

Dashing is written in C++14, which means that it requires a relatively new compiler. Dashing is tested under gcc{5.4-9}, but fails for gcc4, which is installed by default on many machines. For OSX, we recommend using Homebrew to install gcc-8. On Linux, we recommend package managers. (For instance, our Travis-CI Ubuntu example upgrades to a sufficiently new GCC using sudo update-alternatives.)

Usage

To see all usage options, use ./dashing <subcommand>, for subcommand in [sketch, dist, hll, union, printmat]. Of most interest is probably the dist command, which can take either genomes or pre-built sketches as arguments.

dist

For the simplest case of unspaced, unminimized kmers for a set of genomes with k = 31 and 13 threads:

dashing dist -k31 -p13 -Odistance_matrix.txt -osize_estimates.txt genome1.fna.gz genome2.fna genome3.fasta <...>

The genomes can be omitted as positional arguments if -F genome_paths.txt is provided, where genome_paths.txt is a file containing a path to a genome per line. This can avoid system limits on the number of arguments in a shell command.

These can be cached with -c, which saves the sketches for later use. These sketch filenames are based on spacing, kmer size, and sketch size, so there is no risk of overwriting each other.

dist (asymmetric mode)

dashing dist performs all pairwise jaccard index estimates by default. By providing the -Q flag, dashing performs a core comparison operation between all queries and all references, where references are provided by -F.

This is necessary to provide containment.

For example:

dashing dist --containment-index -k21 -Odistmat.txt -ofsizes.txt -Q query_paths.txt -F ref_paths.txt

To generate a full, asymmetric distance matrix, provide the same path to -F and -Q.

sketch

The sketch command largely mirrors dist, except that only sketches are computed.

dashing sketch -k31 -p13 -F genome_paths.txt

hll

The hll command simply estimates the number of unique elements in a set of files. This can be useful for estimating downstream database sizes based on spacing schemes.

union

The union command takes a set of pre-sketched HLLs and performs unions between them. Currently, the sketches must be of the same size. We may modify this in future releases to allow a merger of different sizes by flattening larger sketches to the smallest sketch size discovered. This would involve a loss of precision from the larger models. This currently doesn't support data structures besides HLLs, but we plan to make this change at a later date.

Features

Transparently consumes uncompressed, zlib- or zstd-compressed files.

Caching of sketches to disk (in compressed form)

Calculation of a variety of (dis)similarity measures:

  1. Jaccard Similarity
  2. Mash distance
  3. Containment index
  4. Containment distance (log transformed containment index)
  5. Symmetric Containment Index (\frac{|A \bigcap B|}{\min{|A|,|B|}}) (The maximum of each containment index)
  6. Symmetric Continament Distance (log transformed SCI)
  7. Intersection size

Additionally, supports all the above under the weighted/multiset Jaccard index via labeled w-shingling. (See Broder, 1997 "On the Resemblance and Containment of Documents" for more details.)

Filtering

Filtering of of rare k-mer events via count-min sketch point query estimates. This is primarily desirable for raw sequencing datasets rather than genome assemblies. This is enabled with -y/--countmin, and the number of hashes (--nhashes), sketch size (--cm-sketch-size) and min count --min-count can all be controlled by command-line parameters.

Encoding options

  1. Exact k-mer encoding (k <= 32)
  2. Rolling hashing encoding for any k
  3. Spaced seed encoding (Hamming weight <= 32)
  4. Windowed/minimized k-mers

See the bonsai for more details on encoding.

Output formats

Dashing defaults to upper triangular TSV matrix emission, but it also suppurts upper triangular PHYLIP format, packed binary encoding, and top k-nearest-neighbor emission formats.

This is supported for all symmetric measures (Mash distance, Jaccard, Intersection size, and their multiset equivalents), whereas asymmetric measures and nearest neighbor forms (all variations of containment) have two emission options: tabular and binary.

Alternative Data Structures

Dashing supports comparisons with a variety of data structures, which have speed and accuracy tradeoffs for given situations. By default, HyperLogLog sketches are used, while b-bit minhashing, bottom-k minhashing, bloom filters, and hash sets are supported. Using hash sets provides a ground truth at the expense of greatly increased runtime costs.

b-bit minhashing:             --use-bb-minhash

bottom-k minhashing:          --use-range-minhash

weighted bottom-k minhashing: --use-counting-range-minhash

SuperMinHash:                 --use-super-minhash

hash sets:                    --use-full-khash-sets

bloom filters:                --use-bloom-filter

Wide HLL:                     --use-wide-hll

References: SuperMinHash, modified. (Use 32-bit register instead of float between 0 and 1 to make use of more information.) Bloom Filter Jaccard Index

To Cite:

@Article{pmid31801633,
   Author="Baker, D. N.  and Langmead, B. ",
   Title="{{D}ashing: fast and accurate genomic distances with {H}yper{L}og{L}og}",
   Journal="Genome Biol.",
   Year="2019",
   Volume="20",
   Number="1",
   Pages="265",
   Month="12"
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].