Licence: GPL-3.0 license
Kmer-db


Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).

Quick start

git clone https://github.com/refresh-bio/kmer-db
cd kmer-db && make

INPUT=./test/virus
OUTPUT=./output
mkdir -p $OUTPUT

# build a database from all 18-mers (default) contained in a set of sequences
./kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db

# establish numbers of common k-mers between new sequences and the database
./kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv

# calculate jaccard index from common k-mers
./kmer-db distance $OUTPUT/n2a.csv

# extend the database with new sequences
./kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db

# establish numbers of common k-mers between all sequences in the database
./kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv

# build a database from 10% of 25-mers using 16 threads
./kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db

# establish numbers of common 25-mers between a single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done automatically prior to the comparison)
./kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv

Table of contents

  1. Installation
  2. Usage
    1. Building a database
    2. Counting common k-mers
    3. Calculating similarities or distances
    4. Storing minhashed k-mers
  3. Datasets

1. Installation

Kmer-db comes with a set of precompiled binaries for Linux, OS X, and Windows. The software is also available on Bioconda:

conda install -c bioconda kmer-db

For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. Kmer-db can also be built from the sources distributed as:

  • MAKE project (G++ 5.5.0 tested) for Linux and OS X,
  • Visual Studio 2015 solution for Windows.

zlib linking

Kmer-db uses zlib for handling gzipped inputs. Under Linux, the software is by default linked against the system-installed zlib. Due to issues with some library versions, a precompiled zlib is also present in the repository. To use it, modify the INTERNAL_ZLIB variable at the top of the makefile. Under Windows, the repository library is always used.

AVX and AVX2 support

Kmer-db, by default, takes advantage of AVX (required) and AVX2 (optional) CPU extensions. The pre-built binary determines the supported instructions at runtime, which makes it multiplatform. When compiling the sources under Linux and OS X, AVX2 support is also established automatically. Under Windows, the program is by default built with AVX2 instructions. To prevent this, Kmer-db must be compiled with the NO_AVX2 symbolic constant defined.

2. Usage

kmer-db <mode> [options] <positional arguments>

Kmer-db operates in one of the following modes:

  • build - building a database from samples,
  • all2all - counting common k-mers - all samples in the database,
  • new2all - counting common k-mers - set of new samples versus database,
  • one2all - counting common k-mers - single sample versus database,
  • distance - calculating similarities/distances,
  • minhash - storing minhashed k-mers.

Common options:

  • -t <threads> - number of threads (default: number of available cores).

The meaning of other options and positional arguments depends on the selected mode.

2.1. Building a database

Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:

  • compressed or uncompressed genomes/reads:

    kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-t <threads>] <sample_list> <database>

  • KMC-generated k-mers:

    kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <sample_list> <database>

  • minhashed k-mers produced by minhash mode:

    kmer-db build -from-minhash [-extend] [-t <threads>] <sample_list> <database>

Parameters:

  • sample_list (input) - file containing list of samples in the following format:
    sample_file_1
    sample_file_2
    sample_file_3
    ...
    
    By default, the tool requires uncompressed or compressed FASTA files for each sample. If a file on the list cannot be found, the package tries adding the following extensions: fna, fasta, gz, fna.gz, fasta.gz. When the -from-kmers switch is specified, the corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
  • database (output) - file with generated k-mer database.
  • -k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers or -from-minhash switch is specified.
  • -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when -from-minhash switch is present.
  • -multisample-fasta - each sequence in a FASTA file is treated as a separate sample,
  • -extend - extend the existing database with new samples,
  • -t <threads> - number of threads (default: number of available cores).
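
The sample list handling described above can be illustrated with a short Python sketch. It is not taken from the Kmer-db sources: the function names, the base_dir parameter, the order in which extensions are tried, and the way an extension is joined to the name (with a dot) are all assumptions made for illustration.

```python
from pathlib import Path

# Extensions named in the text. Trying them in this order and joining
# them with a dot are assumptions, not documented behaviour.
FALLBACK_EXTENSIONS = ("fna", "fasta", "gz", "fna.gz", "fasta.gz")


def read_sample_list(list_path):
    """Read a sample list: one sample file per line, blank lines skipped."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def resolve_sample(name, base_dir="."):
    """Return the first existing file for a sample-list entry, or None.

    base_dir is a hypothetical parameter: how the real tool resolves
    relative paths is not stated in the text.
    """
    candidate = Path(base_dir) / name
    if candidate.is_file():
        return candidate
    for ext in FALLBACK_EXTENSIONS:
        with_ext = candidate.with_name(candidate.name + "." + ext)
        if with_ext.is_file():
            return with_ext
    return None
```

Running resolve_sample over every entry returned by read_sample_list is a cheap way to spot missing inputs before starting a long build.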

2.2. Counting common k-mers

Samples in the database against each other:

kmer-db all2all [-buffer <size_mb>] [-sparse] [-t <threads>] <database> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode,
  • common_table (output) - file containing table with common k-mer counts.
  • -buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance (default: 8),
  • -sparse - stores output matrix in a sparse form,
  • -above <a_th> - retains elements larger than <a_th>,
  • -below <b_th> - retains elements smaller than <b_th>,
  • -t <threads> - number of threads (default: number of available cores).

New samples against the database:

kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-sparse] [-t <threads>] <database> <sample_list> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode.
  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database.
  • common_table (output) - file containing table with common k-mer counts.
  • -multisample-fasta / -from-kmers / -from-minhash - see build mode for details.
  • -sparse - stores output matrix in a sparse form,
  • -above <a_th> - retains elements larger than <a_th>,
  • -below <b_th> - retains elements smaller than <b_th>,
  • -t <threads> - number of threads (default: number of available cores).

Single sample against the database:

kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>

The meaning of the parameters is the same as in new2all mode, but instead of a file with a sample list, a single sample file is used as the query.
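
To make the notion of "common k-mers" concrete, here is a minimal Python sketch of the quantity these modes report for a pair of samples. It is only an illustration: it builds plain sets of forward-strand substrings, whereas the text does not say how Kmer-db treats reverse complements or non-ACGT symbols, and the function names are hypothetical.

```python
def kmer_set(sequence, k=18):
    """Return the set of distinct k-mers of a sequence (forward strand only)."""
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def common_kmers(seq_a, seq_b, k=18):
    """Count k-mers present in both sequences, i.e. |a ∩ b|."""
    return len(kmer_set(seq_a, k) & kmer_set(seq_b, k))
```

The real tool obtains the same counts from its database without comparing samples pair by pair in this naive way.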

Output format

Modes all2all, new2all, and one2all produce a comma-separated table with numbers of common k-mers. The table is by default stored in a dense form:

kmer-length: k fraction: f   db-samples    s1          s2          ...   sn
query-samples                total-kmers   |s1|        |s2|        ...   |sn|
q1                           |q1|          |q1 ∩ s1|   |q1 ∩ s2|   ...   |q1 ∩ sn|
q2                           |q2|          |q2 ∩ s1|   |q2 ∩ s2|   ...   |q2 ∩ sn|
...                          ...           ...         ...         ...   ...
qm                           |qm|          |qm ∩ s1|   |qm ∩ s2|   ...   |qm ∩ sn|

where:

  • k - k-mer length,
  • f - minhash fraction (1, when minhashing is disabled),
  • s1, s2, ..., sn - database sample names,
  • q1, q2, ..., qm - query sample names,
  • |a| - number of k-mers in sample a,
  • |a ∩ b| - number of k-mers common for samples a and b.

For performance reasons, all2all mode produces a lower triangular matrix.
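
A dense table of this shape can be loaded with a few lines of Python. The sketch below rests on assumptions that should be checked against a real output file: that the first header cell carries both "kmer-length: k" and "fraction: f", that cells may be padded with spaces or followed by a trailing comma, and that an empty cell means zero. The function and key names are hypothetical.

```python
import re


def _cells(line):
    """Split a comma-separated line, trimming padding and trailing empties."""
    cells = [cell.strip() for cell in line.split(",")]
    while cells and cells[-1] == "":
        cells.pop()
    return cells


def parse_dense_table(text):
    """Parse a dense common k-mer table into plain dictionaries."""
    lines = [_cells(line) for line in text.splitlines() if line.strip()]
    header, totals, body = lines[0], lines[1], lines[2:]

    match = re.search(r"kmer-length:\s*(\d+)\s+fraction:\s*([0-9.eE+-]+)", header[0])
    if match is None:
        raise ValueError("unexpected header cell: %r" % header[0])

    db_names = header[2:]  # skip the k/fraction cell and 'db-samples'
    db_sizes = dict(zip(db_names, (int(c) for c in totals[2:])))

    queries = {}
    for row in body:
        counts = [int(c or 0) for c in row[2:]]  # empty cell treated as 0 (assumption)
        queries[row[0]] = {"size": int(row[1]), "common": dict(zip(db_names, counts))}

    return {
        "k": int(match.group(1)),
        "fraction": float(match.group(2)),
        "db_sizes": db_sizes,
        "queries": queries,
    }
```

For all2all output, only pairs in the lower triangle carry counts, so a lookup for a pair should try both orders of the two sample names.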

When the -sparse switch is specified, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (column_id: value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:

kmer-length: k fraction: f   db-samples    s1                 s2                 ...                sn
query-samples                total-kmers   |s1|               |s2|               ...                |sn|
q1                           |q1|          i11: |q1 ∩ si11|   i12: |q1 ∩ si12|
q2                           |q2|          i21: |q2 ∩ si21|   i22: |q2 ∩ si22|   i23: |q2 ∩ si23|
q3                           |q3|
...                          ...           ...
qm                           |qm|          im1: |qm ∩ sim1|
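
A sparse row can be expanded back to per-sample counts with a short Python sketch like the one below. It assumes comma-separated cells and "column_id: value" pairs exactly as described above; the function names are hypothetical and the details should be verified against an actual output file.

```python
def parse_sparse_row(line, db_names):
    """Parse one sparse row into (query name, k-mer count, {db sample: common})."""
    cells = [cell.strip() for cell in line.split(",") if cell.strip()]
    name, size = cells[0], int(cells[1])
    common = {}
    for cell in cells[2:]:
        column_id, value = cell.split(":")
        # column ids are 1-based, Python lists are 0-based
        common[db_names[int(column_id) - 1]] = int(value)
    return name, size, common


def to_dense(common, db_names):
    """Expand a sparse mapping to a full row, restoring the omitted zeros."""
    return [common.get(sample, 0) for sample in db_names]
```

A row with no pairs, such as a query sharing no k-mers with any database sample, simply yields an empty mapping.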

2.3. Calculating similarities or distances

kmer-db distance [<measures>] [-sparse [-above <a_th>] [-below <b_th>]] <common_table>

Parameters:

  • common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode (both dense and sparse matrices are supported).
  • measures - names of the similarity/distance measures to be calculated, can be one or several of the following (if not specified, jaccard is used):
    • jaccard: $J(q,s) = |q \cap s| / |q \cup s|$,
    • min: $\min(q,s) = |q \cap s| / \min(|q|,|s|)$,
    • max: $\max(q,s) = |q \cap s| / \max(|q|,|s|)$,
    • cosine: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
    • mash (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k}\ln\frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
    • ani (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(q,s)$,
  • -phylip-out - stores the output distance matrix in Phylip format,
  • -sparse - outputs a sparse matrix (independently of the input matrix format),
  • -above <a_th> - retains elements larger than <a_th>,
  • -below <b_th> - retains elements smaller than <b_th>.

This mode generates a file with a similarity/distance table for each selected measure. The output file name is formed by appending the measure name as an extension to the input file name.
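
The measures listed above can be reproduced from a single table entry. The following Python sketch computes all six from |q|, |s|, |q ∩ s| and k, using |q ∪ s| = |q| + |s| - |q ∩ s|. It is an independent illustration of the formulas, not the tool's code: the function name is hypothetical, and returning 0 for empty samples and an infinite Mash distance for J = 0 are choices made here.

```python
import math


def similarity_measures(q_size, s_size, common, k):
    """Compute the listed measures from k-mer counts of a query/sample pair."""

    def ratio(numerator, denominator):
        # Guard against empty samples; returning 0 is a choice of this sketch.
        return numerator / denominator if denominator else 0.0

    union = q_size + s_size - common
    jaccard = ratio(common, union)

    # The Mash formula is undefined for J = 0; this sketch reports infinity.
    if jaccard > 0:
        mash = -math.log(2 * jaccard / (1 + jaccard)) / k
    else:
        mash = math.inf

    return {
        "jaccard": jaccard,
        "min": ratio(common, min(q_size, s_size)),
        "max": ratio(common, max(q_size, s_size)),
        "cosine": ratio(common, math.sqrt(q_size * s_size)),
        "mash": mash,
        "ani": 1 - mash,
    }


if __name__ == "__main__":
    # Two samples of 100 k-mers each sharing 50 of them, k = 18
    for name, value in similarity_measures(100, 100, 50, 18).items():
        print(name, value)
```

In the example, the union holds 150 k-mers, so J = 1/3 and the logarithm's argument is exactly 1/2, giving a Mash distance of ln(2)/18.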

2.4. Storing minhashed k-mers

This optional step stores minhashed k-mers on disk for later use by the build, new2all, or one2all modes with the -from-minhash switch. It can be skipped if one wants to use all k-mers from the samples for distance estimation or applies minhashing during database construction. Syntax:

kmer-db minhash [-k <kmer-length>] [-multisample-fasta] <fraction> <sample_list>

kmer-db minhash -from-kmers <fraction> <sample_list>

Parameters:

  • fraction (input) - fraction of all k-mers to be accepted by the minhash filter.
  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode).
  • -k <kmer-length> - length of k-mers (default: 18; maximum: 30); ignored when -from-kmers switch is specified.
  • -multisample-fasta / -from-kmers - see build mode for details.

For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
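
The idea of a filter that accepts a fixed fraction of all k-mers can be sketched in Python as a threshold on a hash value. This is only a conceptual illustration: the hash function Kmer-db uses and the layout of .minhash files are not described in the text, so BLAKE2b and the function names below are stand-ins chosen for the example.

```python
import hashlib


def accepts(kmer, fraction):
    """Accept a k-mer when its 64-bit hash falls below fraction * 2**64.

    BLAKE2b is a stand-in; the hash used by the real tool is not
    specified in the text.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def minhash_filter(kmers, fraction):
    """Return the subset of k-mers accepted by the filter."""
    return {kmer for kmer in kmers if accepts(kmer, fraction)}
```

Because the decision depends only on the k-mer itself, applying the same filter to every sample keeps shared k-mers shared, which is why queries can be filtered consistently with the database.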

3. Datasets

The list of pathogens investigated in the Kmer-db study can be found here

Citing

Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2019) Kmer-db: instant evolutionary distance estimation, Bioinformatics, 35(1): 133–136
