All Projects → bcgsc → ntHash

bcgsc / ntHash

Licence: MIT license
Fast hash function for DNA sequences

Programming Languages

C++
36643 projects - #6 most used programming language
c
50402 projects - #5 most used programming language
M4
1887 projects

Projects that are alternatives of or similar to ntHash

nthash
ntHash implementation in Rust
Stars: ✭ 26 (-60.61%)
Mutual labels:  genomics, hash-algorithm, hash-methods, k-mer-hashing
Hap.py
Haplotype VCF comparison tools
Stars: ✭ 249 (+277.27%)
Mutual labels:  bioinformatics, genomics
Canvasxpress
JavaScript VisualizationTools
Stars: ✭ 247 (+274.24%)
Mutual labels:  bioinformatics, genomics
faster lmm d
A faster lmm for GWAS. Supports GPU backend.
Stars: ✭ 12 (-81.82%)
Mutual labels:  bioinformatics, genomics
Bowtie
An ultrafast memory-efficient short read aligner
Stars: ✭ 221 (+234.85%)
Mutual labels:  bioinformatics, genomics
Cyvcf2
cython + htslib == fast VCF and BCF processing
Stars: ✭ 243 (+268.18%)
Mutual labels:  bioinformatics, genomics
jgi-query
A simple command-line tool to download data from Joint Genome Institute databases
Stars: ✭ 38 (-42.42%)
Mutual labels:  bioinformatics, genomics
Minigraph
Proof-of-concept seq-to-graph mapper and graph generator
Stars: ✭ 206 (+212.12%)
Mutual labels:  bioinformatics, genomics
wgs2ncbi
Toolkit for preparing genomes for submission to NCBI
Stars: ✭ 25 (-62.12%)
Mutual labels:  bioinformatics, genomics
simplesam
Simple pure Python SAM parser and objects for working with SAM records
Stars: ✭ 50 (-24.24%)
Mutual labels:  bioinformatics, genomics
unimap
A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
Stars: ✭ 76 (+15.15%)
Mutual labels:  bioinformatics, genomics
Abyss
🔬 Assemble large genomes using short reads
Stars: ✭ 219 (+231.82%)
Mutual labels:  bioinformatics, bloom-filter
Miniasm
Ultrafast de novo assembly for long noisy reads (though having no consensus step)
Stars: ✭ 216 (+227.27%)
Mutual labels:  bioinformatics, genomics
Biopython
Official git repository for Biopython (originally converted from CVS)
Stars: ✭ 2,936 (+4348.48%)
Mutual labels:  bioinformatics, genomics
Bedops
🔬 BEDOPS: high-performance genomic feature operations
Stars: ✭ 215 (+225.76%)
Mutual labels:  bioinformatics, genomics
komihash
Very fast, high-quality hash function (non-cryptographic, C) + PRNG
Stars: ✭ 68 (+3.03%)
Mutual labels:  bloom-filter, hash
Intermine
A powerful open source data warehouse system
Stars: ✭ 195 (+195.45%)
Mutual labels:  bioinformatics, genomics
Sequenceserver
Intuitive local web frontend for the BLAST bioinformatics tool
Stars: ✭ 198 (+200%)
Mutual labels:  bioinformatics, genomics
Scaff10X
Pipeline for scaffolding and breaking a genome assembly using 10x genomics linked-reads
Stars: ✭ 21 (-68.18%)
Mutual labels:  bioinformatics, genomics
dysgu
dysgu-SV is a collection of tools for calling structural variants using short or long reads
Stars: ✭ 47 (-28.79%)
Mutual labels:  bioinformatics, genomics

Release Downloads Issues

Logo

ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.

Build the test suite

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

To install nttest in a specified directory:

$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install

The nttest suite has the options for runtime and uniformity tests.

Runtime test

For the runtime test the program has the following options:

nttest [OPTIONS] ... [FILE]

Parameters:

  • -k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
  • -h, --hash=SIZE: the number of generated hashes for each k-mer [1]
  • FILE: is the input fasta or fastq file

For example to evaluate the runtime of different hash methods on the test file reads.fa in DATA/ folder for k-mer length 50, run:

$ nttest -k50 reads.fa 

Uniformity test

For the uniformity test using the Bloom filter data structure the program has the following options:

nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:

  • -q, --qnum=SIZE: number of queries in query file
  • -l, --qlen=SIZE: length of reads in query file
  • -t, --tnum=SIZE: number of sequences in reference file
  • -g, --tlen=SIZE: length of reference sequence
  • -i, --input: generate random query and reference files
  • -j, threads=SIZE: number of threads to run uniformity test [1]
  • REF_FILE: the reference file name
  • QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with following options:

  • 100 genes of length 5,000,000bp as reference in file genes.fa
  • 4,000,000 reads of length 250bp as query in file reads.fa
  • 12 threads

run:

$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa 

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal=0;
    hVal = NTF64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }

To canonical hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }

To multi-hash with h hash values all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }

ntHashIterator

Enables ntHash on sequences

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

ntHashIterator itr(seq, h, k);			
while (itr != itr.end()) 
{
 ... use *itr ...
 ++itr;
}

Usage example (C++)

Outputing hash values of all k-mers in a sequence

#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
	/* test sequence */
	std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";
	
	/* k is the k-mer length */
	unsigned k = 70;
	
	/* h is the number of hashes for each k-mer */
	unsigned h = 1;

	/* init ntHash state and compute hash values for first k-mer */
	ntHashIterator itr(seq, h, k);
	while (itr != itr.end()) {
		std::cout << (*itr)[0] << std::endl;
		++itr;
	}

	return 0;
}

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397

acknowledgements

This projects uses:

  • CATCH unit test framework for C/C++
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].