ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.

Build the test suite

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

To install nttest in a specified directory:

$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install

The nttest suite has the options for runtime and uniformity tests.

Runtime test

For the runtime test the program has the following options:

nttest [OPTIONS] ... [FILE]

Parameters:

-k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
-h, --hash=SIZE: the number of generated hashes for each k-mer [1]
FILE: is the input fasta or fastq file

For example to evaluate the runtime of different hash methods on the test file reads.fa in DATA/ folder for k-mer length 50, run:

$ nttest -k50 reads.fa

Uniformity test

For the uniformity test using the Bloom filter data structure the program has the following options:

nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:

-q, --qnum=SIZE: number of queries in query file
-l, --qlen=SIZE: length of reads in query file
-t, --tnum=SIZE: number of sequences in reference file
-g, --tlen=SIZE: length of reference sequence
-i, --input: generate random query and reference files
-j, threads=SIZE: number of threads to run uniformity test [1]
REF_FILE: the reference file name
QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with following options:

100 genes of length 5,000,000bp as reference in file genes.fa
4,000,000 reads of length 250bp as query in file reads.fa
12 threads

run:

$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal=0;
    hVal = NTF64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        hVal = NTF64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }

To canonical hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }

To multi-hash with h hash values all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++) 
    {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }

ntHashIterator

Enables ntHash on sequences

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

ntHashIterator itr(seq, h, k);			
while (itr != itr.end()) 
{
 ... use *itr ...
 ++itr;
}

Usage example (C++)

Outputing hash values of all k-mers in a sequence

#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
	/* test sequence */
	std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";
	
	/* k is the k-mer length */
	unsigned k = 70;
	
	/* h is the number of hashes for each k-mer */
	unsigned h = 1;

	/* init ntHash state and compute hash values for first k-mer */
	ntHashIterator itr(seq, h, k);
	while (itr != itr.end()) {
		std::cout << (*itr)[0] << std::endl;
		++itr;
	}

	return 0;
}

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397

acknowledgements

This projects uses:

CATCH unit test framework for C/C++

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

bcgsc / ntHash

Programming Languages

Labels

Projects that are alternatives of or similar to ntHash

ntHash

Build the test suite

Runtime test

Uniformity test

Code samples

ntHashIterator

Usage example (C++)

Publications

ntHash

acknowledgements