edawson / mkmh

Licence: MIT license
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.

mkmh

Make kmers, minimizers, hashes, and MinHash sketches (with multiple k), and compare them.

Usage

To use mkmh functions in your code:

  1. Include the header file in your code
    #include "mkmh.hpp"
  2. Compile the library:
    cd mkmh && make lib
  3. Make sure the library and header are on the linker and include search paths (e.g. in your makefile):
    g++ -o my_code my_code.cpp -L/path/to/mkmh -I/path/to/mkmh -lmkmh
  4. That's it!

Available functionality

Convenience functions:
- Reverse complement a string
- Reverse a string
- Capitalize the characters of a string
- Check if a string contains only canonical DNA letters ("A", "a", "C", "c", "T", "t", "G", "g")
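
To make the convenience functions above concrete, here is a minimal, self-contained sketch of what such helpers could look like. The function names and the choice to map non-canonical characters to 'N' are assumptions made for illustration; they are not taken from mkmh.hpp, so check the header for the real names and behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper names; this is not mkmh's API.

// Return a reversed copy of s.
std::string reverse_seq(const std::string& s) {
    return std::string(s.rbegin(), s.rend());
}

// Return an upper-case copy of s.
std::string to_upper_seq(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// True if every character is A, C, G or T in either case.
// An empty string is reported as canonical.
bool is_canonical_dna(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](unsigned char c) {
        switch (std::toupper(c)) {
            case 'A': case 'C': case 'G': case 'T': return true;
            default: return false;
        }
    });
}

// Complement one base, preserving case.
// Assumption: anything that is not A/C/G/T becomes 'N'.
char complement_base(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'a': return 't';
        case 't': return 'a';
        case 'c': return 'g';
        case 'g': return 'c';
        default:  return 'N';
    }
}

// Reverse the string, then complement each base.
std::string reverse_complement_seq(const std::string& s) {
    std::string out(s.rbegin(), s.rend());
    std::transform(out.begin(), out.end(), out.begin(), complement_base);
    return out;
}
```

With these definitions, reverse_complement_seq("AACG") yields "CGTT", and is_canonical_dna("ACNT") is false because of the 'N'.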

Substrings and transforms:
- Get the forward shingles of a string
- Get the kmers of size k of a string
- For multiple values of k, get the kmers of a string for every k
- Get the (w, k) minimizers of a string
- Calculate the 64-bit hashes of the kmers of a string (with either single or multiple k values)
- Get the MinHash sketch of a string (from either single or multiple k values), using either the top s hashes or the bottom s hashes.
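
The transforms listed above can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not mkmh's implementation: all names are hypothetical, minimizers are picked by lexicographic order with the leftmost kmer winning ties, and FNV-1a stands in for whichever 64-bit hash the library actually uses. Only the bottom-s sketch is shown; a top-s variant would keep the largest hashes instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not mkmh's API.

// All overlapping substrings of length k, in order of position.
std::vector<std::string> kmers_of(const std::string& seq, std::size_t k) {
    std::vector<std::string> out;
    if (k == 0) return out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        out.push_back(seq.substr(i, k));
    }
    return out;
}

// A chosen kmer and its start position in the sequence.
struct Minimizer {
    std::size_t pos;
    std::string kmer;
};

// (w, k) minimizers: for each window of w consecutive kmers, keep the
// lexicographically smallest one (leftmost on ties). A kmer selected by
// several consecutive windows is reported once.
std::vector<Minimizer> minimizers_of(const std::string& seq,
                                     std::size_t w, std::size_t k) {
    std::vector<Minimizer> out;
    const std::vector<std::string> km = kmers_of(seq, k);
    if (w == 0 || km.size() < w) return out;
    for (std::size_t start = 0; start + w <= km.size(); ++start) {
        std::size_t best = start;
        for (std::size_t j = start + 1; j < start + w; ++j) {
            if (km[j] < km[best]) best = j;
        }
        if (out.empty() || out.back().pos != best) {
            out.push_back({best, km[best]});
        }
    }
    return out;
}

// 64-bit FNV-1a, used here only as a stand-in hash function.
std::uint64_t fnv1a_64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Hashes of the kmers of seq for every k in ks (one or several values).
std::vector<std::uint64_t> hashes_of(const std::string& seq,
                                     const std::vector<std::size_t>& ks) {
    std::vector<std::uint64_t> out;
    for (std::size_t k : ks) {
        for (const std::string& km : kmers_of(seq, k)) {
            out.push_back(fnv1a_64(km));
        }
    }
    return out;
}

// Bottom-s MinHash sketch: the s smallest distinct hashes, sorted ascending.
std::vector<std::uint64_t> bottom_sketch(const std::string& seq,
                                         const std::vector<std::size_t>& ks,
                                         std::size_t s) {
    std::vector<std::uint64_t> h = hashes_of(seq, ks);
    std::sort(h.begin(), h.end());
    h.erase(std::unique(h.begin(), h.end()), h.end());
    if (h.size() > s) h.resize(s);
    return h;
}
```

For "GATTACA" with k = 3 the kmers are GAT, ATT, TTA, TAC and ACA, and with w = 3 the sketch above selects ATT (position 1) and ACA (position 4) as minimizers.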

Compare sets of shingles / kmers / minimizers / hashes:
- Take the union of two sets of kmers or hashes.
- Take the intersection of two sets of kmers or hashes.
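
As a rough illustration of the set comparisons above, the sketch below takes the union and intersection of two hash sets stored as sorted, duplicate-free vectors, using the standard library's set algorithms. The type alias and function names are hypothetical and the storage choice is an assumption; mkmh may represent its sets differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical names; this is not mkmh's API.
// Assumption: a set of hashes is a sorted vector with no duplicates.
using SortedHashes = std::vector<std::uint64_t>;

// Sort and deduplicate so the std::set_* algorithms can be applied.
SortedHashes normalize(SortedHashes v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Every hash present in a or b (inputs must be normalized).
SortedHashes hash_union(const SortedHashes& a, const SortedHashes& b) {
    SortedHashes out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// Every hash present in both a and b (inputs must be normalized).
SortedHashes hash_intersection(const SortedHashes& a, const SortedHashes& b) {
    SortedHashes out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

Keeping the vectors sorted lets both operations run in a single linear pass. The same functions work for kmers if the element type is changed to std::string.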

Fun extras:
- Given a string and a set of query strings, sort the queries in order of percent similarity.
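
One way to picture this ranking is sketched below. The text does not say how percent similarity is defined, so this example assumes it is the Jaccard index of the two exact kmer sets, scaled to 0-100, with the most similar query first. All names are hypothetical and the real mkmh function may compute similarity differently (for instance, from MinHash sketches).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; this is not mkmh's API.
using Scored = std::pair<std::string, double>;  // (query, percent similarity)

// The distinct kmers of length k in seq.
std::set<std::string> kmer_set(const std::string& seq, std::size_t k) {
    std::set<std::string> out;
    if (k == 0) return out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        out.insert(seq.substr(i, k));
    }
    return out;
}

// Assumed definition: 100 * |A intersect B| / |A union B|.
double percent_similarity(const std::set<std::string>& a,
                          const std::set<std::string>& b) {
    std::vector<std::string> shared;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(shared));
    const std::size_t uni = a.size() + b.size() - shared.size();
    return uni == 0 ? 0.0 : 100.0 * shared.size() / uni;
}

// Score every query against ref and return them most similar first.
// stable_sort keeps the input order for queries with equal scores.
std::vector<Scored> rank_queries(const std::string& ref,
                                 const std::vector<std::string>& queries,
                                 std::size_t k) {
    const std::set<std::string> ref_kmers = kmer_set(ref, k);
    std::vector<Scored> ranked;
    for (const std::string& q : queries) {
        ranked.emplace_back(q, percent_similarity(ref_kmers, kmer_set(q, k)));
    }
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const Scored& x, const Scored& y) {
                         return x.second > y.second;
                     });
    return ranked;
}
```

Ranking the queries "CCCCCC", "GATTAGA" and "GATTACA" against the reference "GATTACA" with k = 3 places "GATTACA" first at 100%, "GATTAGA" second (3 shared kmers out of 7 distinct), and "CCCCCC" last at 0%.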

Getting help

Please reach out through GitHub by posting an issue (even if it's just feedback). Email is acceptable as a secondary medium.