Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → lh3 → Kmer Cnt

lh3 / Kmer Cnt

Licence: mit

Code examples of fast and simple k-mer counters for tutorial purposes

Labels

bioinformatics genomics

Projects that are alternatives of or similar to Kmer Cnt

Cooler

A cool place to store your Hi-C

Stars: ✭ 112 (-9.68%)

Mutual labels: bioinformatics, genomics

Circlator

A tool to circularize genome assemblies

Stars: ✭ 121 (-2.42%)

Mutual labels: bioinformatics, genomics

Sibeliaz

A fast whole-genome aligner based on de Bruijn graphs

Stars: ✭ 76 (-38.71%)

Mutual labels: bioinformatics, genomics

Bgt

Flexible genotype query among 30,000+ samples whole-genome

Stars: ✭ 72 (-41.94%)

Mutual labels: bioinformatics, genomics

Ngless

NGLess: NGS with less work

Stars: ✭ 115 (-7.26%)

Mutual labels: bioinformatics, genomics

Fastq.bio

An interactive web tool for quality control of DNA sequencing data

Stars: ✭ 76 (-38.71%)

Mutual labels: bioinformatics, genomics

Awesome 10x Genomics

List of tools and resources related to the 10x Genomics GEMCode/Chromium system

Stars: ✭ 82 (-33.87%)

Mutual labels: bioinformatics, genomics

Dram

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes

Stars: ✭ 47 (-62.1%)

Mutual labels: bioinformatics, genomics

Gcp For Bioinformatics

GCP Essentials for Bioinformatics Researchers

Stars: ✭ 95 (-23.39%)

Mutual labels: bioinformatics, genomics

Bio

Bioinformatics library for .NET

Stars: ✭ 90 (-27.42%)

Mutual labels: bioinformatics, genomics

Cgranges

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)

Stars: ✭ 111 (-10.48%)

Mutual labels: bioinformatics, genomics

Smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data

Stars: ✭ 98 (-20.97%)

Mutual labels: bioinformatics, genomics

Gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins

Stars: ✭ 67 (-45.97%)

Mutual labels: bioinformatics, genomics

Hicexplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.

Stars: ✭ 116 (-6.45%)

Mutual labels: bioinformatics, genomics

Dna Nn

Model and predict short DNA sequence features with neural networks

Stars: ✭ 59 (-52.42%)

Mutual labels: bioinformatics, genomics

Svtyper

Bayesian genotyper for structural variants

Stars: ✭ 79 (-36.29%)

Mutual labels: bioinformatics, genomics

Gatk

Official code repository for GATK versions 4 and up

Stars: ✭ 1,002 (+708.06%)

Mutual labels: bioinformatics, genomics

Sns

Analysis pipelines for sequencing data

Stars: ✭ 43 (-65.32%)

Mutual labels: bioinformatics, genomics

Genomicsqlite

Genomics Extension for SQLite

Stars: ✭ 90 (-27.42%)

Mutual labels: bioinformatics, genomics

Ariba

Antimicrobial Resistance Identification By Assembly

Stars: ✭ 96 (-22.58%)

Mutual labels: bioinformatics, genomics

View All Similar Projects ➔

Getting Started

git clone https://github.com/lh3/kmer-cnt
cd kmer-cnt
make  # C++11 required to compile the two C++ implementations
wget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz
./yak-count M_abscessus_HiSeq_10M.fa.gz > kc-c4.out

Introduction

K-mer counting is the foundation of many mappers, assemblers and miscellaneous tools (e.g. genotypers, metagenomics profilers, etc). It is one of the most important classes of algorithms in Bioinformatics. Here we will implement basic k-mer counting algorithms but with advanced engineering tricks. We will see how far better engineering can go.

In this repo, each {kc,yak}-*.* file implements a standalone k-mer counter. As to other files: ketopt.h is a command line option parser; khashl.h is a generic hash table library in C; kseq.h is a fasta/fastq parser; kthread.{h,c} provides two multi-threading models; robin_hood.h is a C++11 hash table library.

Results

We provide eight k-mer counters, which are detailed below the result table. All implementations count canonical k-mers, the lexicographically smaller k-mer between the k-mers on the two DNA strands.

The following table shows the timing and peak memory of different implementations for counting 31-mers from 2.5 million pairs of 100bp reads sampled from the HiSeq M. abscessus 6G-0125-R dateset in GAGE-B. They were run on a Linux server equipped with two EPYC 7301 CPUs and 512GB RAM.

Implementation	Limitation	Elapsed time (s)	CPU time (s)	Peak RAM (GB)
kc-py1 + Python3.7		499.6	499.5	8.15
kc-py1 + Pypy7.3		1220.8	1220.8	12.21
kc-cpp1		528.0	527.9	8.27
kc-cpp2		319.6	319.6	6.90
kc-c1	<=32-mer	39.3	38.3	1.52
kc-c2	<=32-mer; <1024 count	38.7	37.9	1.05
kc-c3	<=32-mer; <1024 count	34.1	38.7	1.15
kc-c4 (2+4 threads)	<=32-mer; <1024 count	7.5	35.1	1.27
yak-count (2+4; >=2 count)	<=32-mer; <1024 count	14.6	54.8	0.47
jellyfish2 (16 threads)		10.8	163.9	0.82
KMC3 (16 thr; in-mem)		9.2	36.2	5.02

Discussions

kc-py1.py is a basic Python3 implementation. It uses string translate for fast complementary. Interestingly, pypy is much slower than python3. Perhaps the official python3 comes with a better hash table implementation. Just a guess. I often recommend pypy over python. I need to be more careful about this recommendation in future.
kc-cpp1.cpp implements a basic counter in C++11 using STL's unordered_map. It is slower than python3. This is partly because STL's hash table implementation is very inefficient. C++ does not necessarily lead to a fast implementation.
kc-cpp2.cpp replaces std::unordered_map with Martin Ankerl's robin_hood hash table library, which is among the fastest hash table implementations. It is now faster than kc-py1.py, though the performance gap is small.
kc-c1.c packs k-mers no longer than 32bp into 64-bit integers. This dramatically improves speed and reduces the peak memory. Most practical k-mer counters employs bit packing. Excluding library files, this counter has less than 100 coding lines, not much more complex than the C++ or the python implementations.
kc-c2.c uses an ensemble of hash tables to save 8 bits for counter. This reduces the peak memory. The key advantage of using multiple hash tables is to implement multithreading. See below.
kc-c3.c puts file reading and parsing into a separate thread. The performance improvement is minor here, but it sets the stage for the next multi-threaded implementation.
kc-c4.c is the fastest counter in this series. Due to the use of an ensembl of hash tables in kc-c2, we can parallelize the insertion of a batch of k-mers. It is much faster than the previous versions. Notably, kc-c4 also uses less CPU time. This is probably because batching helps data locality.
yak-count.c is adapted from yak and uses the same kc-c4 algorithm. Similar to BFCounter, it optionally adds a bloom filter to filter out most singleton k-mers (k-mers occurring only once in the input). Yak needs to update the bloom filter, read the input twice and count twice. It is slower but uses less memory. Yak-count is the most complex example in this repo, but it is still short. Its code is also better organized. Command line: -b30 (bloom filter with 1 billion bits).
jellyfish2 is probably the fastest in-memory k-mer counter to date. It uses less memory and more flexible than kc-c4, but it is slower and much more complex. Command line: count -m 31 -C -s 100000000 -o /dev/null -t 16.
KMC3 is one of the fastest k-mer counters. It uses minimizers and relies on sorting. KMC3 is run in the in-memory mode here. The disk mode is as fast. KMC3 is optimized for counting much larger datasets. Although it uses more RAM here, it generally uses less RAM than jellyfish2 and other in-memory counters given high-coverage human data. Command line: -k31 -t16 -r -fa.

Conclusions

The k-mer counters here are fairly basic implementations only using generic hash tables. Nonetheless, we show better engineering can carry the basic idea a long way. If you want to implement your own k-mer counter, yak-count.c could be a good starting point. It is fast and relatively simple. By the way, if you have an efficient and simple k-mer counter (implemented in a few files), please let me know. I will be happy to add it to the table.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 124

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (2) 🔗