All Projects → lh3 → bfc

lh3 / bfc

Licence: MIT license
High-performance error correction for Illumina resequencing data

Programming Languages

TeX
3793 projects
c
50402 projects - #5 most used programming language
C++
36643 projects - #6 most used programming language
javascript
184084 projects - #8 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to bfc

Canvasxpress
JavaScript VisualizationTools
Stars: ✭ 247 (+274.24%)
Mutual labels:  genomics
MGSE
Mapping-based Genome Size Estimation (MGSE) performs an estimation of a genome size based on a read mapping to an existing genome sequence assembly.
Stars: ✭ 22 (-66.67%)
Mutual labels:  genomics
assigner
Population assignment analysis using R
Stars: ✭ 17 (-74.24%)
Mutual labels:  genomics
cljam
A DNA Sequence Alignment/Map (SAM) library for Clojure
Stars: ✭ 85 (+28.79%)
Mutual labels:  genomics
kmer-db
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
Stars: ✭ 68 (+3.03%)
Mutual labels:  genomics
cerebra
A tool for fast and accurate summarizing of variant calling format (VCF) files
Stars: ✭ 55 (-16.67%)
Mutual labels:  genomics
Cyvcf2
cython + htslib == fast VCF and BCF processing
Stars: ✭ 243 (+268.18%)
Mutual labels:  genomics
ezancestry
Easy genetic ancestry predictions in Python
Stars: ✭ 38 (-42.42%)
Mutual labels:  genomics
berokka
🍊 💫 Trim, circularise and orient long read bacterial genome assemblies
Stars: ✭ 23 (-65.15%)
Mutual labels:  genomics
sequencework
programs and scripts, mainly python, for analyses related to nucleic or protein sequences
Stars: ✭ 22 (-66.67%)
Mutual labels:  genomics
fermi
A WGS de novo assembler based on the FMD-index for large genomes
Stars: ✭ 74 (+12.12%)
Mutual labels:  genomics
Mitty
Seven Bridges Genomics aligner/caller debugging and analysis tools
Stars: ✭ 13 (-80.3%)
Mutual labels:  genomics
metaRNA
Find target sites for the miRNAs in genomic sequences
Stars: ✭ 19 (-71.21%)
Mutual labels:  genomics
Hap.py
Haplotype VCF comparison tools
Stars: ✭ 249 (+277.27%)
Mutual labels:  genomics
wgd
Python package and CLI for whole-genome duplication related analyses
Stars: ✭ 68 (+3.03%)
Mutual labels:  genomics
Biopython
Official git repository for Biopython (originally converted from CVS)
Stars: ✭ 2,936 (+4348.48%)
Mutual labels:  genomics
GenomicsDB
Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.
Stars: ✭ 77 (+16.67%)
Mutual labels:  genomics
genipe
Genome-wide imputation pipeline
Stars: ✭ 28 (-57.58%)
Mutual labels:  genomics
viGEN
viGEN - A bioinformatics pipeline for the exploration of viral RNA in human NGS data
Stars: ✭ 24 (-63.64%)
Mutual labels:  genomics
aws-genomics-workflows
Genomics Workflows on AWS
Stars: ✭ 131 (+98.48%)
Mutual labels:  genomics

Introduction

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though also performs well for small genomes.

The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in my fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). The use of bloom filter is how BFC is named, though other correctors such as Lighter and Bless actually rely more on bloom filter than BFC.

Usage

BFC can be invoked as:

bfc -s 3g -t16 reads.fq.gz | gzip -1 > corrected.fq.gz

where option -s specifies the approximate size of the genome. It is possible to use one set of reads to correct another set:

bfc -s 3g -t16 readset1.fq.gz readset2.fq.gz | gzip -1 > corrected_readset2.fq.gz

and to process data from Unix pipes ("<(command)" is bash specific):

bash -c "bfc -s 3g -t16 <(bzip2 -dc reads.fq.bz2) <(bzip2 -dc reads.fq.bz2) | gzip -1 > out.fq.gz"

BFC also offers an option to trim reads containing singleton k-mers (don't switch -s and -k as some options are ordered):

bfc -1 -s 3g -k51 -t16 corrected.fq.gz | gzip -1 > trimmed.fq.gz

This command line keeps k-mer occurring twice or more in a bloom filter (with some false positives) and identifies the longest stretch in a read that has hits in the bloom filter. K-mer trimming is about four times as fast as error correction.

BFC-KMC

An alternative implementation of the algorithm is available at the kmc branch of this repository. It uses KMC2 for k-mer counting and keeps high-occurrence k-mers in a bloom filter. BFC-KMC should be invoked as:

kmc -k55 reads.fq.gz prefix tmpdir
bfc-kmc -t16 prefix reads.fq.gz | gzip -1 > corrected.fq.gz

KMC2 source code and precompiled binaries are available at the KMC website.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].