lh3 / fermikit

Licence: other
De novo assembly based variant calling pipeline for Illumina short reads

Introduction

FermiKit is a de novo assembly-based variant calling pipeline for deep Illumina resequencing data. It assembles reads into unitigs, maps them to the reference genome, and then calls variants from the alignment with accuracy comparable to conventional mapping-based pipelines (see the evaluation in the tex directory). The assembly encodes not only SNPs and short INDELs but also long deletions, novel sequence insertions, translocations and copy numbers. It is a heavily reduced representation of the raw data: storing, distributing and analyzing assemblies is much faster and cheaper, at an acceptable loss of information.

FermiKit is not a prototype. It is a practical pipeline targeting large-scale data and has been used to process hundreds of human samples. On a modern server with 16 CPU cores, FermiKit can assemble 30-fold human reads in one day with a peak of about 85GB RAM. The subsequent mapping and variant calling take only half an hour.
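
As a rough guide to what "30-fold" means in read counts, the sketch below computes the number of reads needed for a given coverage. The helper name reads_for_coverage is hypothetical (it is not part of FermiKit), and the 3 Gbp genome size and 150 bp read length are assumptions that mirror the -s3g and -l150 options used in the usage example in the next section.

```shell
# Hypothetical helper, not part of FermiKit.
# usage: reads_for_coverage COVERAGE GENOME_SIZE_BP READ_LEN_BP
# reads needed = coverage * genome size / read length (integer division)
reads_for_coverage() {
    echo $(( $1 * $2 / $3 ))
}

# 30-fold coverage of a ~3 Gbp genome with 150 bp reads (assumed values)
reads_for_coverage 30 3000000000 150   # prints 600000000
```

The same arithmetic run in reverse (reads * read length / genome size) gives a quick estimate of the coverage of an existing dataset.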

Installation and Usage

The only library dependency of FermiKit is zlib. To compile on Linux or Mac:

git clone --recursive https://github.com/lh3/fermikit.git
cd fermikit
make

This creates a fermikit/fermi.kit directory containing all the executables. You can copy the fermi.kit directory anywhere and invoke the pipeline by specifying its absolute or relative path:

# assemble reads into unitigs (-s specifies the genome size and -l the read length)
fermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix reads.fq.gz > prefix.mak
make -f prefix.mak
# call small variants and structural variations
fermi.kit/run-calling -t16 bwa-indexed-ref.fa prefix.mag.gz | sh

This generates prefix.mag.gz for the final assembly, prefix.flt.vcf.gz for filtered SNPs and short INDELs, and prefix.sv.vcf.gz for long deletions, novel sequence insertions and complex structural variations. If you have multiple FASTQ files and want to trim adapters before assembly:

fermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix \
    "fermi.kit/seqtk mergepe r1.fq r2.fq | fermi.kit/trimadap-mt -p4" > prefix.mak

It is also possible to call SNPs and short INDELs from multiple BAMs at the same time and produce a multi-sample VCF:

fermi.kit/htsbox pileup -cuf ref.fa pre1.srt.bam pre2.srt.bam > out.raw.vcf
fermi.kit/k8 fermi.kit/hapdip.js vcfsum -f out.raw.vcf > out.flt.vcf
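
Because the single-sample pipeline runs through make and a generated shell script, it can be handy to confirm that the three per-prefix files named above (prefix.mag.gz, prefix.flt.vcf.gz and prefix.sv.vcf.gz) were actually written before moving on. The following is a minimal sketch; check_outputs is a hypothetical helper, not a FermiKit command, and it only tests that each file exists and is non-empty, not that its content is valid.

```shell
# Hypothetical helper, not part of FermiKit.
# usage: check_outputs PREFIX
# Prints one status line per expected output file and returns non-zero
# if any of them is missing or empty.
check_outputs() {
    prefix=$1
    missing=0
    for suffix in mag.gz flt.vcf.gz sv.vcf.gz; do
        f="$prefix.$suffix"
        if [ -s "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_outputs prefix || echo "pipeline incomplete"`, using the same prefix that was passed to fermi2.pl with -p.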

Limitations

FermiKit does not use paired-end information during assembly, which potentially leads to a loss of power. In evaluations, the loss is minor for germline samples, and even without pair information FermiKit is more sensitive to short INDELs and long deletions. Furthermore, with the longer Illumina reads that are becoming available, it is actually preferable to merge overlapping ends in a pair before assembly and treat the merged reads as regular single-end reads (see AllPaths-LG and DISCOVAR).

Another technical limitation of FermiKit is that the error correction phase may take excessive RAM when the error rate is unusually high. In practice, this concern is also minor. I have assembled ~270 human samples and none of them required more than ~90GB RAM.

Running FermiKit twice on the same dataset with the same settings is likely to result in two slightly different assemblies. Please see bfc/count.c for the cause in BFC. Unitig construction also has a random factor in multi-threading mode. Nonetheless, FermiKit should call the same variants from the same assembly.
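
One way to check the last point on your own data is to compare the variant sites of two call sets. The sketch below is not part of FermiKit: vcf_sites and same_calls are hypothetical helpers, and treating two gzipped VCFs as "the same" when their sorted CHROM, POS, REF and ALT columns match is an assumption that ignores headers, IDs, qualities and genotypes.

```shell
# Hypothetical helpers, not part of FermiKit.

# vcf_sites FILE.vcf.gz: print CHROM, POS, REF and ALT of every record
# in a gzipped VCF, sorted, with header lines (starting with '#') removed.
vcf_sites() {
    gzip -cd "$1" | grep -v '^#' | cut -f1,2,4,5 | sort
}

# same_calls A.vcf.gz B.vcf.gz: succeed if both files list the same sites.
same_calls() {
    a=$(vcf_sites "$1")
    b=$(vcf_sites "$2")
    [ "$a" = "$b" ]
}
```

For example, `same_calls run1.flt.vcf.gz run2.flt.vcf.gz && echo identical` compares two filtered call sets (the file names here are placeholders). Sites that differ between the files can be listed by writing the output of vcf_sites to two files and passing them to comm or diff.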
