chhylp123 / Hifiasm

Licence: mit

Hifiasm: a haplotype-resolved assembler for accurate HiFi reads

Labels

bioinformatics genomics

Getting Started

# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make

# Run on test data (use -f0 for small datasets)
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
./hifiasm -o test -t4 -f0 chr11-2M.fa.gz 2> test.log
awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa  # get primary contigs in FASTA

# Assemble inbred/homozygous genomes (-l0 disables duplication purging)
hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
# Assemble heterozygous with built-in duplication purging
hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz

# Trio binning assembly (requiring https://github.com/lh3/yak)
yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz

Introduction

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.

Why Hifiasm?

Hifiasm delivers high-quality assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.
Given sequence reads from the parents, hifiasm produces the best overall haplotype-resolved assembly so far. It is the assembler chosen by the Human Pangenome Project for the first batch of samples.
Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.
Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.
Hifiasm is trivial to install and easy to use. It does not require Python, R or a C++11 compiler and can be compiled into a single executable. The default settings work well with a variety of genomes.

Usage

A typical hifiasm command line looks like:

hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

where NA12878.fq.gz provides the input reads, -t sets the number of CPUs in use and -o specifies the prefix of output files. For this example, the primary contigs are written to NA12878.asm.p_ctg.gfa and alternate contigs to NA12878.asm.a_ctg.gfa. On the first run, hifiasm saves corrected reads and overlaps to disk as NA12878.asm.*.bin. On later runs it reuses the saved results to avoid the time-consuming all-vs-all overlap calculation. You may specify -i to ignore precomputed overlaps and redo overlapping from raw reads.
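
Whether a rerun will pick up saved results can be checked before launching it. The sketch below assumes the three file names given under Output files (prefix.ec.bin, prefix.ovlp.source.bin and prefix.ovlp.reverse.bin). The helper name has_cached_bins is hypothetical, and the exact set of *.bin files may differ between hifiasm versions.

```shell
# Return success if all three cached files for an output prefix exist and
# are non-empty. File names follow the "Output files" section.
# has_cached_bins is a hypothetical helper, not part of hifiasm.
has_cached_bins() {
    prefix=$1
    for suffix in ec.bin ovlp.source.bin ovlp.reverse.bin; do
        [ -s "$prefix.$suffix" ] || return 1
    done
    return 0
}

# Example: report what the next run with "-o NA12878.asm" is likely to do.
if has_cached_bins NA12878.asm; then
    echo "cached reads and overlaps found; overlapping should be skipped"
else
    echo "no complete cache; hifiasm will recompute overlaps"
fi
```

A check like this only looks at file presence. It cannot tell whether the cached files were produced from the same reads, so -i remains the safe choice when the input has changed.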

Hifiasm purges haplotig duplications by default. For inbred or homozygous genomes, you may disable purging with option -l0. Old HiFi reads may contain short adapter sequences at the ends of reads; you can specify -z20 to trim both ends of reads by 20bp. For small genomes, use -f0 to disable the initial Bloom filter, which takes 16GB of memory at the beginning. For genomes much larger than human, -f38 or even -f39 is preferred to save memory on k-mer counting.
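
The 16GB figure is consistent with -f giving the size of the Bloom filter as a power of two in bits, with a default of 37. That reading is an assumption here and should be checked against the man page. Under it, the filter size for a given -f value can be worked out as follows (bloom_filter_gib is a hypothetical helper name):

```shell
# Print the Bloom filter size in GiB for a given -f value, ASSUMING the
# filter holds 2^N bits (an assumption; check the hifiasm man page).
# bloom_filter_gib is a hypothetical helper, not a hifiasm command.
bloom_filter_gib() {
    n=$1
    if [ "$n" -eq 0 ]; then
        echo 0            # -f0 disables the filter
        return
    fi
    # 2^n bits -> bytes (divide by 8) -> GiB (divide by 2^30)
    awk -v n="$n" 'BEGIN { printf "%g\n", 2^n / 8 / 2^30 }'
}

for f in 0 37 38 39; do
    echo "-f$f: $(bloom_filter_gib "$f") GiB"
done
```

Under this assumption the loop reports 0, 16, 32 and 64 GiB. The filter itself grows with larger -f values, so the saving mentioned above would have to come from the k-mer counting that follows it.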

When parental short reads are available, hifiasm can generate a pair of haplotype-resolved assemblies with trio binning. To perform such an assembly, first count k-mers with yak and then run the assembly:

yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz

Here NA12878.asm.hap1.p_ctg.gfa and NA12878.asm.hap2.p_ctg.gfa give the two haplotype assemblies. In the binning mode, hifiasm does not purge haplotig duplications by default. Because hifiasm reuses saved overlaps, you can generate both primary/alternate assemblies and trio binning assemblies with

hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log

The second command line will run much faster than the first. You can also dump error-corrected reads in FASTA and/or overlaps in PAF with

hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null

Output files

For non-trio assembly, hifiasm generates the following files:

Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
Haplotype-resolved processed unitig graph without small bubbles (prefix.p_utg.gfa). Small bubbles might be caused by somatic mutations or noise in the data and do not represent real haplotype information.
Primary assembly contig graph (prefix.p_ctg.gfa). This graph collapses different haplotypes.
Alternate assembly contig graph (prefix.a_ctg.gfa). This graph consists of all contigs that are discarded from the primary contig graph.

For trio assembly, hifiasm generates the following files:

Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information.
Phased paternal/haplotype1 contig graph (prefix.hap1.p_ctg.gfa). This graph keeps the phased paternal/haplotype1 assembly.
Phased maternal/haplotype2 contig graph (prefix.hap2.p_ctg.gfa). This graph keeps the phased maternal/haplotype2 assembly.

Hifiasm writes error corrected reads to the prefix.ec.bin binary file and writes overlaps to prefix.ovlp.source.bin and prefix.ovlp.reverse.bin.
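
Most downstream tools expect FASTA rather than GFA. A minimal sketch for converting whichever contig graphs a run produced, reusing the awk one-liner from Getting Started. The helper names gfa_to_fasta and convert_contig_graphs are hypothetical, and the suffixes are the ones listed above.

```shell
# Write the segment (S) lines of a GFA file to stdout as FASTA, using the
# same awk one-liner as in Getting Started.
gfa_to_fasta() {
    awk '/^S/{print ">"$2; print $3}' "$1"
}

# Convert every contig graph that exists for an output prefix.
# convert_contig_graphs is a hypothetical helper, not part of hifiasm.
convert_contig_graphs() {
    prefix=$1
    for kind in p_ctg a_ctg hap1.p_ctg hap2.p_ctg; do
        gfa="$prefix.$kind.gfa"
        [ -f "$gfa" ] || continue      # skip graphs this run did not produce
        gfa_to_fasta "$gfa" > "$prefix.$kind.fa"
        echo "wrote $prefix.$kind.fa"
    done
}
```

For example, convert_contig_graphs NA12878.asm would write NA12878.asm.p_ctg.fa and NA12878.asm.a_ctg.fa after a non-trio run, and the hap1/hap2 files after a trio run.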

Results

The following table shows the statistics of several hifiasm primary assemblies:

Dataset           Size    Cov.  Asm options  CPU time  Wall time  RAM   N50
Mouse (C57/BL6J)  2.6Gb   ×25   -t48 -l0     172.9h    4.8h       76G   21.1Mb
Maize (B73)       2.2Gb   ×22   -t48 -l0     203.2h    5.1h       68G   36.7Mb
Strawberry        0.8Gb   ×36   -t48 -D10    152.7h    3.7h       91G   17.8Mb
Frog              9.5Gb   ×29   -t48         2834.3h   69.0h      463G  9.3Mb
Redwood           35.6Gb  ×28   -t80         3890.3h   65.5h      699G  5.4Mb
Human (CHM13)     3.1Gb   ×32   -t48 -l0     310.7h    8.2h       114G  88.9Mb
Human (HG00733)   3.1Gb   ×33   -t48         269.1h    6.9h       135G  69.9Mb
Human (HG002)     3.1Gb   ×36   -t48         305.4h    7.7h       137G  98.7Mb
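
Dividing CPU time by wall time gives the average number of cores kept busy, which can be compared with the -t value to judge how well a run used the machine. A small sketch, with avg_cores as a hypothetical helper name and the inputs copied from the table:

```shell
# Average number of busy cores = CPU hours / wall-clock hours.
# avg_cores is a hypothetical helper; the inputs come from the table above.
avg_cores() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", cpu / wall }'
}

avg_cores 172.9 4.8     # Mouse, run with -t48
avg_cores 3890.3 65.5   # Redwood, run with -t80
```

This prints 36.0 for the mouse run and 59.4 for the redwood run, that is, roughly three quarters of the requested threads were busy on average in both cases.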

Hifiasm can assemble a 3.1Gb human genome in several hours or a ~30Gb hexaploid redwood genome in a few days on a single machine. For trio binning assembly:

Dataset                      Cov.  CPU time  Elapsed time  RAM   N50
HG00733, [father], [mother]  ×33   269.1h    6.9h          135G  35.1Mb (paternal), 34.9Mb (maternal)
HG002, [father], [mother]    ×36   305.4h    7.7h          137G  41.0Mb (paternal), 40.8Mb (maternal)
NA12878, [father], [mother]  ×30   180.8h    4.9h          123G  27.7Mb (paternal), 27.0Mb (maternal)

Except NA12878, the assemblies above were produced by hifiasm v0.12 and can be downloaded at

ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/hifiasm-0.12/

NA12878 was assembled with an older version of hifiasm and is available at

ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/
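
N50 values like those in the tables above can be recomputed from a contig graph. The sketch below uses the usual definition (the length L such that contigs of length at least L cover half of the assembled bases) and assumes each S line carries the sequence in its third field. The helper name gfa_n50 is hypothetical, and how the published figures were computed is not stated here.

```shell
# Print the N50 of the sequences on the S lines of a GFA file: the length L
# such that contigs of length >= L cover at least half of the total bases.
# gfa_n50 is a hypothetical helper, not part of hifiasm.
gfa_n50() {
    awk '/^S/{print length($3)}' "$1" |
    sort -rn |
    awk '{ len[NR] = $1; total += $1 }
         END {
             # walk from the longest contig down until half the bases are covered
             for (i = 1; i <= NR; i++) {
                 run += len[i]
                 if (run * 2 >= total) { print len[i]; exit }
             }
         }'
}
```

A call such as gfa_n50 NA12878.asm.p_ctg.gfa prints the N50 in bases. Some awk implementations limit line or field length, so very long contigs may need an awk without such a limit.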

Getting Help

For a detailed description of options, see the manual page with man ./hifiasm.1. The -h option of hifiasm also prints a brief description of options. If you have further questions, please raise an issue at the issue page.

Limitations

Purging haplotig duplications may introduce misassemblies.

Citation

Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5
