
lh3 / Miniasm

Licence: MIT
Ultrafast de novo assembly for long noisy reads (without a consensus step)

Getting Started

# Download sample PacBio reads from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap2 and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap2 && (cd minimap2 && make)
git clone https://github.com/lh3/miniasm  && (cd miniasm  && make)
# Overlap for PacBio reads (or use "-x ava-ont" for nanopore read overlapping)
minimap2/minimap2 -x ava-pb -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa
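The layout step writes a GFA graph rather than a FASTA file. As a minimal sketch, assuming the unitig sequences are stored on the tab-separated `S` (segment) lines of the GFA output, they could be extracted as follows. The function name `gfa_to_fasta` is hypothetical and not part of miniasm.

```shell
# gfa_to_fasta (hypothetical helper): print every GFA segment as a FASTA record.
# Assumes tab-separated S lines of the form: S <name> <sequence> [tags...]
# Segments whose sequence is "*" (sequence not stored) are skipped.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" && $3 != "*" { print ">" $2; print $3 }' "$1"
}
```

For example, `gfa_to_fasta reads.gfa > unitigs.fa` would write one FASTA record per unitig.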

Introduction

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap or its successor minimap2) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to that of the raw input reads.
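To make the input concrete, here is a small sketch that summarizes the read self-mappings, assuming they are in PAF format with the standard twelve leading columns (query name, length, start, end, strand, target name, length, start, end, matching bases, alignment block length, mapping quality). The function name `paf_overlaps` is hypothetical. The printed ratio of matching bases to block length is only a rough similarity proxy, not a true identity.

```shell
# paf_overlaps (hypothetical helper): list read-to-read overlaps from a PAF file.
# Output columns: query, target, aligned query span, matches / block length.
# Self hits and records with an empty alignment block are skipped.
paf_overlaps() {
    awk -F'\t' '
        NF >= 12 && $1 != $6 && $11 > 0 {
            printf "%s\t%s\t%d\t%.3f\n", $1, $6, $4 - $3, $10 / $11
        }
    ' "$1"
}
```

With the Getting Started files this could be run as `gzip -dc reads.paf.gz > reads.paf && paf_overlaps reads.paf | head`.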

Miniasm is still at an early stage of development. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. The ONT data were obtained from the Loman Lab.

For a C. elegans PacBio data set (only 40X is used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. A dotter plot of the miniasm assembly (on the X axis) against the HGAP3 assembly (on Y) gives a global view: the two are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.
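The N50 values quoted above follow the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation, reading one length per line from standard input, might look like this. The function name `n50` is hypothetical.

```shell
# n50 (hypothetical helper): read one sequence length per line on stdin and
# print the N50, i.e. the length at which the cumulative sum of lengths,
# sorted from longest to shortest, first reaches half of the total.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                acc += len[i]
                if (acc >= half) { print len[i]; exit }
            }
        }
    '
}
```

Assuming unitig sequences sit on the `S` lines of the GFA file, `awk -F'\t' '$1 == "S" { print length($3) }' reads.gfa | n50` would report the N50 of an assembly.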

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Algorithm Overview

  1. Crude read selection. For each read, find the longest contiguous region covered by three good mappings. Get an approximate estimate of read coverage.

  2. Fine read selection. Use the coverage information to find the good regions again but with more stringent thresholds. Discard contained reads.

  3. Generate a string graph. Prune tips, drop weak overlaps and collapse short bubbles. These procedures are similar to those implemented in short-read assemblers.

  4. Merge unambiguous overlaps to produce unitig sequences.
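The "discard contained reads" part of step 2 can be illustrated with a simplified sketch. It flags a read as contained when a single overlap spans it nearly end to end inside a longer read. This is an assumption-laden simplification: miniasm itself works on trimmed read coordinates with its own thresholds, and the function name `contained_reads` and the tolerance parameter are hypothetical.

```shell
# contained_reads (hypothetical helper): print names of reads that one overlap
# covers from end to end, allowing `tol` unaligned bases at each end.
# Usage: contained_reads <file.paf> [tol]
# Simplification: equal-length pairs are resolved by name order so that
# only one of the two reads is reported.
contained_reads() {
    awk -F'\t' -v tol="${2:-0}" '
        $1 != $6 && $3 <= tol && $2 - $4 <= tol &&
        ($2 < $7 || ($2 == $7 && ($1 "") > ($6 ""))) { print $1 }
    ' "$1" | sort -u
}
```

For instance, `contained_reads reads.paf 50` would list reads that are covered to within 50 bases of both ends by an overlap with a longer read.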

Limitations

  1. Consensus base quality is similar to input reads (may be fixed with a consensus tool).

  2. Only tested on a dozen high-coverage PacBio/ONT data sets (more testing needed).

  3. Prone to collapse repeats or segmental duplications longer than input reads (hard to fix without error correction).
