
adigenova / wengan

Licence: AGPL-3.0 license
An accurate and ultra-fast hybrid genome assembler



Wengan

An accurate and ultra-fast genome assembler

Version: 0.2 (18/05/2020)


SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Long-read-only hybrid assembly of PacBio Circular Consensus Sequence (CCS) and Nanopore reads with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600GB of RAM)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information, identifying misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.
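To check whether a library falls in the recommended coverage range, coverage can be estimated as the total number of sequenced bases divided by the genome size. The following is a minimal sketch, not part of Wengan: the helper names `count_bases` and `coverage` are hypothetical, the genome size is given in bases, and the FASTQ files are assumed to use plain 4-line records.

```shell
# Hypothetical helpers (not part of Wengan) to estimate sequencing coverage.

# count_bases FILE...: total number of bases in one or more gzipped FASTQ files.
# Assumes 4-line FASTQ records, with the sequence on the 2nd line of each record.
count_bases() {
    gzip -dc "$@" | awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }'
}

# coverage TOTAL_BASES GENOME_SIZE_BP: prints the coverage with one decimal.
coverage() {
    awk -v b="$1" -v g="$2" 'BEGIN { printf "%.1f\n", b / g }'
}

# Example: 500 million read pairs of 2 x 150bp on a 3 Gb genome.
# coverage $((500000000 * 2 * 150)) 3000000000    # prints 50.0
#
# Example on real files (file names taken from the synopsis):
# coverage "$(count_bases lib1.fwd.fastq.gz lib1.rev.fastq.gz)" 3000000000
```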

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. This is the lowest memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. This is the most memory-hungry mode of Wengan: assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
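The memory figures quoted for the three modes can be turned into a rough rule of thumb for choosing the `-a` value. This is only a sketch: `suggest_mode` is a hypothetical helper, not a Wengan feature, and its thresholds (40, 50 and 600GB) are simply the human-genome figures from the descriptions above, so they do not carry over to other genome sizes. The measured peaks in the benchmark table are somewhat higher, so leave some headroom.

```shell
# suggest_mode RAM_GB: prints a value for wengan.pl -a (M, A or D).
# Hypothetical helper; thresholds are the human-genome figures quoted in
# this README, not limits enforced by Wengan.
suggest_mode() {
    ram_gb=$1
    if   [ "$ram_gb" -ge 600 ]; then echo D   # DiscovarDenovo, ~600GB
    elif [ "$ram_gb" -ge 50 ];  then echo M   # Minia3, fastest, ~50GB
    elif [ "$ram_gb" -ge 40 ];  then echo A   # Abyss2, lowest memory
    else
        echo "less than 40GB of RAM: below the figures quoted for any mode" >&2
        return 1
    fi
}

# Example: suggest_mode 64    # prints M
```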

Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.

ontlon

preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 ~[8kb-60kb].

pacccs (experimental)

preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 ~[15kb]. This type of data is not fully supported yet.

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

| Genome | Long reads | Short reads | Wengan mode | NG50 (Mb) | CPU (h) | RAM (GB) | Fasta file |
|---|---|---|---|---|---|---|---|
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganA | 25.99 | 725 | 45 | asm |
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganM | 17.23 | 203 | 53 | asm |
| NA12878 | ONT 35X (rel5) | 2x250bp 60X (ENA:rs1, rs2) | WenganD | 35.31 | 589 | 622 | asm |
| HG00073 | PAC 90X (ENA:rl1) | 2x250bp 63X (ENA:rs1, rs2) | WenganD | 32.35 | 936 | 644 | asm |
| NA24385 | ONT 60X (GIAB:rl1) | 2x250bp 70X (GIAB:rs1) | WenganD | 50.59 | 963 | 651 | asm |
| CHM13 | ONT 50X (T2T:rel3) | 2x250bp 66X (ENA:rs1, rs2) | WenganD | 69.72 | 1198 | 646 | asm |

The assemblies generated using Wengan (v0.2) can be downloaded from Zenodo. All the assemblies were run as described in the Wengan manuscript. NG50 was computed using a genome size of 3.08Gb.
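For reference, NG50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assumed genome size (N50 uses half of the assembly length instead). The sketch below shows one way to compute it. It is not the script used for the benchmark: `fasta_lengths` and `ng50` are hypothetical helper names, and the FASTA file name in the usage comment is a placeholder.

```shell
# fasta_lengths [FILE...]: prints the length of each sequence in a FASTA file
# (or on stdin), one per line. Empty records are skipped.
fasta_lengths() {
    awk '/^>/ { if (n) print n; n = 0; next }
         { n += length($0) }
         END { if (n) print n }' "$@"
}

# ng50 GENOME_SIZE_BP: reads one contig length per line on stdin and prints
# the NG50. Returns 1 (and prints nothing) if the contigs sum to less than
# half of the genome size.
ng50() {
    sort -rn | awk -v g="$1" '
        !found { sum += $1; if (2 * sum >= g) { ng50 = $1; found = 1 } }
        END { if (found) print ng50; else exit 1 }'
}

# Usage with the genome size used in the benchmark (3.08Gb):
# fasta_lengths assembly.fasta | ng50 3080000000
```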

Wengan components

Getting the latest source code

Instructions

It is recommended to use/download the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases

Containers

To facilitate the execution of Wengan, we provide docker/singularity containers. Wengan images are hosted on Dockerhub and can be downloaded with the command:

docker pull adigenova/wengan:v0.2

Alternatively, using singularity:

export TMPDIR=/tmp
singularity pull docker://adigenova/wengan:v0.2

Run WenganM using singularity

#using singularity
CONTAINER=/path_to_container/wengan_v0.2.sif

#location of wengan in the container
WENGAN=/wengan/wengan-v0.2-bin-Linux/wengan.pl

#run WenganM with singularity exec
singularity exec $CONTAINER perl ${WENGAN} \
 -x pacraw \
 -a M \
 -s short.R1.fastq.gz,short.R2.fastq.gz \
 -l pacbio.clr.fastq.gz \
 -p asm_wengan -t 20 -g 3000

Building Wengan from source

To build Wengan from source, first fetch the repository together with its components:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

Each Wengan component has its own build instructions. After compiling, copy the resulting binaries to wengan-dir/bin.

Requirements

A C++ compiler (compilation was tested with GCC/7.3.0-2.30 on Linux and clang-1000.11.45.5 on Mac OSX) and cmake 3.2+.

Specific component source code versions used to build Wengan v0.2

  1. abyss commit d4b4b5d
  2. discovarexp-51885 commit f827bab
  3. minia commit 017d23e
  4. fastmin-sg commit 861b061
  5. intervalmiss commit 11be8b42
  6. liger commit 63a044b0
  7. seqtk commit 2efd0c8

Limitations

1. Genomes larger than 4Gb are not supported yet.
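A wrapper script can catch this limit before launching a long run. The sketch below assumes that the value passed to `-g` is the genome size in Mb, which is inferred from the `-g 3000` used for human genomes in the synopsis. `check_genome_size` is a hypothetical helper, not part of Wengan.

```shell
# check_genome_size GENOME_SIZE_MB: fails if the genome is larger than 4Gb.
# Assumption: the size is expressed in Mb, like the -g 3000 in the synopsis.
check_genome_size() {
    if [ "$1" -gt 4000 ]; then
        echo "genome size of $1 Mb exceeds the 4Gb (4000 Mb) limit" >&2
        return 1
    fi
}

# Usage:
# check_genome_size 3000 && echo "ok to run wengan.pl with -g 3000"
```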

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous group in south-central Chile. Wengan means "Making the path".

Citation

Di Genova, A., Buena-Atienza, E., Ossowski, S. and Sagot, M-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nature Biotechnology (2020).
