HASLR: fast hybrid assembly of long reads

Introduction

HASLR is a tool for rapid genome assembly of long sequencing reads. It is a hybrid assembler: it requires long reads generated by third-generation sequencing technologies (such as PacBio or Oxford Nanopore) together with next-generation sequencing short reads (such as Illumina) from the same sample. HASLR can assemble large genomes on a single computing node. Our experiments show that it can assemble a CHM1 human dataset in less than 10 hours using 64 CPU threads.

Installation

Requirements

  • GCC ≥ 4.8.5
  • Python3
  • zlib

Dependencies

HASLR depends on the following tools, which are installed automatically:

  • SPOA - For consensus calling of long reads
  • Minia - For assembling short reads
  • minimap2 - For aligning short read contigs onto long reads
  • fastutils - For FASTA/Q manipulation

Using conda

HASLR can be installed with the conda package manager from the bioconda channel:

conda install -c bioconda haslr

Building from source

git clone https://github.com/vpc-ccg/haslr.git
cd haslr
make

After a successful build, the bin directory should contain the following files:

bin/fastutils
bin/haslr_assemble
bin/haslr.py
bin/minia
bin/minimap2
bin/minia_nooverlap

Note that bin/haslr.py is the main Python wrapper of HASLR.

Command line

haslr.py [-t THREADS] -o OUT_DIR -g GENOME_SIZE -l LONG -x LONG_TYPE -s SHORT [SHORT ...]

Options

required arguments:
  -o, --out OUT_DIR              output directory
  -g, --genome GENOME_SIZE       estimated genome size; accepted suffixes are k,m,g
  -l, --long LONG                long read file
  -x, --type LONG_TYPE           type of long reads chosen from {pacbio,nanopore}
  -s, --short SHORT [SHORT ...]  short read file

optional arguments:
  -t, --threads THREADS          number of CPU threads to use [1]
  --cov-lr COV_LR                amount of long read coverage to use for assembly [25]
  --aln-block ALN_BLOCK          minimum length of alignment block [500]
  --aln-sim ALN_SIM              minimum alignment similarity [0.85]
  --edge-sup EDGE_SUP            minimum number of long read supporting each edge [3]
  --minia-kmer MINIA_KMER        kmer size used by minia [49]
  --minia-solid MINIA_SOLID      minimum kmer abundance used by minia [3]
  --minia-asm MINIA_ASM          type of minia assembly chosen from {contigs,unitigs} [contigs]
  -v, --version                  print version
  -h, --help                     show this help message and exit
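The -g and --cov-lr options can be read together as a budget of long-read bases: genome size times coverage. The sketch below illustrates that arithmetic in Python. It is not HASLR's own code: parse_genome_size and select_longest_reads are hypothetical helpers, and the longest-first selection is an assumption based on the description of lr25x.fasta as the "longest 25x coverage of long reads".

```python
def parse_genome_size(text):
    """Convert a size such as '4.6m' or '3g' into a number of bases.

    Hypothetical helper: accepts the k, m and g suffixes listed for -g.
    """
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(round(float(text)))


def select_longest_reads(read_lengths, genome_size, coverage):
    """Return indices of the longest reads until coverage * genome_size bases are reached.

    Assumption: reads are taken longest first; HASLR's actual rule may differ.
    """
    target = genome_size * coverage
    order = sorted(range(len(read_lengths)),
                   key=lambda i: read_lengths[i], reverse=True)
    chosen, total = [], 0
    for i in order:
        if total >= target:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen


# Default settings from the quick-start example: 4.6 Mbp genome, 25x coverage.
genome_size = parse_genome_size("4.6m")
target_bases = genome_size * 25
```

With these values, target_bases is 115,000,000, so roughly 115 Mbp of the longest reads would be kept under this reading of the options.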

Quick start

Here we assemble a sample E. coli dataset of PacBio and Illumina reads:

# download PacBio reads
wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_filtered.fastq.gz
# download Illumina reads
wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_miseq.1.fastq.gz
wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_miseq.2.fastq.gz
# run HASLR using 8 threads
haslr.py -t 8 -o ecoli -g 4.6m -l ecoli_filtered.fastq.gz -x pacbio -s ecoli_miseq.1.fastq.gz ecoli_miseq.2.fastq.gz
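When the same run needs to be launched from a script, the command can be assembled as an argument list. The following is a minimal sketch, assuming haslr.py is on the PATH. build_haslr_command is a hypothetical helper, not part of HASLR, and it only uses flags documented in the Options section.

```python
import shlex


def build_haslr_command(out_dir, genome_size, long_reads, long_type,
                        short_reads, threads=1):
    """Build the haslr.py argument list from the documented required options."""
    if long_type not in ("pacbio", "nanopore"):
        raise ValueError("long_type must be 'pacbio' or 'nanopore'")
    if not short_reads:
        raise ValueError("at least one short read file is required")
    cmd = ["haslr.py", "-t", str(threads), "-o", out_dir,
           "-g", genome_size, "-l", long_reads, "-x", long_type, "-s"]
    cmd.extend(short_reads)
    return cmd


cmd = build_haslr_command(
    out_dir="ecoli",
    genome_size="4.6m",
    long_reads="ecoli_filtered.fastq.gz",
    long_type="pacbio",
    short_reads=["ecoli_miseq.1.fastq.gz", "ecoli_miseq.2.fastq.gz"],
    threads=8,
)
command_line = shlex.join(cmd)  # shell-quoted form, useful for logging
```

The list can then be passed to subprocess.run(cmd, check=True) to start the assembly. Passing a list rather than a single string avoids shell quoting problems with unusual file names.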

Output files

With a successful run, the structure of the output directory will be as follows:

ecoli                                               # output directory
├── asm_contigs_k49_a3_lr25x_b500_s3_sim0.85        # output directory containing long read assembly files
│   ├── asm.final.ann                               # annotation of the final assembly
│   ├── asm.final.fa                                # final assembly in FASTA format
│   ├── backbone.01.init.gfa                        # initial backbone graph
│   ├── backbone.01.init.stat                       # statistics of graph stored in backbone.01.init.gfa
│   ├── backbone.02.weakEdge.gfa                    # backbone graph after removing weak edges
│   ├── backbone.02.weakEdge.stat                   # statistics of graph stored in backbone.02.weakEdge.gfa
│   ├── backbone.03.tip.gfa                         # backbone graph after tip removal
│   ├── backbone.03.tip.log                         # log of HASLR (step: tip removal)
│   ├── backbone.03.tip.stat                        # statistics of graph stored in backbone.03.tip.gfa
│   ├── backbone.04.simplebubble.gfa                # backbone graph after simple bubble removal
│   ├── backbone.04.simplebubble.log                # log of HASLR (step: simple bubble removal)
│   ├── backbone.04.simplebubble.stat               # statistics of graph stored in backbone.04.simplebubble.gfa
│   ├── backbone.05.superbubble.gfa                 # backbone graph after super bubble removal
│   ├── backbone.05.superbubble.log                 # log of HASLR (step: super bubble removal)
│   ├── backbone.05.superbubble.stat                # statistics of graph stored in backbone.05.superbubble.gfa
│   ├── backbone.06.smallbubble.gfa                 # backbone graph after small bubble removal 
│   ├── backbone.06.smallbubble.log                 # log of HASLR (step: small bubble removal)
│   ├── backbone.06.smallbubble.stat                # statistics of graph stored in backbone.06.smallbubble.gfa
│   ├── backbone.branching.log                      # list of branching nodes in backbone graph
│   ├── compact_uniq.txt                            # compact representation of long reads
│   ├── index.contig                                # contig index generated by HASLR
│   ├── index.longread                              # long read index generated by HASLR
│   ├── log_asmfinal.txt                            # log of HASLR (step: generating the assembly from the cleaned backbone graph)
│   ├── log_consensus.txt                           # log of HASLR (step: calling consensus sequence between anchors)
│   └── log_coordinate.txt                          # log of HASLR (step: calculating long read coordinates between anchors)
├── asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.err    # log of HASLR
├── asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.out    # 
├── lr25x.fasta                                     # longest 25x coverage of long reads
├── map_contigs_k49_a3_lr25x.log                    # log of minimap2
├── map_contigs_k49_a3_lr25x.paf                    # alignments of short read contigs onto long reads
├── sr.fofn                                         # list of short read files; Minia's input
├── sr_k49_a3.contigs.fa                            # short read contigs generated by Minia
├── sr_k49_a3.contigs.nooverlap.250.fa              # short read contigs at least 250 bp in length without overlaps
├── sr_k49_a3.contigs.nooverlap.fa                  # short read contigs without overlaps
├── sr_k49_a3.h5                                    # 
├── sr_k49_a3.log                                   # log of Minia
└── sr_k49_a3.unitigs.fa                            # short read unitigs generated by Minia
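The file and directory names in this listing appear to encode the parameter values of the run. The sketch below reconstructs a few of them from the defaults. The naming pattern is inferred only from the example above, so treat it as an assumption: expected_output_names is a hypothetical helper, and HASLR may format other values differently (for example a similarity of 0.9).

```python
def expected_output_names(minia_asm="contigs", minia_kmer=49, minia_solid=3,
                          cov_lr=25, aln_block=500, edge_sup=3, aln_sim=0.85):
    """Guess output names from parameters, following the pattern in the example listing."""
    sr_prefix = f"sr_k{minia_kmer}_a{minia_solid}"
    map_prefix = f"map_{minia_asm}_k{minia_kmer}_a{minia_solid}_lr{cov_lr}x"
    asm_dir = (f"asm_{minia_asm}_k{minia_kmer}_a{minia_solid}_lr{cov_lr}x"
               f"_b{aln_block}_s{edge_sup}_sim{aln_sim}")
    return {
        "long_reads": f"lr{cov_lr}x.fasta",
        "short_read_contigs": f"{sr_prefix}.contigs.fa",
        "alignments": f"{map_prefix}.paf",
        "asm_dir": asm_dir,
        "final_assembly": f"{asm_dir}/asm.final.fa",
    }


names = expected_output_names()
```

A helper like this can be handy for locating asm.final.fa after a parameter sweep, but the actual directory contents should always be checked rather than relying on the guessed names.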

Important notes:

  • The names of output files and folders will change depending on the parameters passed to haslr.py.
  • If you would like to run HASLR with multiple parameter sets, pass the same output directory via -o/--out each time. HASLR will then reuse existing files instead of creating them from scratch, which speeds up experiments that test several parameter sets.
  • The backbone graph is stored in GFA format, which can be visualized using Bandage. For instance, you can inspect backbone.06.smallbubble.gfa, which is the backbone graph after all simplifications.
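For a quick look at a backbone graph without opening Bandage, the GFA records can be tallied by type. The following is a minimal sketch, assuming the files follow the tab-separated GFA layout in which the first field of each line is the record type (S for segments, L for links). count_gfa_records is a hypothetical helper, and the example lines are made up for illustration, not taken from HASLR output.

```python
from collections import Counter


def count_gfa_records(lines):
    """Count GFA records by their type field (first tab-separated column)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Made-up example: a header, two segments and one link between them.
example = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "S\t2\tGGCA\n",
    "L\t1\t+\t2\t+\t0M\n",
]
counts = count_gfa_records(example)
```

For a real file, pass an open file handle instead of the list, for example with open("backbone.06.smallbubble.gfa") as handle: count_gfa_records(handle). Comparing the counts across the backbone.01 to backbone.06 files gives a rough sense of how much each simplification step removed.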

Preprint

Haghshenas E., Asghari H., Stoye J., Chauve C., and Hach F. (2020) bioRxiv. doi:10.1101/2020.01.27.921817

Bug report

Please report bugs through HASLR's issue tracker at https://github.com/vpc-ccg/haslr/issues.

Copyright and License

This software is released under the GNU General Public License (v3.0). The bundled dependencies carry their own licenses:

  • SPOA is released under MIT license
  • Minia is released under AGPL license
  • minimap2 is released under MIT license
  • fastutils is released under GPL license

Author

Ehsan Haghshenas (ehaghshe AT sfu DOT ca)
