lpryszcz / redundans

Licence: GPL-2.0 license
Redundans is a pipeline that assists the assembly of heterozygous/polymorphic genomes.

Redundans

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes assembled contigs, sequencing libraries and/or a reference sequence as input and returns a scaffolded homozygous genome assembly. The final assembly should be less fragmented than the input contigs and have a smaller total size. In addition, Redundans automatically closes the gaps resulting from genome assembly or scaffolding.

The pipeline consists of several steps (modules):

  1. de novo contig assembly (optional; performed only if no contigs are given)
  2. redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
  3. scaffolding: joining of genome fragments using paired-end reads, mate-pairs, long reads and/or reference chromosomes
  4. gap closing: filling the gaps after scaffolding using paired-end and/or mate-pair reads

Redundans is:

  • fast & lightweight: multi-core support and memory optimisation allow it to run even on a laptop for small-to-medium size genomes
  • flexible: it supports many sequencing technologies (Illumina, 454, Sanger, PacBio & Nanopore) and library types (paired-end, mate pairs, fosmids, long reads)
  • modular: every step can be omitted or replaced by other tools
  • reliable: it has already been used to improve genome assemblies varying in size (several Mb to several Gb) and complexity (fungi, animals & plants)

For more information have a look at the documentation, poster, publication, test dataset or manual.

Prerequisites

Redundans uses several programs, all provided within this repository.

On most Linux distros, the installation should be as easy as:

git clone --recursive https://github.com/lpryszcz/redundans.git
cd redundans && bin/.compile.sh

If it fails, make sure the following dependencies are installed:

  • Python 2.7 or 2.6
  • Perl [SSPACE3]
  • make, gcc & g++ [BWA & LAST], e.g. sudo apt-get install make gcc g++
  • zlib including zlib.h headers [BWA], e.g. sudo apt-get install zlib1g-dev
  • optionally, numpy and matplotlib for plotting, e.g. sudo -H pip install -U matplotlib numpy

For user convenience, we provide a UNIX installer and a Docker image that can be used instead of manual installation.

Unofficial conda package

If you are familiar with conda, this will be by far the easiest way of installing redundans:

# create new Python2 environment
conda create -n redundans python=2.7
# activate it
conda activate redundans
# and install redundans
conda install -c genomedk redundans 

Note that this is an unofficial channel and may not be completely up to date with this repo.

UNIX installer

The UNIX installer will automatically fetch, compile and configure Redundans together with all dependencies. It should work on all modern Linux systems, provided that Python 2.7, commonly used programs (e.g. wget, curl, git, perl, gcc, g++, ldconfig) and libraries (zlib including zlib.h) are installed.

source <(curl -Ls http://bit.ly/redundans_installer)

Docker image

First, you need to install docker: wget -qO- https://get.docker.com/ | sh
Then, you can run the test example by executing:

# process the data inside the image - all data will be lost at the end
docker run -it -w /root/src/redundans lpryszcz/redundans ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1

# if you wish to process local files, you need to mount the volume with -v
## make sure you are in redundans repo directory (containing test/ directory)
docker run -v `pwd`/test:/test:rw -it lpryszcz/redundans /root/src/redundans/redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1

Docker images are very handy, but they have certain limitations. The most annoying for me is the lack of autocompletion, unless you specify the path on the host and in the container in exactly the same manner as in the example above. In addition, the volume needs to be mounted every time, which makes the commands somewhat complex.

Running the pipeline

Redundans input consists of any combination of:

  • assembled contigs (FastA)
  • paired-end and/or mate-pair reads (FastQ*)
  • long reads (FastQ/FastA*); both PacBio and Nanopore are supported
  • and/or reference chromosomes/contigs (FastA).

* gzipped files are also accepted.
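Since inputs may be plain or gzipped, a quick sanity check of a contig file before a run can be handy. The following is a minimal Python 3 sketch, independent of Redundans itself (which targets Python 2.7); the helper names `open_maybe_gzip`, `fasta_lengths` and `demo_lengths` are hypothetical and not part of the pipeline.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path):
    """Open a text file, treating a .gz suffix as gzip-compressed."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_lengths(path):
    """Return {sequence_name: length} for a FastA file (plain or gzipped)."""
    lengths = {}
    name = None
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def demo_lengths():
    """Write a tiny gzipped FastA to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "contigs.fa.gz")
        with gzip.open(path, "wt") as out:
            out.write(">c1 first contig\nACGT\nAC\n>c2\nGGG\n")
        return fasta_lengths(path)
```

On the repository's test data, `fasta_lengths("test/contigs.fa")` would give the per-contig lengths that the later statistics build on.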

Redundans will return a homozygous genome assembly in scaffolds.filled.fa (FastA).
In addition, the program reports statistics for every pipeline step, including the number of contigs that were removed, GC content, N50, N90 and the size of gap regions.
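For readers unfamiliar with these statistics, here is a minimal Python 3 sketch of how N50, N90, GC content and gap size are commonly defined. It is an illustration under the usual textbook definitions, not the code Redundans uses for its report, and the function names are hypothetical.

```python
def nx(lengths, fraction):
    """Return the length L such that sequences of length >= L
    together cover at least `fraction` of the total assembly size.
    fraction=0.5 gives N50, fraction=0.9 gives N90."""
    total = sum(lengths)
    if total == 0:
        return 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gap_size(sequence):
    """Number of N characters, i.e. bases inside gap regions."""
    return sequence.upper().count("N")
```

For example, for contig lengths 10, 8, 6, 4 and 2 (total 30), the two longest contigs already cover half of the assembly, so N50 is 8, while N90 is 4.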

Parameters

For user convenience, Redundans is equipped with a wrapper that automatically estimates run parameters and executes all steps/modules. You should specify some sequencing libraries (FastA/FastQ) or a reference sequence (FastA) in order to perform scaffolding. If you don't specify contigs with -f (FastA), Redundans will assemble contigs de novo, but you'll have to provide paired-end and/or mate-pair reads (FastQ).
HINT: If your run fails, you may try to resume it by adding the --resume parameter.
Most of the pipeline parameters can be adjusted manually (default values are given in square brackets []):

  • General options:
  -h, --help            show this help message and exit
  -v, --verbose         verbose
  --version             show program's version number and exit
  -i FASTQ, --fastq FASTQ
                        FASTQ PE / MP files
  -f FASTA, --fasta FASTA
                        FASTA file with contigs / scaffolds
  -o OUTDIR, --outdir OUTDIR
                        output directory [redundans]
  -t THREADS, --threads THREADS
                        no. of threads to run [4]
  --resume              resume previous run
  --log LOG             output log to [stderr]
  --nocleaning
  • Reduction options:
  --identity IDENTITY   min. identity [0.51]
  --overlap OVERLAP     min. overlap  [0.80]
  --minLength MINLENGTH
                        min. contig length [200]
  --noreduction         Skip reduction
  • Short-read scaffolding options:
  -j JOINS, --joins JOINS
                        min pairs to join contigs [5]
  -a LINKRATIO, --linkratio LINKRATIO
                        max link ratio between two best contig pairs [0.7]
  --limit LIMIT         align subset of reads [0.2]
  -q MAPQ, --mapq MAPQ  min mapping quality [10]
  --iters ITERS         iterations per library [2]
  --noscaffolding       Skip short-read scaffolding
  • Long-read scaffolding options:
  -l LONGREADS, --longreads LONGREADS
                        FastQ/FastA files with long reads
  --identity IDENTITY   min. identity [0.51]
  --overlap OVERLAP     min. overlap  [0.80]
  • Reference-based scaffolding options:
  -r REFERENCE, --reference REFERENCE
                        reference FastA file
  --norearrangements    high identity mode (rearrangements not allowed)
  --identity IDENTITY   min. identity [0.51]
  --overlap OVERLAP     min. overlap  [0.80]
  • Gap closing options:
  --iters ITERS         iterations per library [2]
  --nogapclosing                        

Redundans is extremely flexible. All steps of the pipeline can be omitted using the --noreduction, --noscaffolding and/or --nogapclosing parameters.
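To give an intuition for the reduction thresholds listed above (--identity, --overlap, --minLength), here is a minimal Python 3 sketch of a threshold-based redundancy filter. It is an assumption-laden illustration, not Redundans' actual algorithm: the `Hit` record is hypothetical, and overlap is assumed to mean the aligned fraction of the shorter contig.

```python
from collections import namedtuple

# Hypothetical record for one contig-vs-contig alignment;
# this is not Redundans' internal format.
Hit = namedtuple("Hit", "query target identity aligned_bases")


def redundant_contigs(lengths, hits, identity=0.51, overlap=0.80, min_length=200):
    """Return the set of contig names that this sketch would discard.

    lengths: {contig_name: length}
    hits:    iterable of Hit records
    Defaults mirror the documented defaults for --identity, --overlap
    and --minLength.
    """
    # Contigs shorter than the minimum length are dropped outright.
    remove = {name for name, size in lengths.items() if size < min_length}
    for hit in hits:
        if hit.query == hit.target:
            continue
        q_len = lengths[hit.query]
        t_len = lengths[hit.target]
        if q_len == 0:
            continue
        # Only the shorter contig of a pair may be dropped;
        # ties are broken by name so that one of the two is always kept.
        if (q_len, hit.query) > (t_len, hit.target):
            continue
        covered = hit.aligned_bases / q_len
        if hit.identity >= identity and covered >= overlap:
            remove.add(hit.query)
    return remove
```

With these assumptions, raising identity or overlap makes the filter more conservative, so fewer contigs are treated as redundant copies of a longer one.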

Test run

To run the test example, execute:

./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run1

# if your run failed for any reason, you can try to resume it
rm test/run1/_sspace.2.1.filled.fa
./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run1 --resume

# if you have no contigs assembled, just run without `-f`
./redundans.py -v -i test/*_?.fq.gz -o test/run.denovo

Note that the order of libraries (-i/--fastq) is not important, as long as read1 and read2 from each library are given one after another, e.g. -i 600_1.fq.gz 600_2.fq.gz 5000_1.fq.gz 5000_2.fq.gz would be interpreted the same as -i 5000_1.fq.gz 5000_2.fq.gz 600_1.fq.gz 600_2.fq.gz.
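The pairing rule described here can be sketched in a few lines of Python 3. This is only an illustration of "consecutive files form one library"; the function name `pair_libraries` is hypothetical and the sketch does not reproduce how Redundans parses its arguments.

```python
def pair_libraries(fastq_files):
    """Group a flat list of FastQ files into (read1, read2) libraries,
    assuming the two files of each library are adjacent in the list."""
    if len(fastq_files) % 2:
        raise ValueError("expected an even number of FastQ files")
    return [
        (fastq_files[i], fastq_files[i + 1])
        for i in range(0, len(fastq_files), 2)
    ]


# Both orderings yield the same set of libraries.
first = pair_libraries(["600_1.fq.gz", "600_2.fq.gz", "5000_1.fq.gz", "5000_2.fq.gz"])
second = pair_libraries(["5000_1.fq.gz", "5000_2.fq.gz", "600_1.fq.gz", "600_2.fq.gz"])
same_libraries = sorted(first) == sorted(second)
```

Interleaving the libraries (for instance 600_1, 5000_1, 600_2, 5000_2) would pair files from different libraries, which is why the adjacency requirement matters.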

You can combine inputs freely (paired-end reads, mate pairs, long reads and/or a reference for scaffolding), for example:

# reduction, scaffolding with paired-end, mate pairs and long reads, and gap closing with paired-end and mate pairs
./redundans.py -v -i test/*_?.fq.gz -l test/pacbio.fq.gz test/nanopore.fa.gz -f test/contigs.fa -o test/run_short_long

# scaffolding and gap closing with paired-end and mate pairs (no reduction)
./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run_short-scaffolding-closing --noreduction

# reduction, reference-based scaffolding and gap closing with paired-end reads (--noscaffolding disables only short-read scaffolding)
./redundans.py -v -i test/600_?.fq.gz -r test/ref.fa -f test/contigs.fa -o test/run_ref_pe-closing --noscaffolding

For more details have a look in the test directory.
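If you launch many runs from a script, assembling the command line programmatically can reduce typos. The following is a minimal Python 3 sketch that only builds the argument list from the options documented above (-i, -f, -l, -r, -o, -t, --resume and the three --no* switches); the function `build_command` and its keyword names are hypothetical, and nothing is executed.

```python
import shlex


def build_command(fastq=(), fasta=None, longreads=(), reference=None,
                  outdir="redundans", threads=4, skip=(), resume=False,
                  script="./redundans.py"):
    """Return an argv list for a Redundans run; does not execute anything."""
    steps = ("reduction", "scaffolding", "gapclosing")
    unknown = set(skip) - set(steps)
    if unknown:
        raise ValueError("unknown steps to skip: %s" % ", ".join(sorted(unknown)))
    # Without contigs, de novo assembly needs short reads.
    if fasta is None and not fastq:
        raise ValueError("provide contigs (-f) or paired-end/mate-pair reads (-i)")
    cmd = [script, "-v"]
    if fastq:
        cmd += ["-i"] + list(fastq)
    if longreads:
        cmd += ["-l"] + list(longreads)
    if fasta:
        cmd += ["-f", fasta]
    if reference:
        cmd += ["-r", reference]
    cmd += ["-o", outdir, "-t", str(threads)]
    # Emit --no<step> flags in a fixed order for reproducible output.
    cmd += ["--no" + step for step in steps if step in skip]
    if resume:
        cmd.append("--resume")
    return cmd


example = build_command(
    fastq=["test/600_1.fq.gz", "test/600_2.fq.gz"],
    fasta="test/contigs.fa",
    reference="test/ref.fa",
    outdir="test/run_ref_pe-closing",
    skip=["scaffolding"],
)
# Shell-safe rendering of the command, e.g. for logging.
example_line = " ".join(shlex.quote(arg) for arg in example)
```

The resulting list can be passed to `subprocess.run(example, check=True)` from the repository directory; the example mirrors the reference-based scaffolding run shown above, with the thread count spelled out explicitly.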

Support

If you have any issues or doubts, check the documentation and FAQ (Frequently Asked Questions). You may also want to sign up to our forum.

Citation

Leszek P. Pryszcz and Toni Gabaldón (2016) Redundans: an assembly pipeline for highly heterozygous genomes. NAR. doi: 10.1093/nar/gkw294
