berokka

Trim, circularise, orient & filter long read bacterial genome assemblies

Introduction

There is already a good piece of software to trim/circularise and orient genome assemblies called Circlator. Please try that first!

You should only try Berokka if:

  1. You only have the contig files and do not have the corrected reads anymore
  2. Your contigs are simple cases with clear overhang and could be done manually with BLAST
  3. Circlator fails on your data even after troubleshooting
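
To make the idea of an "overhang" concrete: when a circular replicon is assembled from long reads, the contig often runs past its own starting point, so the end of the contig repeats the beginning. The following is only a minimal sketch of that idea using exact string matching. It is not how Berokka works internally (Berokka uses BLAST, which tolerates mismatches), and the function names `find_overhang` and `trim_overhang` are hypothetical.

```python
def find_overhang(seq: str, min_len: int = 4) -> int:
    """Return the length of the longest suffix of seq that exactly
    matches a prefix of seq, or 0 if none is at least min_len long.

    Assumption for this sketch: the overhang is an exact repeat and
    is no longer than half the contig.
    """
    seq = seq.upper()
    # Try the longest candidate first so the first hit is the longest.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_overhang(seq: str, min_len: int = 4):
    """Remove the repeated end of the contig, if one is found.

    Returns (trimmed_sequence, number_of_bases_removed).
    """
    k = find_overhang(seq, min_len)
    return (seq[:-k] if k else seq), k


# Toy example: a 14 bp "genome" whose first 5 bp are repeated at the end.
genome = "ACGTTGCAAGGCTA"
contig = genome + genome[:5]
trimmed, removed = trim_overhang(contig)
```

This brute-force comparison is quadratic in the worst case and fails on a single sequencing error in the repeated region, which is why real data calls for an alignment-based approach.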

NOTE: orientation to dnaA or rep genes is not yet implemented.

Installation

Homebrew

Homebrew will install all the dependencies for you on Linux or macOS:

brew install brewsci/bio/berokka

Conda

Bioconda will take care of everything:

conda install -c conda-forge -c bioconda -c defaults berokka

Source

git clone https://github.com/tseemann/berokka.git
./berokka/bin/berokka -h

You will need to install all the dependencies manually:

  • BioPerl >= 1.6 (for Bio::SeqIO and Bio::SearchIO)
  • BLAST+ >= 2.3.0 (for blastn)

Usage

Input

Input should be completed long-read assemblies in FASTA format, such as those from Canu or HGAP.

Usage

% berokka --outdir trimdir canu.contigs.fasta
<snip>
Did you know? berokka is a play on the concept of overhang vs hangover

% ls trimdir/
01.input.fa
02.trimmed.fa
03.results.tab

% cat trimdir/03.results.tab

#sequence       status  old_len new_len trimmed
tig00000000     trimmed 5461026 5448790 12236
tig00000002     trimmed 138825  113601  25224
tig00000003     trimmed 57075   43297   13778
tig00000004     kept    24900   24900   0
tig00000006     trimmed 1620    1320    300
tig00000007     removed 2380    0       0

Output

Filename         Format  Description
01.input.fa      FASTA   All the input sequences
02.trimmed.fa    FASTA   The (possibly) trimmed sequences
03.results.tab   TSV     Summary of results
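
If you want to post-process the summary, the sketch below shows one way to read 03.results.tab in Python. It assumes the column layout shown in the example above (a header line starting with `#`, then sequence, status, old_len, new_len, trimmed) and that sequence names contain no whitespace. The names `read_results` and `SAMPLE` are hypothetical, and the sample text is simply the example output copied from this README.

```python
from collections import Counter

# Example rows copied from the 03.results.tab shown in this README.
SAMPLE = """\
#sequence       status  old_len new_len trimmed
tig00000000     trimmed 5461026 5448790 12236
tig00000002     trimmed 138825  113601  25224
tig00000003     trimmed 57075   43297   13778
tig00000004     kept    24900   24900   0
tig00000006     trimmed 1620    1320    300
tig00000007     removed 2380    0       0
"""


def read_results(lines):
    """Parse results rows into a list of dicts, skipping the header."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or header
        # split() accepts tabs or runs of spaces between columns
        name, status, old_len, new_len, trimmed = line.split()
        rows.append({
            "sequence": name,
            "status": status,
            "old_len": int(old_len),
            "new_len": int(new_len),
            "trimmed": int(trimmed),
        })
    return rows


rows = read_results(SAMPLE.splitlines())
status_counts = Counter(row["status"] for row in rows)
total_trimmed = sum(row["trimmed"] for row in rows)
```

For a real run you would pass an open file handle instead, for example `read_results(open("trimdir/03.results.tab"))`.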

The FASTA headers in 02.trimmed.fa are augmented with new data (unless --noanno is used):

  • circular=true - marks the sequence as circular (Rebaler uses this)
  • overhang=N - records that N bp were trimmed off
  • len=N - updated to the new contig length, if len= was already present (Canu adds this)
  • suggestCircular=yes - replaces suggestCircular=no, if that was present (Canu adds this)
  • class=replicon - replaces class=contig, if that was present and the contig was circularised
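
Downstream scripts can pick these annotations out of the header line. The sketch below is one possible approach, assuming the annotations appear as whitespace-separated `key=value` tokens after the sequence ID. The function name `parse_header` and the example header are hypothetical; a real header from your assembler may carry additional fields.

```python
def parse_header(header: str):
    """Split a FASTA header into (sequence_id, {key: value}).

    Tokens without an '=' are ignored in this sketch.
    """
    tokens = header.lstrip(">").split()
    seq_id = tokens[0]
    annotations = {}
    for token in tokens[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            annotations[key] = value
    return seq_id, annotations


# Hypothetical header, built from the fields described above.
example = ">tig00000000 len=5448790 suggestCircular=yes circular=true overhang=12236"
seq_id, info = parse_header(example)
is_circular = info.get("circular") == "true"
overhang_bp = int(info.get("overhang", 0))
```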

Options

  • --filter <FASTA> allows you to remove contigs which match 50% of sequences in this file. Berokka comes with the standard PacBio control sequence. You can provide your own FASTA file using this option. If you want to disable filtering, use --filter 0.

  • --readlen LENGTH can be used for datasets that fail to circularise. It affects the length of the match that Berokka attempts to make using BLAST.

  • --noanno will ensure that the FASTA descriptions are not altered between the input and output FASTA files.

  • --keepfiles and --debug are primarily for use by the developer.
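
If you call Berokka from a pipeline, it can help to assemble the command line in one place. The sketch below only builds the argument list from the options documented above and does not run anything. The helper name `build_berokka_command` is hypothetical, and the `--readlen` value in the example is an arbitrary illustration, not a recommended setting.

```python
from typing import List, Optional


def build_berokka_command(
    contigs: str,
    outdir: str,
    filter_fasta: Optional[str] = None,
    readlen: Optional[int] = None,
    noanno: bool = False,
) -> List[str]:
    """Return an argv list such as: berokka --outdir DIR [options] CONTIGS."""
    cmd = ["berokka", "--outdir", outdir]
    if filter_fasta is not None:
        # Pass "0" here to disable filtering, as described above.
        cmd += ["--filter", filter_fasta]
    if readlen is not None:
        cmd += ["--readlen", str(readlen)]
    if noanno:
        cmd.append("--noanno")
    cmd.append(contigs)  # input FASTA goes last, as in the usage example
    return cmd


cmd = build_berokka_command(
    "canu.contigs.fasta", "trimdir", filter_fasta="0", readlen=20000
)
# To execute it (assuming berokka is on your PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```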

Etymology

Berocca is a brand of effervescent drink and vitamin tablets containing vitamin B and C. It is a popular cure for a hangover. A key role of the berokka tool is to remove the "overhang" that occurs at the ends of long-read assemblies of circular genomes.

Feedback

Please file questions, bugs or ideas on the Issue Tracker.

License

GPLv3

Citation

Not published yet.

Authors

  • Torsten Seemann