GAWN v0.3.5

Genome Annotation Without Nightmares

Developed by Eric Normandeau in Louis Bernatchez's laboratory with suggestions and important code contributions from Jérémy Le Luyer.

Description

GAWN is a genome annotation pipeline that uses an assembled transcriptome (in nucleotides, not amino acids), either from the same species or from a related species, to create an evidence-based genome annotation. Its primary goal is to provide a good-enough genome annotation in a fraction of the time and effort required to run more complete genome annotation pipelines. It relies on existing tools, such as GMAP, TransDecoder, blastx, and the Swissprot database, to produce the annotation. The result files are:

  • A GFF3 annotation file
  • A transcript annotation .tsv table
  • A genome annotation .tsv table

The .tsv tables are formatted to maximize usability by non-specialized users.

Use cases

This approach is especially useful for annotating the genome of a species that has a good assembled transcriptome. It also works when a good transcriptome is available for a related species. GAWN only annotates genes for which transcripts are available; as such, it does not depend on ab initio gene prediction models.

Overview of the analyses

During the analyses, the following steps are performed:

  • Index the genome (GMAP)
  • Annotate genes using available transcripts (GMAP)
  • Annotate the transcripts (blastx and the Swissprot database)
  • Produce a transcriptome annotation table (Python script)
  • Produce a genome annotation table (Python script)
  • TODO: add CpG island annotations
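
To illustrate how blastx results can feed the transcriptome annotation table, here is a minimal Python sketch that keeps the best Swissprot hit for each transcript. It assumes the standard 12-column BLAST+ tabular output (-outfmt 6); whether GAWN's own scripts use this exact format or selection rule is an assumption, and the function name best_hits and the example rows are made up for the illustration.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {transcript_id: hit}, keeping the highest-bitscore hit per transcript.

    Hypothetical helper, not part of GAWN.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up example rows, not real GAWN output.
example = (
    "tr1\tsp|P00001|A_HUMAN\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "tr1\tsp|P00002|B_HUMAN\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t250\n"
    "tr2\tsp|P00003|C_MOUSE\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60\n"
)
hits = best_hits(io.StringIO(example))
for transcript, hit in sorted(hits.items()):
    print(transcript, hit["sseqid"], hit["bitscore"])
```

Ranking by bitscore is only one reasonable choice; ranking by e-value or applying a minimum identity threshold would be equally easy to add.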

Resources needed

GAWN depends on several tools to annotate genomes, and its requirements in terms of RAM, disk space, and time depend on these tools. Here are example requirements for three eukaryote genomes. The annotations were run on a Lenovo ThinkStation D20 with 8 Xeon CPUs (16 threads, 2.40GHz) on Linux Mint 17 (Ubuntu 16.04). All of these datasets, except Salvelinus fontinalis, were run using the most recent genomes and transcriptomes available from Genbank.

Genome                    Size (Gbp)  RAM (GB)  Final disk space (GB)  Time (h)
Human genome              3.29        16        37                     ~48
Salvelinus fontinalis     2.67        14.3      31.2                   ~48
Drosophila melanogaster   1.45        10.2      3.1                    28

Installation

To use GAWN, you will need a local copy of its repository, which can be found at https://github.com/enormandeau/gawn. Just download and unzip the folder. Use a freshly downloaded folder for each analysis.

Different releases can be accessed from the repository's releases page. We suggest using the latest release. Avoid any release prior to 0.3.1, as some of these older releases are broken for some versions of the dependencies.

Dependencies

You will also need to have the following programs installed on your computer. The version numbers are the ones that have been tested. It is suggested that you use these or more recent versions.

  • GNU Linux or OSX
  • bash 4+
  • python 2.7+ or 3.6+
  • cufflinks v2.2.1+
  • gmap (2017-10-12)
  • wget 1.17.1
  • gnu parallel 2017xxxx+
  • blastplus utilities (blastx) 2.7.1+ (Very important, do not use old blastplus binaries)
  • a local copy of the swissprot database: ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz

The relevant TransDecoder scripts are included with their license in 01_scripts/TransDecoder.
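
Before launching a long run, it can help to confirm that the programs listed above are reachable on the PATH. The following is a minimal Python sketch, not part of GAWN; the executable names in REQUIRED_TOOLS are assumptions about how these tools are typically installed and may differ on your system. It does not check version numbers.

```python
import shutil

# Assumed executable names for the dependencies listed above; adjust as needed.
REQUIRED_TOOLS = ["cufflinks", "gmap", "gmap_build", "wget", "parallel", "blastx"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All listed tools were found on PATH")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.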

Running the pipeline

For each new project, get a new copy of GAWN's repository from the sources listed in the Installation section and copy your data into the 03_data folder.

  • Install dependencies
  • Download GAWN repository (see Installation section above)
  • Put your genome and transcriptome fasta files (uncompressed) in 03_data
  • Edit the parameters in 02_infos/gawn_config.sh (you can rename the file)
  • Run the following command:
./gawn 02_infos/gawn_config.sh  # or your renamed file
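
Since the input fasta files must be uncompressed, a quick check before launching the pipeline can save a failed run. Here is a minimal Python sketch, not part of GAWN, that flags missing, gzip-compressed, or non-FASTA files by reading their first two bytes. The function name check_fasta and the example file names are hypothetical.

```python
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def check_fasta(path):
    """Return None if path looks like an uncompressed FASTA file, else a message."""
    if not os.path.isfile(path):
        return "file not found"
    with open(path, "rb") as handle:
        head = handle.read(2)
    if head == GZIP_MAGIC:
        return "file is gzip-compressed; decompress it first"
    if not head.startswith(b">"):
        return "file does not start with '>'"
    return None


# Demonstration on throwaway files with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "genome.fasta")
    with open(good, "w") as out:
        out.write(">scaffold1\nACGT\n")
    bad = os.path.join(tmp, "transcriptome.fasta")
    with open(bad, "wb") as out:
        out.write(GZIP_MAGIC + b"\x08\x00")
    for path in (good, bad):
        print(os.path.basename(path), check_fasta(path) or "ok")
```

In practice you would call check_fasta on the genome and transcriptome files placed in 03_data.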

Results

Once the pipeline has completed, all result files are found in the 05_results folder.

  • A valid gff3 annotation file
  • A transcriptome annotation .tsv table
  • A genome annotation .tsv table
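
As a quick sanity check of the gff3 result, one can count how many features of each type it contains. The sketch below is a minimal Python example that relies only on the generic GFF3 layout (nine tab-separated columns, with the feature type in the third); the example records are made up and do not represent actual GAWN output.

```python
import collections
import io


def count_feature_types(handle):
    """Count GFF3 records per feature type (third column)."""
    counts = collections.Counter()
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature records
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # ignore malformed lines
        counts[columns[2]] += 1
    return counts


# Made-up example records, not real GAWN output.
example = (
    "##gff-version 3\n"
    "scaf1\tgmap\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "scaf1\tgmap\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "scaf1\tgmap\texon\t100\t400\t.\t+\t.\tParent=mrna1\n"
    "scaf1\tgmap\texon\t600\t900\t.\t+\t.\tParent=mrna1\n"
)
counts = count_feature_types(io.StringIO(example))
for feature_type, number in sorted(counts.items()):
    print(feature_type, number)
```

To run it on a real file, open the gff3 file from 05_results and pass the file handle to count_feature_types.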

License

CC share-alike

GAWN by Eric Normandeau is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/enormandeau/gawn.
