
wurmlab / Genevalidator

Licence: agpl-3.0

GeneValidator: Identify problems with predicted genes

Programming Languages

ruby

Labels

bioinformatics

Projects that are alternatives of or similar to Genevalidator

16gt

Simultaneous detection of SNPs and Indels using a 16-genotype probabilistic model

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences

Stars: ✭ 912 (+2582.35%)

Mutual labels: bioinformatics

Cytometry Clustering Comparison

R scripts to reproduce analyses in our paper comparing clustering methods for high-dimensional cytometry data

Stars: ✭ 30 (-11.76%)

Mutual labels: bioinformatics

Taxadb

🐣 locally query the ncbi taxonomy

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Uncurl python

UNCURL is a tool for single cell RNA-seq data analysis.

Stars: ✭ 13 (-61.76%)

Mutual labels: bioinformatics

Sevenbridges R

Seven Bridges API Client, CWL Schema, Meta Schema, and SDK Helper in R

Stars: ✭ 27 (-20.59%)

Mutual labels: bioinformatics

Tiledb Vcf

Efficient variant-call data storage and retrieval library using the TileDB storage library.

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Metasra Pipeline

MetaSRA: normalized sample-specific metadata for the Sequence Read Archive

Stars: ✭ 33 (-2.94%)

Mutual labels: bioinformatics

Awesome Sequencing Tech Papers

A collection of publications on comparison of high-throughput sequencing technologies.

Stars: ✭ 21 (-38.24%)

Mutual labels: bioinformatics

Sv Callers

Snakemake-based workflow for detecting structural variants in WGS data

Stars: ✭ 28 (-17.65%)

Mutual labels: bioinformatics

Nonpareil

Estimate metagenomic coverage and sequence diversity

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Scanpy

Single-Cell Analysis in Python. Scales to >1M cells.

Stars: ✭ 858 (+2423.53%)

Mutual labels: bioinformatics

Workshop

Weekly research group seminar

Stars: ✭ 28 (-17.65%)

Mutual labels: bioinformatics

Pretzel

Javascript full-stack framework for Big Data visualisation and analysis

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Protr

Comprehensive toolkit for generating various numerical features of protein sequences

Stars: ✭ 30 (-11.76%)

Mutual labels: bioinformatics

Metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping

Stars: ✭ 26 (-23.53%)

Mutual labels: bioinformatics

Vdjviz

A lightweight immune repertoire browser

Stars: ✭ 21 (-38.24%)

Mutual labels: bioinformatics

Bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)

Stars: ✭ 970 (+2752.94%)

Mutual labels: bioinformatics

Fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)

Stars: ✭ 966 (+2741.18%)

Mutual labels: bioinformatics

Rasusa

Randomly subsample sequencing reads to a specified coverage

Stars: ✭ 28 (-17.65%)

Mutual labels: bioinformatics

GeneValidator - Identify problems with predicted genes

Introduction

GeneValidator helps identify problems with gene predictions and provides useful information extracted from analysing orthologs in BLAST databases. The results can be used by biocurators and researchers who need accurate gene predictions.

If you would like to use GeneValidator on a few sequences, see our online GeneValidator Web App - http://genevalidator.sbcs.qmul.ac.uk.

If you use GeneValidator in your work, please cite us as follows:

Dragan M^‡, Moghul I^‡, Priyam A, Bustos C & Wurm Y. 2016. GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics, doi: 10.1093/bioinformatics/btw015.

Validations

GeneValidator runs the following validations on all input sequences:

Length: GeneValidator compares the length of the query sequence to the lengths of the most significant BLAST hits using hierarchical clustering and a rank test. This can suggest that the query is too short or too long. Graphs are dynamically produced for this validation.
Coverage: GeneValidator determines whether hit regions match the query sequence more than once using a Wilcoxon test. Significance suggests that the query includes duplicated regions (e.g., resulting from merging of tandem gene duplication).
Conserved Regions: GeneValidator performs a multiple alignment of the ten most significant BLAST hits, derives a Position Specific Scoring Matrix profile, and aligns this profile to the query. The results identify potentially missing or extra regions. Graphs are dynamically produced for this validation.
Different genes: We expect the query sequence to encode a single protein-coding gene. GeneValidator first determines whether the BLAST HSPs map to multiple regions of the query by testing for deviation from unimodality of HSP start and stop coordinates. If this is the case, GeneValidator performs a linear regression between HSP start and stop coordinates (each datapoint is weighted proportionally to the significance of the corresponding HSP). We empirically determined that regression slopes of 0.4 to 1.2 indicate that the query prediction combines two different genes. Graphs are dynamically produced for this validation.
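
The regression step of the "Different genes" validation can be illustrated with a short sketch. This is not GeneValidator's own code (the tool is written in Ruby). It assumes an ordinary weighted least-squares fit with an intercept, with stop coordinates regressed on start coordinates, and the function names `weighted_slope` and `suggests_merged_genes` are hypothetical. Only the 0.4 to 1.2 slope range comes from the description above.

```python
def weighted_slope(starts, stops, weights):
    """Slope of a weighted least-squares line fitted to (start, stop) points.

    Assumption: each weight is proportional to the significance of its HSP.
    """
    if not (len(starts) == len(stops) == len(weights)) or len(starts) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, starts)) / total
    mean_y = sum(w * y for w, y in zip(weights, stops)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, starts))
    sxy = sum(
        w * (x - mean_x) * (y - mean_y)
        for w, x, y in zip(weights, starts, stops)
    )
    if sxx == 0:
        raise ValueError("all start coordinates are identical")
    return sxy / sxx


def suggests_merged_genes(slope, low=0.4, high=1.2):
    """True when the slope falls in the empirically determined range."""
    return low <= slope <= high
```

For instance, with made-up HSP coordinates where every stop lies 100 positions after its start, the slope is 1 and `suggests_merged_genes` returns `True`.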

GeneValidator also runs two further validations on cDNA sequences:

Ab initio Open Reading Frame (ORF): We expect a single major ORF. More than one major ORF can occur when there is a frameshift, a retained intron, or merged genes.
Similarity-based ORFs: We expect all BLAST hits to align within a single ORF. This test is more sensitive than the previous one when a query has many BLAST hits.
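
To make the idea of counting open reading frames concrete, here is a minimal sketch. It is an illustration only and not how GeneValidator finds ORFs: it scans just the three forward frames, treats any stop-free stretch of codons as an open frame, and uses a hypothetical `min_codons` threshold in place of whatever GeneValidator considers a "major" ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(seq, min_codons=2):
    """Return (frame, start, end) for stop-free stretches in forward frames.

    start and end are 0-based nucleotide offsets, with end exclusive.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # a stop codon closes the current stretch
                if (pos - run_start) // 3 >= min_codons:
                    found.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # stretch that reaches the end of the sequence without a stop codon
        if (pos - run_start) // 3 >= min_codons:
            found.append((frame, run_start, pos))
    return found
```

A query for which a single frame yields several long stretches separated by stop codons, or for which long stretches appear in different frames, is the kind of case the validations above are designed to flag.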

Each analysis of each query returns a binary result (good vs. potential problem) according to p-value or an empirically determined cutoff. The results for each query are combined into an overall quality score from 0 to 100.
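
The exact formula used to combine the binary results is not given here. As a rough illustration only, the sketch below assumes the simplest possible scheme: the score is the percentage of validations that returned "good". The function name `overall_score` is hypothetical, and GeneValidator's actual scoring may differ.

```python
def overall_score(results):
    """Percentage of passed validations, rounded to a whole number.

    results: list of booleans, True where a validation reported "good".
    """
    if not results:
        raise ValueError("no validation results")
    return round(100 * sum(results) / len(results))
```

Under this assumption, a query that passes three of four validations would score 75.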

Installation

Run the following in your terminal:

# Installs in a folder called `genevalidator` in your current folder
sh -c "$(curl -fsSL https://install-genevalidator.wurmlab.com)"

# The above link redirects to https://raw.githubusercontent.com/wurmlab/genevalidator/master/install.sh

# To install in a different location, add the path to the end of the above command

Alternatively, the standalone package can be manually downloaded and installed from our releases page.

Usage

GeneValidator can be run immediately after it has been installed. The example below shows how to run GeneValidator (GV) on the included exemplar data.

# assuming that installed genevalidator directory is the current working directory
genevalidator --db genevalidator/blast_db/swissprot --num_threads 4 genevalidator/exemplar_data/protein_data.fa
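
If you want to launch the same command from a script, the sketch below wraps it with Python's `subprocess` module. It uses only the `--db` and `--num_threads` options shown above. The helper names `build_command` and `run_genevalidator` are hypothetical, and the sketch assumes the `genevalidator` executable is on your `PATH`.

```python
import subprocess


def build_command(fasta, db, num_threads=4):
    """Build the argument list for the command-line call shown above."""
    return [
        "genevalidator",
        "--db", str(db),
        "--num_threads", str(num_threads),
        str(fasta),
    ]


def run_genevalidator(fasta, db, num_threads=4):
    """Run GeneValidator and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_command(fasta, db, num_threads), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.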

Other command line arguments can be viewed by running the following command.

# This should show the GeneValidator CLI help text
genevalidator -h

It is also possible to run GeneValidator as a web application. This graphical interface can be launched by running the command below.

The path to a directory containing one or more BLAST databases is required. By default, this points to the blastdb directory in the GeneValidator installation, which contains the SwissProt BLAST database.

genevalidator app --database_dir genevalidator/blastdb --num_threads 4

This will open the default browser at http://localhost:5678

Other GeneValidator subcommands include:

# This is for downloading pre-formatted BLAST databases from NCBI
genevalidator ncbi-blast-dbs -h

# This is for creating a local web server for viewing the HTML results.
# This is necessary to view the HTML results in certain browsers such as Chrome.
# The exact command to run will be shown when opening the HTML results in a browser.
genevalidator serve -h

BLAST databases

GeneValidator's default database is the included Swiss-Prot database, which is used if a BLAST database is not specified. Alternative BLAST databases (such as Uniref50 or the NCBI non-redundant database) can also be used once they have been downloaded and installed. More information on how to download alternative BLAST databases and how to pass BLAST output files to GV can be found here.

Output

The output produced by GeneValidator is presented in five formats.

HTML Output

Firstly, the output is produced as a colourful HTML file. This file is titled 'results.html' (found in the 'html' folder) and can be opened in a web browser. It contains all the results in an easy-to-view manner with graphical visualisations. See exemplar HTML output here (Amino acid sequences input) and here (Nucleotide sequences input).

CSV Output

The output table is also presented in CSV format for programmatic or spreadsheet (e.g. Microsoft Excel) access. See exemplar CSV output here (Amino acid sequences input) and here (Nucleotide sequences input).

Summary CSV

The summary CSV file is a two-column CSV file providing summary statistics on the GV analysis. See exemplar summary CSV output here (Amino acid sequences input) and here (Nucleotide sequences input).
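
Because the summary file has only two columns, it maps naturally onto a dictionary. The sketch below reads such a file with Python's standard `csv` module. It assumes every row is a label followed by a value. The row labels are not documented here, so inspect a real file before relying on particular keys. The function name `read_two_column_csv` is hypothetical.

```python
import csv


def read_two_column_csv(handle):
    """Read label,value rows from an open file into a dictionary."""
    summary = {}
    for row in csv.reader(handle):
        if len(row) >= 2:  # skip blank or malformed lines
            summary[row[0].strip()] = row[1].strip()
    return summary
```

Typical use would be `with open(path, newline="") as handle: summary = read_two_column_csv(handle)`, where `path` points at the summary CSV produced by your own run. Values are returned as strings and need converting if you want numbers.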

Terminal Output

A tabular summary of the results is also printed to the terminal to provide quick feedback. The terminal output can be piped to tools like awk and sed or redirected to a file for further processing.

JSON Output

The output is also produced in JSON. GeneValidator can re-generate results from any JSON file that it previously generated (or from JSON files derived from them). This means that you can use the JSON file in your own analysis pipelines and then use GeneValidator to produce the HTML output for the processed JSON file. See exemplar JSON output here (Amino acid sequences input) and here (Nucleotide sequences input).

Exemplar JSON output usage

JSON output can be filtered or processed in a variety of ways using standard tools, such as the streamable JSON command line program, or jq. The examples below make use of jq 1.6, which is bundled with GeneValidator.

# Extract sequences that have an overall score of 100
# (the surrounding [ ] keeps the output a single JSON array, like the input)
$ jq '[ .[] | select(.overall_score == 100) ]' INPUT_JSON_FILE > OUTPUT_JSON_FILE

# Extract sequences that have an overall score of over 70
$ jq '[ .[] | select(.overall_score > 70) ]' INPUT_JSON_FILE > OUTPUT_JSON_FILE

# Extract sequences that have more than 50 hits
$ jq '[ .[] | select(.no_hits > 50) ]' INPUT_JSON_FILE > OUTPUT_JSON_FILE

# Sort the JSON based on the overall score (ascending - 0 to 100)
$ jq 'sort_by(.overall_score)' INPUT_JSON_FILE > OUTPUT_JSON_FILE
# Sort the JSON based on the overall score (descending - 100 to 0)
$ jq 'sort_by(- .overall_score)' INPUT_JSON_FILE > OUTPUT_JSON_FILE

# Remove the large graphs objects (note these Graphs objects are required if you wish to pass the json back into GV using the `-j` option - see below)
$ jq --raw-output '[ .[] | del(.validations[].graphs) ]' INPUT_JSON_FILE > OUTPUT_JSON_FILE

The subsetted/sorted JSON file can then be passed back into GeneValidator (using the -j command line argument) to generate the HTML report for the sequences in the JSON file.

genevalidator -j OUTPUT_JSON_FILE
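
The same filtering and sorting can be done without jq. The sketch below mirrors the jq filters above in Python, under the assumption (implied by those filters) that the JSON file is a top-level array of objects with an `overall_score` field. The function names are hypothetical.

```python
import json


def load_results(path):
    """Load a GeneValidator JSON file (assumed to be an array of objects)."""
    with open(path) as handle:
        return json.load(handle)


def with_score_above(results, threshold):
    """Keep entries whose overall_score is strictly greater than threshold."""
    return [r for r in results if r["overall_score"] > threshold]


def sort_by_score(results, descending=False):
    """Return a new list ordered by overall_score."""
    return sorted(results, key=lambda r: r["overall_score"], reverse=descending)


def save_results(results, path):
    """Write the entries back out as a single JSON array."""
    with open(path, "w") as handle:
        json.dump(results, handle)
```

Because these helpers leave each entry untouched, including its graphs objects, a file written with `save_results` keeps the data that the `-j` option needs.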

Stars: ✭ 34
