
brentp / echtvar

Licence: MIT license
echt rapid variant annotation and filtering



Echtvar: Really, truly rapid variant annotation and filtering


Echtvar enables rapid annotation of variants with huge population datasets and supports filtering on the annotated values. It divides the genome into chunks of 1<<20 (~1 million) bases and encodes each variant into a 32-bit integer (with a supplemental table for variants that can't fit due to large REF and/or ALT alleles). It uses the zip format, delta encoding, and integer compression to create a compact, searchable format for any integer, float, or low-cardinality string columns selected from the population file.
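To make the chunking and packing idea concrete, here is a minimal Python sketch. The bit layout (20 bits of within-chunk offset, 2 bits each for REF and ALT length, 8 bits for up to four 2-bit bases) is an assumption chosen for illustration and may not match echtvar's actual on-disk layout; all function names are hypothetical.

```python
CHUNK_BITS = 20                 # 1 << 20 bases per chunk, as described above
CHUNK_SIZE = 1 << CHUNK_BITS
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base (assumed)


def chunk_and_offset(pos):
    """Split a 0-based position into (chunk index, offset within the chunk)."""
    return pos >> CHUNK_BITS, pos & (CHUNK_SIZE - 1)


def encode_variant(pos, ref, alt):
    """Pack a variant into a 32-bit integer, or return None if it cannot fit.

    Assumed layout, high to low bits:
    offset (20) | REF length (2) | ALT length (2) | bases (8).
    """
    bases = ref + alt
    if not (1 <= len(ref) <= 3 and 1 <= len(alt) <= 3 and len(bases) <= 4):
        return None  # too long: would go to the supplemental table
    if any(b not in BASE_CODES for b in bases):
        return None  # non-ACGT allele: also handled outside the 32-bit form
    _, offset = chunk_and_offset(pos)
    seq = 0
    for b in bases:
        seq = (seq << 2) | BASE_CODES[b]
    return (offset << 12) | (len(ref) << 10) | (len(alt) << 8) | seq


def delta_encode(sorted_values):
    """Store each value as the difference from its predecessor."""
    prev = 0
    out = []
    for v in sorted_values:
        out.append(v - prev)
        prev = v
    return out
```

Because the encoded variants within a chunk are sorted, their deltas are small numbers, which is what makes the subsequent integer compression effective.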

Once created, an echtvar (zip) file can be used to annotate variants in a VCF (or BCF) file at a rate of >1 million variants per second (most of the time is spent reading and writing VCF/BCF, so this number depends on the particular file).

A filter expression can be applied so that only variants that meet the expression are written. Since echtvar is so fast, writing the output is a bottleneck so filtering can actually increase the speed.

Read more at the why of echtvar.

Getting started.

The latest release page (https://github.com/brentp/echtvar/releases/latest) provides a static binary and pre-encoded echtvar files for gnomAD v3.1.2 (hg38), along with exact instructions to get started with the static binary.

⬇️Download instructions for linux

The linux binary is available via:

wget -O ~/bin/echtvar https://github.com/brentp/echtvar/releases/latest/download/echtvar \
    && chmod +x ~/bin/echtvar \
    && ~/bin/echtvar

Users can make their own echtvar archives with echtvar encode, and pre-made archives for gnomAD version 3.1.2 are here

Rust users can build on linux with:

cargo build --release --target x86_64-unknown-linux-gnu

usage

encode

Make (encode) a new echtvar file. This is usually done once (or the file is downloaded from those provided on the Releases page), and the file can then be re-used for the annotation (echtvar anno) step with each new query file. Note that input VCFs must be decomposed.

# conf.json defines the columns to pull from the population VCF(s) and how to
# name and encode them.
# The input population VCF(s) can be split by chromosome or all in a single file.
echtvar \
   encode \
   gnomad.v3.1.2.echtvar.zip \
   conf.json \
   $input_population_vcf[s]

See below for a description of the json file that defines which columns are pulled from the population VCF.

annotate

Annotate a decomposed (and normalized) VCF with an echtvar file and only output variants where gnomad_popmax_af from the echtvar file is < 0.01. Note that multiple echtvar files can be specified. The -i expression is optional and can be omitted to output all variants.

echtvar anno \
   -e gnomad.v3.1.2.echtvar.v2.zip \
   -e dbsnp.echtvar.zip \
   -i 'gnomad_popmax_af < 0.01' \
   $cohort.input.bcf \
   $cohort.echtvar-annotated.filtered.bcf

Configuration File for Encode

When running echtvar encode, a json5 file (json with comments and other nice features) determines which columns are pulled from the input VCF and how they are stored.

A simple example is to pull a single integer field and give it a new name (alias):

[{"field": "AC", "alias": "gnomad_AC"}]

This will extract the "AC" field from the INFO column and label it as "gnomad_AC" when the echtvar file is later used to annotate a VCF. Note that it's important to give a descriptive, unique prefix like "gnomad_" so as not to collide with fields already in the query VCF.
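As a rough illustration of what this field-to-alias mapping means, the Python sketch below applies the one-entry config to a VCF INFO string. It is not echtvar's implementation; the helper names `parse_info` and `extract` are hypothetical, and values are left as strings for simplicity.

```python
import json

# The simple example above is also valid plain JSON, so the stdlib parser works.
config = json.loads('[{"field": "AC", "alias": "gnomad_AC"}]')


def parse_info(info):
    """Parse a VCF INFO string such as 'AC=5;AN=100' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flags carry no value
    return fields


def extract(info, config):
    """Return {alias: value} for each configured field present in INFO."""
    parsed = parse_info(info)
    return {c["alias"]: parsed[c["field"]] for c in config if c["field"] in parsed}


# Only AC is configured, so AN is ignored and AC is renamed.
print(extract("AC=5;AN=100", config))
```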

⬇️Expand this section for detail on additional fields, including float and string types
[
    {"field": "AC", "alias": "gnomad_AC"},
    // this JSON file is json5 and so can have comments
    // the missing value will default to -1, but the value -2147483648 will
    // result in '.' as it is the missing value for VCF.
    {"field": "AN", "alias": "gnomad_AN", missing_value: -2147483648},
    {
        field: "AF",
        alias: "gnomad_AF",
        missing_value: -1,
        // since all values (including floats) are stored as integers, echtvar internally converts
        // any float to an integer by multiplying by `multiplier`.
        // higher values give better precision and worse compression.
        // upon annotation, the score is divided by multiplier to give a number close to the original float.
        multiplier: 2000000,
    },
    // echtvar will save strings as integers along with a lookup. this can work for fields with a low cardinality.
    {"field": "string_field", "alias": "gnomad_string_field", missing_string: "UNKNOWN"},
    // "FILTER" is a special case that indicates that echtvar should extract the FILTER column from the annotation vcf.
    {"field": "FILTER", "alias": "gnomad_filter"},
]

The above file will extract 5 fields, but the user can choose as many as they like when encoding. All fields in an echtvar file will be added (with the given alias) to any VCF it is used to annotate.
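The `multiplier` and string-lookup comments in the config can be illustrated with a short Python sketch. This is only a model of the idea, not echtvar's code: rounding to the nearest integer and reserving code 0 for the missing string are assumptions, and the function names are hypothetical.

```python
MULTIPLIER = 2000000  # same value as `multiplier` in the config above


def float_to_stored(value, multiplier=MULTIPLIER):
    """Convert a float to the integer that would be stored (rounding assumed)."""
    return round(value * multiplier)


def stored_to_float(stored, multiplier=MULTIPLIER):
    """Recover an approximation of the original float at annotation time."""
    return stored / multiplier


def build_lookup(strings, missing_string="UNKNOWN"):
    """Map each distinct string to a small integer code plus a lookup table."""
    lookup = [missing_string]  # code 0 reserved for missing (assumed)
    codes = []
    for s in strings:
        if s not in lookup:
            lookup.append(s)
        codes.append(lookup.index(s))
    return lookup, codes
```

With a multiplier of 2000000, the round trip loses at most 1/2000000 of absolute precision under these assumptions. A larger multiplier lowers that bound but produces larger integers, which compress less well.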

Other examples are available here

Expressions

An optional expression will determine which variants are written. It can utilize any (and only) integer or float fields present in the echtvar file (not those present in the query VCF). An example could be:

-i 'gnomad_af < 0.01 && gnomad_nhomalts < 10'

The expressions are enabled by fasteval with supported syntax detailed here.

In brief, the normal operators (&&, ||, +, -, *, /, <, <=, >, >=), groupings with parentheses, and so on are supported and can be used to craft an expression that returns true or false as above.
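For readers less familiar with this expression syntax, the example filter can be read as the Python predicate below. This only illustrates the semantics; echtvar evaluates the expression itself with fasteval, and the `keep_variant` name and the sample records are hypothetical.

```python
def keep_variant(values):
    """Python reading of: gnomad_af < 0.01 && gnomad_nhomalts < 10"""
    return values["gnomad_af"] < 0.01 and values["gnomad_nhomalts"] < 10


# Hypothetical annotation values for three variants.
variants = [
    {"gnomad_af": 0.001, "gnomad_nhomalts": 2},    # rare, few homozygotes
    {"gnomad_af": 0.2, "gnomad_nhomalts": 2},      # too common
    {"gnomad_af": 0.001, "gnomad_nhomalts": 50},   # too many homozygotes
]

# Only variants for which the expression is true would be written.
kept = [v for v in variants if keep_variant(v)]
```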

References and Acknowledgements

Without these (and other) critical libraries, echtvar would not exist.

echtvar is developed in the Jeroen De Ridder lab
