quinlan-lab / pathoscore

Licence: MIT license
pathoscore evaluates variant pathogenicity tools and scores.

pathoscore

Evaluating scores is hard because the logic can be circular, and because benign and pathogenic sets are hard to curate and evaluate.

pathoscore provides software and datasets that facilitate applying and evaluating pathogenicity scores.

The sections below describe the tools.

Annotate

Annotate a VCF with some scores (which can be bed or VCF files). Note that this tool is a simple wrapper around vcfanno, so a user can instead run vcfanno directly.

python pathoscore.py annotate \
    --scores exac-ccrs.bed.gz:exac_ccr:14:max \
    --scores mpc.regions.clean.sorted.bed.gz:mpc_regions:5:max \
    --exclude /data/gemini_install/data/gemini_data/ExAC.r0.3.sites.vep.tidy.vcf.gz \
    --conf combined-score.conf \
    testing-denovos.vcf.gz

The individual flags are described here:

scores

The scores format is path:name:column:op where:

  • path is the score file (bed or VCF).

  • name becomes the new name in the INFO field.

  • column indicates the column number (or INFO name) to pull from the score file.

  • op is a vcfanno operation.

  • multiple annotations from the same file can be given as comma-separated lists of names, columns and ops, like this:

python pathoscore.py annotate --prefix benign \
 --scores score-sets/GRCh37/aloft/aloft.txt.gz:aloft_het,aloft_lof,aloft_rec:5,6,7:max,max,max \
 truth-sets/GRCh37/clinvar/clinvar-benign.20170905.vcf.gz
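To make the path:name:column:op format concrete, here is a minimal Python sketch of how such a string could be split into one entry per annotation. It is an illustration only, not pathoscore's own parser; the `ScoreSpec` and `parse_scores` names are hypothetical.

```python
from typing import List, NamedTuple


class ScoreSpec(NamedTuple):
    path: str    # score file (bed or VCF)
    name: str    # new INFO field name
    column: str  # column number or INFO name in the score file
    op: str      # vcfanno operation, e.g. "max"


def parse_scores(spec: str) -> List[ScoreSpec]:
    """Split a path:name:column:op string into one ScoreSpec per annotation."""
    # Split from the right so that a path containing ':' stays intact.
    parts = spec.rsplit(":", 3)
    if len(parts) != 4:
        raise ValueError("expected path:name:column:op, got %r" % spec)
    path, names, columns, ops = parts
    names, columns, ops = names.split(","), columns.split(","), ops.split(",")
    if not (len(names) == len(columns) == len(ops)):
        raise ValueError("name, column and op lists must have equal length")
    return [ScoreSpec(path, n, c, o) for n, c, o in zip(names, columns, ops)]
```

Under these assumptions, the aloft example above would yield three entries that share one path, with columns 5, 6 and 7.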

exclude

This can be a population VCF that is used to filter would-be pathogenic variants (on the assumption that common variants are unlikely to be pathogenic). It can also be a set of regions to exclude. For user convenience, we curated gene sets that the user can filter on, such as autosomal dominant genes from Berg et al. (2013) and haploinsufficient genes from Dang et al. (2008).

conf

An optional vcfanno conf file, so that users can specify exactly how to annotate if they feel comfortable doing so.

This can also be used to specify vcfanno [[postannotation]] blocks, for example, to combine scores.

An example conf to combine 2 scores looks like:

[[postannotation]]
name="combined"
op="lua:exac_ccr+10*cadd"
fields=["exac_ccr", "cadd"]
type="Float"

Evaluate

python pathoscore.py evaluate \
    -s MPC \
    -s exac_ccr \
    -i mpc_regions \
    -s combined \
    --goi listofgenesofinterest \
    pathogenic.vcf.gz \
    benign.vcf.gz

This will take the output(s) from annotate and create ROC curves and score distribution plots. It assumes that the first VCF contains pathogenic variants and the second contains benign variants. It uses the columns specified via -s and -i as the scores.

-i indicates that lower scores are more constrained, whereas -s is for fields where higher scores are more constrained.

--goi provides a newline-delimited file of genes of interest for a clinical utility calculation. More information is provided in the wiki.

Output

An example ROC curve for the ClinVar truth-set looks like this:

[Figure: ROC curve]

The point on each curve in the plot shows the maximum J statistic (Youden's J, TPR - FPR), which can be summarized as the point on the curve where the vertical distance to the Y=X line is maximized. J has its highest possible value at an FPR of 0, so there is an implicit penalty for having a high TPR at a high-ish FPR.
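The ROC points and the maximum J statistic can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code pathoscore uses for its plots: a variant is called pathogenic when its score is at or above the threshold, and the hypothetical `invert` argument mimics -i by negating scores so that lower raw values count as more constrained.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (threshold, FPR, TPR)


def roc_points(pathogenic: Sequence[float], benign: Sequence[float],
               invert: bool = False) -> List[Point]:
    """Return (threshold, FPR, TPR) using each distinct score as a cutoff."""
    sign = -1.0 if invert else 1.0
    patho = [sign * s for s in pathogenic]
    ben = [sign * s for s in benign]
    points = []
    for t in sorted(set(patho + ben), reverse=True):
        tpr = sum(s >= t for s in patho) / len(patho)
        fpr = sum(s >= t for s in ben) / len(ben)
        # Report the threshold on the original (un-negated) scale.
        points.append((sign * t, fpr, tpr))
    return points


def max_j(points: List[Point]) -> Point:
    """Pick the point that maximises J = TPR - FPR (height above Y=X)."""
    return max(points, key=lambda p: p[2] - p[1])
```

If several thresholds tie on J, `max` returns the first one encountered, which here is the highest-scoring threshold.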

We also report the full distribution of J statistics:

[Figure: distribution of J statistics]

Finally, we report the proportion of benign and pathogenic variants scored in a truth-set:

[Figure: proportion of variants scored]

These plots, along with the score-distributions for each method for pathogenic and benign, are aggregated into a single HTML report.
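As a rough illustration of the "proportion scored" figure, the sketch below counts how many variants carry a given score in their INFO string. It assumes INFO strings have already been read from the annotated VCF; the `info_value` and `proportion_scored` helpers are hypothetical and are not part of pathoscore.

```python
from typing import List, Optional


def info_value(info: str, key: str) -> Optional[str]:
    """Return the value of key in a VCF INFO string, or None if absent."""
    for item in info.split(";"):
        k, _, v = item.partition("=")
        if k == key and v:
            return v
    return None


def proportion_scored(infos: List[str], key: str) -> float:
    """Fraction of variants whose INFO field contains a value for key."""
    if not infos:
        return 0.0
    return sum(info_value(i, key) is not None for i in infos) / len(infos)
```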

Install

Download a vcfanno binary for your system and make it available as vcfanno on your $PATH.

Then run:

pip install -r requirements.txt

Then you should be able to run the evaluation scripts.
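If the scripts cannot find vcfanno, a quick check like the following can confirm whether it is visible on $PATH. This is a small convenience sketch using the standard library; `find_vcfanno` is a hypothetical helper, not something shipped with pathoscore.

```python
import shutil
from typing import Optional


def find_vcfanno(name: str = "vcfanno") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on $PATH."""
    return shutil.which(name)


path = find_vcfanno()
if path is None:
    print("vcfanno not found on $PATH")
else:
    print("using vcfanno at", path)
```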

Truth Sets

Part of pathoscore is to provide curated truth sets that can be used for evaluation.

These are kept in truth-sets/. Each set has a benign and/or a pathogenic set.

Pull-requests for recipes that add new truth sets are welcome. These should include a make.sh script that, when run, will pull from the original data source and make a benign and/or pathogenic VCF that is bgzipped, tabixed, and made as small as possible (see the clinvar example for how to remove unneeded fields from the INFO field).

All truth-sets should be annotated with bcftools csq so that it's possible to choose to score only functional variants.

Currently we have:

ClinVar

  • ClinVar pathogenics are either Pathogenic or Likely-Pathogenic and variants with uncertainty are removed.
  • ClinVar benigns are either Benign or Likely-Benign and variants with uncertainty are removed.
  • ClinVar variants where there is an SSR field are removed because they are suspected false positives due to paralogy or computational/sequencing error.
  • We created a version of ClinVar benigns that incorporates gnomAD variants to match the much larger count of ClinVar pathogenics, with the intent of creating an equal-sized set of pathogenics and benigns. See the README for those sets for more info.

Samocha

These are from Kaitlin Samocha's paper on missense constraint.

  • Benigns are labelled as control in her source file.
  • Pathogenics are anything other than control.

Filtering Pathogenic Variants on Allele Frequency

Some alleged pathogenic variants may appear at high allele frequencies in population databases, and some users may understandably find those variants suspect. If you would like to filter out variants on allele frequency in a population set, an example conf file called af.conf is provided in the repo. If you have additional filtering parameters you'd like to specify, you can also use a conf file for that, as detailed in vcfanno's repo.

You can then run the pathoscore script as below:

python pathoscore.py annotate \
    --scores score-sets/GRCh37/MPC/mpc.txt.gz:MPC:5:max \
    --scores score-sets/GRCh37/REVEL/revel.txt.gz:REVEL:7:max \
    truth-sets/GRCh37/samocha/samocha.pathogenic.vcf.gz \
    --prefix neurodev \
    --conf af.conf

Make sure that you don't reference a file more than once in the conf file; write everything you want to do for each file in a single list. Additionally, don't use flags like --scores or --exclude on a file that is already referenced in the conf file you provide to pathoscore. It will not work.
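The idea behind allele-frequency filtering can be sketched in Python as below. This is not the content of af.conf and not how vcfanno applies it; the INFO key `af_population` and the 0.001 cutoff are hypothetical placeholders chosen only for illustration.

```python
def passes_af(info: str, af_key: str = "af_population",
              max_af: float = 0.001) -> bool:
    """Keep a variant unless its population allele frequency exceeds max_af.

    af_key and max_af are illustrative assumptions, not pathoscore defaults.
    """
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == af_key:
            try:
                return float(value) <= max_af
            except ValueError:
                # Value is not a single number (e.g. "."): treat as missing.
                return True
    # No frequency annotation: the variant was not seen in the population set.
    return True
```

Variants without a frequency annotation are kept here, on the reasoning that absence from the population set means the variant is rare; a stricter filter could make the opposite choice.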

For user convenience, under scripts/gnomad, there are make scripts for generating vt-normalized, decomposed and BCSQ-annotated ExAC v1 and gnomAD VCF files, so that you can filter by allele frequency in those population datasets.
