brentp / rare-disease-wf

Licence: MIT License
(WIP) best-practices workflow for rare disease

Programming Languages: Nextflow, HTML, SystemVerilog, Dockerfile, Python

For rare disease, the best practices and the expected number of candidate variants for each inheritance mode are known. The actual filtering is easily done with a tool like slivar. This is a necessary first step, but it has the following limitations:

  1. it leaves an analyst or clinician with choices on how to prioritize the 10-15 candidate variants, or ~100 for autosomal (non de novo) dominant.
    • This is quite a small number, but the prioritization after this is highly variable across tools and analysts.
  2. it is limited to text/spreadsheet output
  3. it assumes a high-quality, jointly-called VCF is already available
  4. it leaves the analyst with the chore of getting IGV set up, and browsing each candidate for each family.
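
To make the filtering step mentioned above concrete, here is a minimal Python sketch of two inheritance-mode checks on a single trio. It is not slivar's implementation and does not use slivar's expression syntax. The `Trio` type and the function names are hypothetical, and genotypes are simplified to alternate-allele counts. Real filtering would also apply depth, allele-balance and population-frequency thresholds, which are omitted here.

```python
from typing import NamedTuple


class Trio(NamedTuple):
    """Alternate-allele counts (0, 1 or 2) for child, father and mother."""
    kid: int
    dad: int
    mom: int


def is_de_novo(gt: Trio) -> bool:
    # Child is heterozygous while both parents are homozygous reference.
    return gt.kid == 1 and gt.dad == 0 and gt.mom == 0


def is_autosomal_recessive(gt: Trio) -> bool:
    # Child is homozygous alternate and each parent is a heterozygous carrier.
    return gt.kid == 2 and gt.dad == 1 and gt.mom == 1


def classify(gt: Trio) -> str:
    """Return a label for the first matching pattern, or 'other'."""
    if is_de_novo(gt):
        return "de_novo"
    if is_autosomal_recessive(gt):
        return "recessive"
    return "other"


if __name__ == "__main__":
    for gt in (Trio(1, 0, 0), Trio(2, 1, 1), Trio(1, 1, 0)):
        print(gt, classify(gt))
```

Genotype patterns alone tend to leave many false positives, which is why quality and frequency filters matter in practice.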

Quickstart

Note that it is early days for the project. It will produce high-quality SNP/indel candidates, but you may need experience with Nextflow to run it easily.

This project currently has a workflow that can be run as:

# Notes on the arguments (kept above the command so it can be pasted as-is):
#   -config      a starting config is included in this repo. adjust from there.
#   --xams       NOTE that this is a quoted string glob
#   --ped        see: https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format
#   --gff        e.g. from: ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
#   --slivarzip  from: https://github.com/brentp/slivar#gnotation-files
nextflow run -resume -profile slurm rare-disease.nf \
    -config nextflow.config \
    --xams "/path/to/*/*.cram" \
    --ped $pedigree_file \
    --fasta $reference_fasta \
    --gff $gff \
    --slivarzip gnomad.hg38.zip \
    --cohort_name my_rare_disease
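
The --ped argument expects a pedigree file in the six-column PED format linked above: family ID, sample ID, paternal ID, maternal ID, sex (1 = male, 2 = female) and phenotype (1 = unaffected, 2 = affected), with 0 for a missing parent. As a rough illustration, the Python sketch below writes a single-trio pedigree and runs a basic sanity check. The sample names and the `write_ped` and `check_ped` helpers are hypothetical and are not part of this repository. The check that parents appear in the file is a convenience, not a requirement of the format.

```python
import csv
import io

# Hypothetical trio: an affected male child with two unaffected parents.
TRIO = [
    ("fam1", "kid1", "dad1", "mom1", "1", "2"),
    ("fam1", "dad1", "0", "0", "1", "1"),
    ("fam1", "mom1", "0", "0", "2", "1"),
]


def write_ped(rows) -> str:
    """Serialize rows as tab-separated PED text, one sample per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def check_ped(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    rows = [
        line.split("\t")
        for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    sample_ids = {r[1] for r in rows if len(r) == 6}
    problems = []
    for r in rows:
        if len(r) != 6:
            problems.append(f"expected 6 columns, got {len(r)}: {r}")
            continue
        for parent in r[2:4]:
            # "0" marks a missing parent; any other ID should be in the file.
            if parent != "0" and parent not in sample_ids:
                problems.append(f"{r[1]}: parent {parent} is not in the file")
    return problems


if __name__ == "__main__":
    ped_text = write_ped(TRIO)
    print(ped_text, end="")
    print(check_ped(ped_text) or "no problems found")
```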

Output

See this wiki page for more information about how to use the output.

This does:

  1. Run DeepVariant and GLNexus (we have shown these tools to give higher quality results for trios) in an efficient nextflow workflow that can be easily run in the cloud or on a cluster.
  2. Decompose and normalize variants.
  3. Annotate with bcftools csq and snpEff
  4. Annotate with allele frequency and inheritance modes using slivar
  5. Annotate with gene-based annotations:
    • clinvar-gene-phenotype
    • loss-of-function intolerance
  6. Output high-quality calls from slivar for recessive, dominant, x-linked, compound-het and other inheritance modes.
  7. Generate and link pre-made, standalone igv.js/jigv outputs for each candidate.
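
As an illustration of what step 2 means, the Python sketch below splits a multi-allelic record into one record per alternate allele and trims bases shared between REF and ALT. It is a simplified sketch and not the code the workflow runs. Full normalization also left-aligns indels against the reference sequence, which is omitted here. The function names are hypothetical.

```python
def trim(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Trim shared bases from a REF/ALT pair, keeping at least one base each."""
    # Trim shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def decompose(pos: int, ref: str, alts: list[str]) -> list[tuple[int, str, str]]:
    """Split a multi-allelic site into one trimmed record per ALT allele."""
    return [trim(pos, ref, alt) for alt in alts]


if __name__ == "__main__":
    # One site with a deletion allele and a SNP allele.
    for record in decompose(100, "CAG", ["CA", "TAG"]):
        print(record)
```

After this step each record describes a single allele in a compact form, which makes it easier to match against annotation resources that store one allele per entry.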

The key output will be in results-rare-disease/${cohort_name}.slivar.candidates.tsv, which can easily be viewed in Excel or other spreadsheet software. In addition, the workflow will create results-rare-disease/${cohort_name}.jigv.html and results-rare-disease/jigv_plots/*, which together provide an HTML table and interactive igv.js views of each variant and its associated alignments that do not rely on the original alignment files.
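
Because the candidates file is tab-separated, it can also be summarized with a few lines of Python. The sketch below counts rows per value of one column. The column names and rows in `DEMO_TSV` are made up for illustration and are assumptions, so check the header of the real file before relying on any of them. The `count_by` helper is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for the candidates TSV; real column names may differ.
DEMO_TSV = (
    "mode\tfamily_id\tgene\n"
    "denovo\tfam1\tGENE_A\n"
    "recessive\tfam1\tGENE_B\n"
    "denovo\tfam2\tGENE_C\n"
)


def count_by(handle, column: str) -> Counter:
    """Count rows per distinct value of `column` in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    # For a real run, replace the StringIO with open(path, newline="").
    counts = count_by(io.StringIO(DEMO_TSV), "mode")
    for value, n in counts.most_common():
        print(value, n)
```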

In coming releases, this will:

  1. Output QC with somalier and other tools to be shown in MultiQC
  2. Output high-quality SVs (using manta -> graphtyper)

Octopus

Currently, octopus is included as a separate workflow. The octopus.nf pipeline will detect trios and families and run them together, then iteratively merge across families using the n+1 schema described in the octopus docs. Finally, the workflow will do the forest filtering as recommended by the octopus documentation. We plan to integrate the octopus and DeepVariant calls in the future.
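
To give a sense of what detecting trios from a pedigree involves, here is a small Python sketch that groups samples by family and reports each child whose two parents are both present. It is not the logic in octopus.nf. The `find_trios` function and the sample names are hypothetical.

```python
from collections import defaultdict


def find_trios(ped_rows):
    """Map each family ID to its (child, father, mother) trios.

    ped_rows: iterable of (family_id, sample_id, paternal_id, maternal_id),
    where "0" marks a missing parent as in the PED format.
    """
    by_family = defaultdict(list)
    for family_id, sample_id, dad, mom in ped_rows:
        by_family[family_id].append((sample_id, dad, mom))

    trios = {}
    for family_id, members in by_family.items():
        present = {sample_id for sample_id, _, _ in members}
        # A trio needs both parents to be samples in the same family.
        trios[family_id] = [
            (kid, dad, mom)
            for kid, dad, mom in members
            if dad in present and mom in present
        ]
    return trios


if __name__ == "__main__":
    rows = [
        ("fam1", "kid1", "dad1", "mom1"),
        ("fam1", "dad1", "0", "0"),
        ("fam1", "mom1", "0", "0"),
        ("fam2", "kid2", "0", "mom2"),  # a duo: only the mother is sequenced
        ("fam2", "mom2", "0", "0"),
    ]
    print(find_trios(rows))
```

Families without a complete trio, such as the duo above, come back with an empty list and would need to be handled separately.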

Future Development

Development and research are under way so that it will:

  1. Add a high-quality set of SVs/CNVs
  2. Add some prioritization of variants
    • For example, lower priority to variants filtered in gnomAD
  3. Integrate SV/CNV calls with the SNPs/indels to find, for example, compound heterozygotes with a SNP:SV pair.
  4. Evaluate use of octopus to find large indels (and/or SNPs and indels).
  5. Use GTEx + phenotypes to further prioritize variants in a family- and phenotype-specific way, such that, for example, variants in genes that are not expressed in relevant tissues are down-weighted.
  6. Provide a graphical user interface so that sorting, filtering, note-taking and sharing are simplified.
