All Projects → KCCG → mity

KCCG / mity

Licence: MIT license
mity: A highly sensitive mitochondrial variant analysis pipeline for whole genome sequencing data

Projects that are alternatives of or similar to mity

mapping-iterative-assembler
Consensus calling (or "reference assisted assembly"), chiefly of ancient mitochondria
Stars: ✭ 15 (-44.44%)
Mutual labels:  genomics, mitochondria
sample
Performs memory-efficient reservoir sampling on very large input files delimited by newlines
Stars: ✭ 61 (+125.93%)
Mutual labels:  genomics
cerebra
A tool for fast and accurate summarizing of variant calling format (VCF) files
Stars: ✭ 55 (+103.7%)
Mutual labels:  genomics
genipe
Genome-wide imputation pipeline
Stars: ✭ 28 (+3.7%)
Mutual labels:  genomics
aws-genomics-workflows
Genomics Workflows on AWS
Stars: ✭ 131 (+385.19%)
Mutual labels:  genomics
simuG
simuG: a general-purpose genome simulator
Stars: ✭ 68 (+151.85%)
Mutual labels:  genomics
MorphIO
A python and C++ library for reading and writing neuronal morphologies
Stars: ✭ 25 (-7.41%)
Mutual labels:  mitochondria
bigly
a pileup library that embraces the huge
Stars: ✭ 38 (+40.74%)
Mutual labels:  genomics
gawn
Genome Annotation Without Nightmares
Stars: ✭ 35 (+29.63%)
Mutual labels:  genomics
ezancestry
Easy genetic ancestry predictions in Python
Stars: ✭ 38 (+40.74%)
Mutual labels:  genomics
viGEN
viGEN - A bioinformatics pipeline for the exploration of viral RNA in human NGS data
Stars: ✭ 24 (-11.11%)
Mutual labels:  genomics
sequencework
programs and scripts, mainly python, for analyses related to nucleic or protein sequences
Stars: ✭ 22 (-18.52%)
Mutual labels:  genomics
PHIST
Phage-Host Interaction Search Tool
Stars: ✭ 19 (-29.63%)
Mutual labels:  genomics
metaRNA
Find target sites for the miRNAs in genomic sequences
Stars: ✭ 19 (-29.63%)
Mutual labels:  genomics
bac-genomics-scripts
Collection of scripts for bacterial genomics
Stars: ✭ 39 (+44.44%)
Mutual labels:  genomics
GenomicsDB
Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.
Stars: ✭ 77 (+185.19%)
Mutual labels:  genomics
wgd
Python package and CLI for whole-genome duplication related analyses
Stars: ✭ 68 (+151.85%)
Mutual labels:  genomics
bfc
High-performance error correction for Illumina resequencing data
Stars: ✭ 66 (+144.44%)
Mutual labels:  genomics
BACTpipe
BACTpipe: An assembly and annotation pipeline for bacterial genomics
Stars: ✭ 19 (-29.63%)
Mutual labels:  whole-genome-sequencing
snp-sites
Finds SNP sites from a multi-FASTA alignment file
Stars: ✭ 182 (+574.07%)
Mutual labels:  genomics

mity logo

mity

mity is a bioinformatic analysis pipeline designed to call mitochondrial SNV and INDEL variants from Whole Genome Sequencing (WGS) data. mity can:

  • identify very low-heteroplasmy variants, even <1% heteroplasmy when there is sufficient read-depth (eg >1000x)
  • filter out common artefacts that arise from high-depth sequencing
  • easily integrate with existing nuclear DNA analysis pipelines (mity merge)
  • provide an annotated report, designed for clinicians and researchers to interrogate

Usage

mity -h

Dependencies

Installation

Installation instructions via Docker, pip, or manually are available in INSTALL.md

Example Usage

This is an example of calling variants in the Ashkenazim Trio.

mity call

First run mity call on three MT BAMs provided in mity/test_in. CRAM files are supported.

We recommend always using --normalise, or mity report won't work:

mity call \
--prefix ashkenazim \
--out-folder-path test_out \
--region MT:1-500 \
--normalise \
test_in/HG002.hs37d5.2x250.small.MT.RG.bam \
test_in/HG003.hs37d5.2x250.small.MT.RG.bam \
test_in/HG004.hs37d5.2x250.small.MT.RG.bam 

This will create test_out/normalised/ashkenazim.mity.vcf.gz (and tbi file).

or, if using Docker:

docker run -w "$PWD" -v "$PWD":"$PWD" drmjc/mity call \
--prefix ashkenazim \
--out-folder-path test_out \
--region MT:1-500 \
--normalise \
test_in/HG002.hs37d5.2x250.small.MT.RG.bam \
test_in/HG003.hs37d5.2x250.small.MT.RG.bam \
test_in/HG004.hs37d5.2x250.small.MT.RG.bam 

mity report

We can create a mity report on the normalised VCF:

mity report \
--prefix ashkenazim \
--min_vaf 0.01 \
--out-folder-path test_out \
test_out/ashkenazim.mity.vcf.gz

This will create: test_out/ashkenazim.annotated_variants.csv and test_out/ashkenazim.annotated_variants.xlsx.

mity normalise

High-depth sequencing and sensitive variant calling can create many variants with more than 2 alleles, and in some cases, joins two nearby variants separated by shared REF sequence into a multi-nucleotide polymorphism as discussed in the manuscript. Here, variant normalisation relates to decomposing the multi-allelic variants and where possible, splitting multi-nucleotide polymorphisms into their cognate smaller variants. At the time of writing, all variant decomposition tools we used failed to propagate the metadata in a multi-allelic variant to the split variants which caused problems when reporting the quality scores associated with each variant.

Technically you can run mity call and mity normalise separately, but since mity report requires a normalised vcf file, we recommend running mity call --normalise.

mity merge

You can merge a nuclear vcf.gz file and a mity.vcf.gz file thereby replacing the MT calls from the nuclear VCF ( presumably from a caller like HaplotypeCaller which is not able to sensitively call mitochondrial variants) with the calls from mity.

mity merge \
--prefix ashkenazim \
--mity_vcf test_out/ashkenazim.mity.vcf.gz \
--nuclear_vcf todo-create-example-nuclear.vcf.gz

Recommendations for interpreting the report

Assuming that you are looking for a pathogenic variant underlying a patient with a rare genetic disorder potentially caused by a Mitochondrial mutation, then we recommend the following strategy:

  1. tier 1 or 2 variants included in the 'commercial_panels' column
  2. tier 1 or 2 variants that match the clinical presentation and the phenotype in 'disease_mitomap', preferably those that are annotated with Confirmed evidence in the 'status_mitomap' column
  3. exclude common variants: anything linked to 'phylotree_haplotype', high 'phylotree_haplotype', high 'MGRB_frequency', high 'GenBank_frequency_mitomap'.
  4. consider any remaining tier 1 or 2 variants that may have a predicted impact on tRNA
  5. consider any remaining variants with high numbers of 'variant_references_mitomap'
  6. if you have analysed multiple family members, consider variants who's level of 'variant_heteroplasmy' match the disease burden
  7. tier 3 variants have low numbers of supporting reads, and should be considered with caution. However we have observed numerous tier 3 variants, especially in WGS from blood, that match the pathogenic allele known to be at much higher heteroplasmy in the affected tissue (this phenomenon is well established in the literature). Thus, if there are any tier 3 variants identified that match the patient's clinical presentation, then we recommend considering these as candidate variants and validating using an orthogonal clinically validated assay, preferably on the disease affected tissue.

Reference genomes

Human

mity natively supports the analysis of the revised Cambridge Reference Sequence (rCRS, RefSeq ID NC_012920.1). The rCRS used in most human reference genomes from NCBI (GRCh37, hs37d5, GRCh38) and hg38 from UCSC, where it is either named chrM, or MT. The main exception in common use is the hg19 reference genome from UCSC, which used a different sequence (RefSeq NC_001807) which differs in length by 2bp, and sharing 99% sequence homology (16530/16572 identities) and 4 gaps. For now, mity call supports the hg19 reference, but mity report will not annotate variants properly, so you should not use this part of the pipeline. We strongly recommend that for mitochondrial analysis, to use a reference genome that uses the rCRS sequence.

  • the mitochondrial genome: since the release of the UCSC hg19 assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the Genome Browser) has been replaced in GenBank with the record NC_012920, the revised Cambridge Reference Sequence (rCRS). We have not replaced the original sequence, NC_001807, as chrM in the hg19 Genome Browser. However, files in the subdirectory p13.plusMT include NC_012920 as "chrMT", in addition to the original "chrM".
Reference contig name RefSeq ID length rCRS
GRCh37 chrM NC_012920.1 16569 bp rCRS
hs37d5 MT NC_012920.1 16569 bp rCRS
hg19 (UCSC) chrM NC_001807.4 16571 bp no
GRCh38 chrM NC_012920.1 16569 bp rCRS

Mouse

mity call and normalise support the analysis of the mouse genome (mity call --reference mm10 ...). mity report currently only supports variant annotation to the human rCRS sequence.

Commonly asked Questions

Base quality score recalibration (BQSR)

Most of the development of mity was tested on BAM files that had undergone GATK's BQSR method, which improves the base qualities of each read. In our experience, this reduced the quality score of most bases by ~10 points, indicating that the base qualities straight out of the sequencer are generally inflated. As the GATK best practices guide no longer recommends BQSR, it's reasonable to ask whether mity can be run on BAM files straight out of the aligner. mity has a custom QUAL score, which depends on the base qualities of only the reads that support the alternative allele.
For tier 1 or 2 variants, there will be so many supporting reads, that any miscalibration of base quality scores will have no material effect. Tier 3 variants with very few supporting reads may be impacted, where a variant with only 3 or 4 supporting reads may end up having a stronger mity QUAL score than after BQSR. The comment above regarding how you should interpret and validate tier 3 variant still holds. We would appreciate any feedback you may have on this.

CRAM support

CRAM support was added to mity call in v0.4.0.

Acknowledgements

We would like to thank:

  • The Kinghorn Centre for Clinical Genomics and collaborators, who helped with feedback for running mity.
  • The Genome in a Bottle consortium for providing the test data used here
  • Eric Talevich who's CNVkit helped us structure mity as a package
  • Erik Garrison for developing FreeBayes and his early feedback in optimising FreeBayes for sensitive variant detection.
  • Brent Pederson for developing gsort
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].