Illumina / Paragraph

Licence: apache-2.0
Graph realignment tools for structural variants

Paragraph: a suite of graph-based genotyping tools

Introduction

Accurate genotyping of known variants is critical for the analysis of whole-genome sequencing data. Paragraph aims to facilitate this by providing an accurate genotyper for structural variants using short-read data.

Please reference Paragraph using:

(Second version uploaded on September 24, 2019)

Genotyping data in this paper can be found at paper-data/download-instructions.txt

Installation

Please check doc/Installation.md for system requirements and installation instructions.

Run Paragraph from VCF

Test example

After installation, run the multigrmpy.py script from the build/bin directory on an example dataset as follows:

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

This runs a simple genotyping example for two test samples.

  • candidates.vcf: specifies candidate SV events in VCF format.
  • samples.txt: manifest that specifies some test BAM files. Tab or comma delimited.
  • dummy.fa: a short dummy reference that contains only chr1.

The output folder test then contains the final genotypes as gzipped VCF and JSON files:

$ tree test
test
├── grmpy.log            #  main workflow log file
├── genotypes.vcf.gz     #  Output VCF with individual genotypes
├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz
├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph
└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)

If successful, the last 3 lines of genotypes.vcf.gz will be the same as in the expected file.
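The comparison described above can be scripted. The following is a minimal sketch, not part of Paragraph itself: the helper names tail_lines and tails_match are hypothetical, and the path to the expected file is left as a parameter because its location is not given here.

```python
import gzip
from collections import deque


def tail_lines(path, n=3):
    """Return the last n non-empty lines of a plain-text or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        stripped = (line.rstrip("\n") for line in handle if line.strip())
        # A bounded deque keeps only the final n lines while streaming.
        return list(deque(stripped, maxlen=n))


def tails_match(observed_path, expected_path, n=3):
    """True if the last n lines of both files are identical."""
    return tail_lines(observed_path, n) == tail_lines(expected_path, n)
```

For example, tails_match("test/genotypes.vcf.gz", expected_path) would return True after a successful run, where expected_path is wherever your checkout stores the expected file.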

Input requirements

VCF format

paraGRAPH genotypes each entry of the input VCF independently. You can use either an indel-style representation (full REF and ALT allele sequences in the 4th and 5th columns) or symbolic alleles, as long as they meet the format requirements of VCF 4.0+.

Currently we support 4 symbolic alleles:

  • <DEL> for deletion
    • Must have END key in INFO field.
  • <INS> for insertion
    • Must have a key in INFO field for insertion sequence (without padding base). The default key is SEQ.
    • For blockwise swaps, we strongly recommend using the indel-style representation rather than symbolic alleles.
  • <DUP> for duplication
    • Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is duplicated one more time in the alternative allele.
  • <INV> for inversion
    • Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is reverse-complemented in the alternative allele.
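The INFO-key requirements listed above can be checked before running the genotyper. This is a minimal sketch and not Paragraph's own validation: check_record and parse_info are hypothetical helpers, the input is assumed to be a single tab-separated VCF data line, and the insertion key defaults to SEQ as stated above.

```python
# Symbolic alleles that require an END key in INFO, as listed above.
NEEDS_END = {"<DEL>", "<DUP>", "<INV>"}


def parse_info(info_field):
    """Turn 'END=200;IMPRECISE' into {'END': '200', 'IMPRECISE': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def check_record(line, ins_key="SEQ"):
    """Return a list of problems for one VCF data line (empty list means OK)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return ["fewer than 8 tab-separated columns"]
    info = parse_info(fields[7])
    problems = []
    for alt in fields[4].split(","):
        if not alt.startswith("<"):
            continue  # indel-style allele: no INFO key required here
        if alt in NEEDS_END:
            if "END" not in info:
                problems.append(f"{alt} is missing END in INFO")
        elif alt == "<INS>":
            if ins_key not in info:
                problems.append(f"<INS> is missing {ins_key} in INFO")
        else:
            problems.append(f"unsupported symbolic allele {alt}")
    return problems
```

Running check_record over every non-header line of the candidate VCF gives a quick report of records that would not meet the requirements.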

Sample Manifest

Must be tab-delimited.

Required columns:

  • id: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
  • path: Path to the BAM/CRAM file.
  • depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).
  • read length: Average read length (bp) across the genome.

Optional columns:

  • depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).
  • depth variance: Square of depth sd.
  • sex: Affects chrX and chrY genotyping. Allowed values are "male" or "M", "female" or "F", and "unknown" (the quotes should not be included in the manifest). If not specified, the sample is treated as unknown.
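A manifest with the required columns can be generated programmatically. The sketch below assumes that the header row uses the column names exactly as listed above; write_manifest and default_depth_sd are hypothetical helper names, and default_depth_sd only reproduces the documented default sqrt(5*depth) for reference.

```python
import csv
import math

# Required columns, spelled as in the list above (assumed to be the header row).
REQUIRED_COLUMNS = ["id", "path", "depth", "read length"]


def default_depth_sd(depth):
    """Documented default for the optional 'depth sd' column: sqrt(5 * depth)."""
    return math.sqrt(5 * depth)


def write_manifest(samples, out_path):
    """Write a tab-delimited manifest from a list of per-sample dicts."""
    ids = [sample["id"] for sample in samples]
    if len(ids) != len(set(ids)):
        raise ValueError("each sample must have a unique id")
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=REQUIRED_COLUMNS,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        for sample in samples:
            writer.writerow({col: sample[col] for col in REQUIRED_COLUMNS})
```

Each sample dict needs the four required keys, for example {"id": "sample1", "path": "sample1.bam", "depth": 30, "read length": 150}, where the file name and values are placeholders.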

Run time

  • On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.

  • If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.

  • For efficiency, we recommend manually setting the "-M" option (maximum allowed read count for a variant) to skip these high-depth regions. A good value for "-M" is 20 times your mean sample depth.

Population-scale genotyping

To efficiently genotype SVs across a population, we recommend running Paragraph in single-sample mode as follows:

  • Create a manifest for each single sample
  • Run multigrmpy.py for each manifest. Be sure to set "-M" option for each sample according to its depth.
  • Multithreading (option "-t") is highly recommended for population-scale genotyping
  • Merge all genotypes.vcf.gz files into a single VCF covering all samples. You can use either bcftools merge or a custom script.
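The per-sample workflow above can be sketched as a small command builder. This is an illustration under assumptions rather than a supported wrapper: the function names are hypothetical, the thread count is arbitrary, and the commands are only constructed, not executed. The multigrmpy.py flags are the ones shown in this document; note that bcftools merge typically expects compressed, indexed inputs.

```python
def max_read_count(depth, factor=20):
    """Value for -M: 20 times the mean sample depth, as recommended above."""
    return int(round(factor * depth))


def genotyping_command(vcf, manifest, reference, out_dir, depth, threads=4):
    """Build the multigrmpy.py command for one single-sample manifest."""
    return [
        "python3", "bin/multigrmpy.py",
        "-i", vcf,
        "-m", manifest,
        "-r", reference,
        "-o", out_dir,
        "-M", str(max_read_count(depth)),
        "-t", str(threads),
    ]


def merge_command(genotype_vcfs, merged_path):
    """Build a bcftools merge command that writes a compressed VCF."""
    return ["bcftools", "merge", "-O", "z", "-o", merged_path, *genotype_vcfs]
```

Each returned list can be passed to subprocess.run(cmd, check=True), once per sample for genotyping and once at the end for the merge.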

Run Paragraph on complex variants

For more complicated events (e.g. genotyping a deletion together with a nearby SNP), you can provide a customized JSON to paraGRAPH.

Please follow the pattern in the example JSON and make sure all required keys are provided. Here is a visualization of this sample graph.

To obtain graph alignments for this graph (including all reads), run:

bin/paragraph -b <input BAM> \
              -r <reference fasta> \
              -g <input graph JSON> \
              -o <output JSON path> \
              -E 1

To obtain the alignment summary, the genotypes of each breakpoint, and the whole graph, run:

bin/grmpy -m <input manifest> \
          -r <reference fasta> \
          -i <input graph JSON> \
          -o <output JSON path> \
          -E 1

If you have multiple events listed in the input JSON, multigrmpy.py can help you to run multiple grmpy jobs together.

Further Information

Documentation

External links

  • The Illumina/Polaris repository provides the short-read sequencing data we used to test our method at population scale.

License

The LICENSE file contains information about libraries and other tools we use, and license information for these.
