GraphAlignmentViewer:

Visualize the pileup of read alignments to a STR locus (or any path in the sequence graph) for one or more samples.

Description:

Genotyping STR loci is a difficult problem due to multiple reasons including flanks with similar genomic sequences, difficulty in aligning reads sequencing errors and sample contamination. ExpansionHunter is a tool that can automatically predict the genotypes of STR loci based on read alignments. However, due to the difficult nature of the problem, there may be errors in the predicted genotype. GraphAlignmentViewer, is a standalone Python3 script that creates a visualization of all read alignments to an STR locus to enable visual inspection of the genotype predictions made by ExpansionHunter. The script can visualize one or more samples, including trios to compare genotype calls in multiple samples. For each sample the script requires genotypes predicted by ExpansionHunter in BAM format for EH version 3 or YAML format for EH version 2.5.

The image generated by GraphAlignmentViewer displays alignments of reads to a modified reference genome. The modified reference consists of the STR sequence long enough to accommodate alignments of all reads together with the sequence immediately flanking the repeat. The repeat sequence is separated from the flanking sequence by vertical lines. If the STR consists of mutliple repeat units, these are separated by vertical lines as well.

Read alignments are grouped into horizontal panels. Each panel consists of reads that either overlap or completely span the number of repeat units indicated on top of the panel. For example, if a sample contains a short allele consisting of five repeat units then the plot is expected to contain a panel with alignments of reads spanning exactly five units. Repeat alleles that are much larger than the read length correspond to panels depicting multiple reads completely contained within the repeat (analogously to pileups corresponding to copy number variant, the more such reads there are, the longer is the corresponding allele).

Each panel displays the sequence of the modified reference in the top row, followed by the sequence of a read in the panel on each row. Insertions and deletions are denoted by "^" and "-" symbols respectively. If reads from multiple samples are provided, then the reads from each sample are stacked vertically from top to bottom. Finally, GraphAlignmentViewer displays the genotypes reported by ExpansionHunter and the locus structure in the title of the plot.

Note: The current version only supports graph specifications consisting of linear paths and self-loops, but not branching nodes.

Example of a visualization generated by GraphAlignmentViewer:

Usage:

python3 GraphAlignmentViewer.py --help

Use case 1: Single sample

python3 GraphAlignmentViewer.py --variant_catalog VARIANT_CATALOG --read_align READ_ALIGN_FILE [--gt_file GT_FILE]

Use case 2: Comparison of genotypes from multiple samples

python3 GraphAlignmentViewer.py --variant_catalog VARIANT_CATALOG --read_align_list READ_ALIGN_FILE_LIST

Requirements:

Python3
Matplotlib
Pysam
Numpy
PyYAML (if visualizing output from versions older than v3)

Inputs:

Common inputs:

VARIANT_CATALOG: Path to variant catalog JSON file used to run ExpansionHunter

Use case 1: Single sample

READ_ALIGN_FILE: Read alignment file generated be ExpansionHunter. BAM for v3, YAML for v2.5.
GT_FILE (optional): VCF or JSON output of genotype calls from ExpansionHunter

Use case 2: Comparison of genotypes from multiple samples

READ_ALIGN_FILE_LIST: A 2/3 column text CSV file containing the list of EH output for all samples:

Column1: Sample name,
Column2: Read alignment from EH output similar to READ_ALIGN_FILE,
Column3 (optional): VCF or JSON file from EH output

The pileups for the samples are shown in the order reported in the input file.
Note: If the paths to the read alignment BAM/YAML and VCF files are not absolute, they are chosen relative to the location of the sample list file.

Optional Arguments:

Option	Argument	Default	Description
`--file_format`	`FILE_FORMAT`	`v3`	Format of read alignments from EH. [`v3`: BAM, `v2.5`: YAML]
`--locus_id`	`LOCUS_ID`	Plot pileups for all loci	Comma-separated list of locus IDs for which to plot pileup
`--greyscale`	`-`	Nucleotides colored in IGV color scheme	Show nucleotides in greyscale: high quality match - black, low quality match - grey, mismatch - red
`--show_read_names`	`-`	Do not display read names	Display read names next to the read alignment
`--show_insertions`	`-`	Do not display inserted sequences	Display full sequences of insertions
`--output_prefix`	`OUTPUT_PREFIX`	No prefix. Output filename(s): `<CHROM>-<START>-<REPEATUNIT>.alignment.png`(`.pdf`)	Prefix of output file. Output filename(s): `<OUTPUT_PREFIX>_<CHROM>-<START>-<REPEATUNIT>.alignment.png`(`.pdf`) corresponding to the position of the first repeat unit in the node grouping. If node grouping is `NONE` or `ALL`, then position corresponds to the first repeat unit in the locus.
`--output_dir`	`OUTPUT_DIR`	`Current working directory`	Output directory
`--title_prefix`	`TITLE_PREFIX`	""	Prefix text to be appended to title of the plot
`--reference_fasta`	`REFERENCE_FASTA`	Represent flanks with 'N's	Indexed FASTA file for reference sequence
`--node_grouping`	`NODE_GROUPING`	Create a separate image for each repeat unit	Comma-separated list of node indices (left flank=`0`) to group and sort reads by genotype. `NONE`: sort reads only by position, `ALL`: group by all repeat nodes from left to right.
`--region_extension_length`	`REGION_EXTENSION_LENGTH` (`INT`)	`1000`	Size of nodes flanking the region structure used for generating the read alignments
`--region_extension_clip_length`	`REGION_EXTENSION_CLIP_LENGTH` (`INT`)	`20`	Number of basepairs of flanking regions to display. `-1`: Infer from maximum span of reads overlapping the locus.
`--dpi`	`DPI` (`INT`)	`100`	Resolution of output PNG image
`--pdf`	`-`	Output PNG image	Output PDF vector graphics image instead of PNG

Outputs:

The script produces 1 file per repeat unit or NODE_GROUPING: <OUTPUT_DIR>/<CHROM>-<START>-<REPEATUNIT>.alignment.png. If the --pdf flag is set then it produces the file <OUTPUT_DIR>/<CHROM>-<START>-<REPEATUNIT>.alignment.pdf instead.

If the flag --output_prefix is set then the output file name is <OUTPUT_DIR>/<OUTPUT_PREFIX>_<CHROM>-<START>-<REPEATUNIT>.alignment.png or <OUTPUT_DIR>/<OUTPUT_PREFIX>_<CHROM>-<START>-<REPEATUNIT>.alignment.pdf

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Illumina / GraphAlignmentViewer

Programming Languages

GraphAlignmentViewer:

Description:

Usage:

Use case 1: Single sample

Use case 2: Comparison of genotypes from multiple samples

Requirements:

Inputs:

Common inputs:

Use case 1: Single sample

Use case 2: Comparison of genotypes from multiple samples

Optional Arguments:

Outputs: