Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Bioinformatics Paper Wala, J., C. Zhang, M. Meyerson, R. Beroukhim. VariantBam: filtering and profiling of nextgenerational sequencing data using region-specific rules. 2016. Bioinformatics, doi: 10.1093/bioinformatics/btw111

Installation

I have succesfully built on Unix with GCC-4.8+

### if on Broad Institute servers, add GCC-4.9
reuse -q GCC-4.9

############## DOWNLOAD AND INSTALL BOOST ###############
############## (only if not already installed) ##########
git clone --recursive https://github.com/boostorg/boost.git
cd boost
./bootstrap.sh --with-libraries=regex
./b2

############### DOWNLOAD VARIANT BAM ############### 
git clone --recursive https://github.com/jwalabroad/VariantBam.git
cd VariantBam

############### COMPILE AND INSTALL ###############
./configure --with-boost=<path_to_boost>
make

############### QUICK START ############### 
## using the included test BAM (HCC1143)
VariantBam/src/variant test/small.bam -g 'X:1,000,000-1,100,000' -r mapq[10,100] -c counts.tsv -o mini.bam -v

## get help
VariantBam/src/variant --help

############## TL;DR ###################

## extract all reads and their pair-mates that overlap SNP sites within 100 bp
rfile=<BED file, samtools-style string (e.g. "1:1,000,000-1,000,010"), or VCF>
variant <bam> -l $rfile -o mini.bam -P 100 -v

## mask regions (exclude reads and their pair-mates that overlap)
variant <bam> -L $rfile -o mini.bam -v 

## extract high-quality clipped reads
variant <bam> -r 'phred[4,100];clip[5,1000]' -o mini.bam -v

## subsample to max-coverage. BAM must be sorted
variant <bam> -m 100 -o mini.bam -v

Description

VariantBam is a tool to extract/count specific sets of sequencing reads from next-generational sequencing files. To save money, disk space and I/O, one may not want to store an entire BAM on disk. In many cases, it would be more efficient to store only those read-pairs or reads who intersect some region around the variant locations. Alternatively, if your scientific question is focused on only one aspect of the data (e.g. breakpoints), many reads can be removed without losing the information relevant to the problem.

Example Use 1

Whole-genome analysis has been conducted on a BAM, generating VCF and MAF files. Ideally, these regions could be manually inspected or reanalyzed without having to keep the entire BAM. Running VariantBam to extract only reads that overlap these events will allow these regions to be rapidly queried, without having to keep the full BAM record.

### Extract all read PAIRS that interset with a variant from a VCF
variant $bam -l myvcf.vcf -o mini.bam

Example Use 2

In situations where the sequencing or library preparation quality is low, it may be advantageous to remove poor quality reads before starting the analysis train. VariantBam handles this by optionally taking into account Phred base-qualities when making a decision whether to keep a sequencing read. For instance, one might only be interested in high quality MAPQ 0 or clipped reads. VariantBam can be setup to apply unique Phred filters to different regions or across the entire genome, all with one-pass.

### Extract only high quality reads with >= 50 bases of phred >=4 and MAPQ >= 1 and not duplicated/hardclip/qcfail
variant $bam -r 'phred[4,100];length[50,1000];mapq[1,60];!duplicate;!hardclip!;qcfail' -o mini.bam

Example Use 3

An NGS tool operates only on a subset of the reads (eg. structural variant caller using only clipped/discordant reads). Running VariantBam to keep only these reads allows the tool to run much faster. This is particurlaly useful for facilitating a more rapid "build/test" cycle.

### Extract clipped, discordant, unmapped and indel reads
variant $bam  -r '[email protected][0,0];!hardclip;!supplementary;!duplicate;!qcfail;phred[4,100];%[email protected]%discordant[0,1000];mapq[1,1000]%mapq[1,1000];clip[5,1000]%ins[1,1000];mapq[1,100]%del[1,1000];mapq[1,1000]' -o mini.bam

Example Use 4

A user wants to profile a BAM for quality. They would like to count the number of clipped reads in a BAM file, so long as those reads have sufficient optical quality and mapping quality. VariantBam run with the -x flag for "counting only" will accomplish this.

### 
variant $bam -r 'clip[5,100];phred[4,100];mapq[10,100]' -x counts.tsv

Example Use 5

A team is only interested in variants in known cancer genes, and would like to analyze thousands of exomes and genomes. Running VariantBam to extract reads from only these genes, and sending the BAM files to compressed CRAM provides sufficient data reduction to allow all of the relevant data to be stored on disk.

### Grab only reads from predefined regions. Strip unneccessary tags and convert to CRAM for maximum compression
variant $bam -g mygenes.bed -r all -C -o mini.cram -s BI,OQ

Example Use 6

A research team would like to extract only reads matching a certain motifs, but only if they have high optical quality. VariantBam with the motif rule will accomplish this with rapid O(n) efficiency for an arbitrarily large motif dictionary (where n is the length of a read)

### 
variant $bam -r 'motif[mymotifs.txt];phred[4,100];length[20,1000]' -o mini.bam

Example Use 7

To reduce the size of the BAM, reads can be removed from centromeric and satellite repeat regions. These reads are rarely helpful for variant calling. To remove reads that intersect a region, set the region as an inverse-region. In a VariantBam script, use !region or !mlregion. For quick use on the command line, use -L or -G (opposites of -l and -g).

### 
variant $bam -L bad.bed -o mini.bam -v

Example Use 8

Massive read-pileups can occur at repetitive regions. These can reduced with VariantBam by subsampling to a max-coverage.

### BAM must be sorted
variant $bam -m 100 -o mini.bam -v

Example Use 9

A user would like to extract only those reads supporting a particular allele at a variant site. This can be done by combining a small point-region at the variant site with a motif dictionary. Consider two alleles G and A at a site (e.g. 1:143250877), along with their adjacent sequences: GCAGAAT and GCAAAAT. To extract variant reads supporting the A allele:

## make the motifs file (include reverse complements) 
printf "GCAAAAT\nATTTTGC" > motifs.txt
## just look near the variant
k="1:143,250,677-143,251,077" 
r='motif[motifs.txt]'
g=1:143250877
variant <bam> -k $k -g $g -r $r -o mini.bam

Because sequence information is required to match a motif, and reads do not contain the sequence information of their pair-mates, extracting all read pairs supporting a particular allele requires a two-pass solution:

## two pass solution                                                                                                                                                                                                                                                            
variant <bam> -k $k -g $g -r $r | cut -f1 | uniq > q.txt                                                                                                                                                                        
printf "^@\n" >> q.txt ## keep the sam header too                                                                                                                                                                                                                           
samtools view <bam> $k -h | grep -f q.txt | samtools view - -b > mini.bam

This can be expanded for an arbitrary number of heterozygous sites, for instance to capture reads from a single haplotype:

r='[email protected]:143,250,677%motif[motifsA.txt]%[email protected]:183,250,677%motif[motifsB.txt]'
variant <bam> -r $r -o mini.bam

Note that for the allele-specific extraction, there will be false negatives (reads not extracted) if a read has a sequencing error within the motif.

Tool comparison

VariantBam packages into a single executable a number of filtering features not easily found using samtools + awk::

Filter specifically on read clipping, orientation and insert size (all important for structural variation), while taking into account the per-base phred quality

Interval tree to efficiently determine if a read or read mate overlaps a region

Provide different rules for different arbitrarily-sized regions, and to provide these regions as common variant files (VCF, MAF, BED)

Select reads by matching motifs against a large dictionary using Aho-Corasick implementation

Count reads that satisfy any number of user-defined properties

Read and write CRAM files

Selectively strip alignment tags

Support for sub-sampling to obtain a BAM file with a coverage limit

VariantBam is implemented in C++ and uses HTSlib, a highly optimized C library used as the core of Samtools and BCFtools.

To get a full list of options, run variant --help.

Rules Script Syntax

This section will describe the syntax used by VariantBam to specify the cascades of rules and regions applied to a BAM. Below is an example of a valid VariantBam script:

    ### this is a comment. The line code below defines filters to be applied to each region/rule
    [email protected]
    !hardclip;mapped;mapped_mate;isize[0,600];!mapq[10,100]
    !hardclip;mapped;mapped_mate;clip[10,101]

Region

Let's look first at the region tag. The [email protected] keyword marks that what follows is a genomic region, which is either the keyword WG for whole genome, or a VCF, MAF, Callstats or BED file. Regions are treated such that they will include any read who overlaps it, even partially. Optionally, you can specify that your region of interest is a bit bigger than is actually in the file. You can do this by "padding" the regions around the sites. For example:

pad[1000];[email protected]

Alternatively, if supplying a region directly with the -l, -L, -g or -G flag, you can specify a padding with the -P flag. Note that this padding will be applied to every region provided with one of these flags:

variant <in.bam> -P 100 -l myvcf.vcf

Note that the -P flag is applied BEFORE the region, and sets the padding for all subsequent regions (reset with -P 0, or set to different padding) You can also state that the region applies to reads who don't necessarily overlap the region, but their pair-mate does (called "mate-linking"). Note that this only applies to the all target. This is particularly useful for extracting all read PAIRS that cover a variant site.

pad[1000];[email protected]

Note that the syntax is such that you must specify the file immediately after the @.

A region can be inverted, so as to exclude any read that satisfies a rule on that region. In the rules-script, simply prepend the region with a !.

## exclude mapq = 0 reads and their mates in blacklist regions
pad[1000];![email protected]
mapq[0,0]

Rules

Rules are supplied as a list of criteria that a read must satisfy. Note that you can take the complement of a condition by prefixing with a !. For example:

    # do not include hardclipped reads, reads with isize > 600, or reads with mapq between 10 and 100.
    !hardclip;isize[0,600];!mapq[10,100]
    
    # an equivalent specification would be
    !hardclip;mapped;!isize[601,250000000];mapq[0,9]``

VariantBam handles multiple rules in the following way. For each read, VariantBam will cycle through the rules within a region until the read satisfies a rule. When it does, it includes the read in the output and stops checking. The logic for the entire collection of rules is then as follows:

On a given rule line, the read must satisfy ALL conditions (logical AND)

Across different rules, the read nead only satisfy ONE rule (logical OR)

To illustrate this, note that there is a small discrepancy in the first rule of the above. In the BAM format, unmapped reads and reads with unmapped mates are given an insert size of 0. However, in the same rule a condition is described to keep all reads with insert sizes 0-600 inclusive. Recalling the AND logic within a rule, VariantBam will exclude the read, because it fails the mapped criteria.

Below is another example which uses the ability of VariantBam to interpret VCFs and BED files, and apply rules separately to them.

    ### declare that region is a VCF file with pads of 1000 on either side of the variant.
    ### use the "mlregion" keyword to specify that pairs whose mate falls in the region belong to this rule
    [email protected]/home/unix/jwala/myvcf.vcf;pad[1000]
    #### I want to keep all the reads (this the default). Ill be explicit with the "all" keyword
    all
    #### A BED file which gives a list of exons. In here, I just want to keep "variant" reads pairs
    [email protected]/home/unix/jwala/myexonlist.bed 
    ## keep discordant reads
    !isize[0,600];
    ## keep only unmapped reads and their mates
    !mapped;!mapped_mate
    ## or keep if it is hardclipped
    hardclip
    ## keep reads with a mismatch to reference, but with high mapq
    nm[1,101];mapq[30,100]

Global

To reduce redundancy, you can also type a [email protected] rule anywhere in the stack, and it will append that rule to everything below. For example, to exclude hardclipped, duplicate, qcfail and supplementary reads in every region, you would do:

    [email protected]!hardclip;!duplicate;!qcfail;!supplementary
    [email protected]
    !isize[0,600]
    clip[10,101];mapq[1,60]
    [email protected]

which is equivalent to

    [email protected]
    !isize[0,600];!hardclip;!duplicate;!qcfail;!supplementary
    clip[10,101];mapq[1,60];!hardclip;!duplicate;!qcfail;!supplementary
    [email protected]
    !hardclip;!duplicate;!qcfail;!supplementary

The global tag will apply through all of the regions. If you want to reset it for everything, just add [email protected] back onto the stack.

To make things run a little faster, you can set the order so that the more inclusive regions / rules are first. This only applies if there is an overlap among regions. This is because VariantBam will move down the list of regions that apply to this read and stop as soon as it meets an inclusion criteria. I prefer to start with a whole-genome region / rule set, and then add more fine-mapped regions later.

Command Line Script

The usual method of inputing rules is with a VariantBam script as a text file (passed to VariantBam with the -r flag). However, sometimes it is useful to not have to write an intermediate file and just feed rules directly in. In that case, just pass a string literal to the -r, -g, -l flags, and VariantBam will parse directly. -r will append a new rule, -g will append a new region and -l will append a new mate-linke regions. You can separate rule lines with either a new -r flag or with a %. For instance, you might run something like the following:

variant big.bam -o small.bam -r '[email protected]!hardclip' -g WG -r '!isize[0,600];%clip[10,101];mapq[1,60]' -l 'myvcf.vcf'

Note the single quotes so that it is interpreted as a string literal in BASH.

Full list of options

Usage: variant <input.bam> [OPTIONS] 

Description: Filter a BAM/CRAM file according to hierarchical rules

 General options
      --help                           Display this help and exit
  -v, --verbose                        Verbose output
  -c, --counts-file                    File to place read counts per rule / region
  -x, --counts-file-only               Same as -c, but does counting only (no output BAM)
 Output options
  -o, --output-bam                     Output BAM file to write instead of SAM-format stdout
  -C, --cram                           Output file should be in CRAM format
  -T, --reference                      Path to reference. Required for reading/writing CRAM
  -h, --include-header                 When outputting to stdout, include the header.
  -s, --strip-tags                     Remove the specified tags, separated by commas. eg. -s RG,MD
  -S, --strip-all-tags                 Remove all alignment tags
 Filtering options
  -q, --qc-file                        Output a qc file that contains information about BAM
  -m, --max-coverage                   Maximum coverage of output file. BAM must be sorted. Negative values enforce a minimum coverage.
  -g, --region                         Regions (e.g. myvcf.vcf or WG for whole genome) or newline seperated subsequence file.  Applied in same order as -r for multiple
  -G, --exclude-region                 Same as -g, but for region where satisfying a rule EXCLUDES this read. Applied in same order as -r for multiple
  -l, --linked-region                  Same as -g, but turns on mate-linking
  -L, --linked-exclue-region           Same as -l, but for mate-linked region where satisfying this rule EXCLUDES this read.
  -r, --rules                          Script for the rules. If specified multiple times, will be applied in same order as -g
  -k, --proc-regions-file              Samtools-style region string (e.g. 1:1,000,000-2,000,000) or BED file of regions to proess reads from
  -P, --region-pad                     Apply a padding to each region supplied to variantBam with the -l, -L, -g or -G flags

Full list of available rules

    #RULE           #EXAMPLE             #DESCRIPTION OF EXAMPLE / FLAG 
    motif           motif[seqs.txt]  	 File containing substrings that must be present in the sequence.
    ins             ins[5,101]           Number of inserted bases on the reads (from parsed CIGAR string)
    del             del[10,101]          Number of deleted bases relative to reference (from parsed CIGAR string). 
    nm              nm[0,4]              NM tag from BAM (number of mismatches). e.g. must be 0-4 inclusive
    xp              xp[0,4]              Number of supplementary aligments, with XP or XA tag from BAM (hold identity of supplementary alignments)
    isize           isize[100,500]       Insert size, where all insert sizes are converted to positive.
    len             len[80,101]          Length of the read following phred trimming
    clip            clip[0,5]            Number of clipped bases following phred trimming
    nbases          nbases[0,5]          Removed reads that have within this range of N bases.
    phred           phred[4,100]         Range of phred scores that are "quality" 
    duplicate       duplicate            Read must be marked as optical duplicate 
    supp            !supp                Read must be primary alignment
    qcfail          !qcfail              Read must note be marked as QC Fail
    fwd_strand      fwd_strand           Read must be mapped to forward strand
    rev_strand      rev_strand           Read must be mapped to reverse strand
    mate_fwd_strand mate_fwd_strand      Mate of read must be mapped to forward strand
    mate_rev_strand mate_rev_strand      Mate of read must be mapped to reverse strand  
    mapped          !mapped              Read must be unmapped
    mapped_mate     mapped_mate          Mate must be mapped
    subsample       subsample[0.4]       Subsample this region to at a certain rate
    ff              ff                   Read pair must have forward-forward orientation
    rr              rr                   Read pair must have reverse-reverse orientation
    fr              fr                   Read pair must have forward-reverse orientation (proper)
    rf              rf                   Read pair must have reverse-forward orientation
    ic              ic                   Read pair must have inter-chromosomal mapping
    discordant      discordant[100,600]  Shortcut for !isize[100,600] || rr || ff || rf || ic (!discordant gives "proper" pairs)

Longer example rules script

##################################
### filters to apply to every read
##################################
[email protected]!duplicate;!hardclip;!qcfail;!supplementary

#######################################################
## specify rules that apply to reads anywhere in genome
#######################################################
[email protected]

#### Reads with abs(isize) outside of this range are kept. Also reads
## with non FR orientation. They must be both mapped though, and have
## min mapq of 1
discordant[0,1200];mapped;mate_mapped;mapq[1,100]

#### Reads with 1+ secondary alignments (as stored in XA or XP tag) are kept
xp[1,100]

### Reads without N in sequence, and with clip of 5+ bases (where each
### base must have phred of 4+ are kept). Must also be sufficiently
### long and have min mapq of 1
nbases[0,0];clip[5,101];phred[4,100];!length[0,20];mapq[1,100]

### Any read with non-zero mapq and an insertion (by CIGAR) of 1+ are kept
mapq[1,100];ins[1,100]

### Same, but for dels
mapq[1,100];del[1,100]

### any unmapped mate, and sufficient optical quality (>40 bases with
### phred 4+), is kept
!mapped;mate_mapped;phred[4,100];!length[0,40]

### any read with unmapped mate and mapq >= 10 is kept
!mate_mapped;mapped;mapq[10,100]

### can match sequences against a database of motifs. If read has
### sequence motif in the provided dictionary, it is kept
motif[VariantBam/test/motifs.txt]

################################################################
### can take ALL reads that overlap, or mate overlaps, a region. 
### the mlregion tag is for "mate-linked", which means a read passes
### if either it or its mate overlaps regions
#################################################################
[email protected]
all

### can specify MuTect regions with KEEP fields as input
#[email protected]_stats.txt
#all

### can specify VCF regions to keep. Here we also pad the regions by
### 500bp on each side
pad[500];[email protected]/test/test.vcf
all

Attributions

VariantBam is developed and maintained by Jeremiah Wala ([email protected]) -- Rameen Berkoukhim's lab -- Dana Farber Cancer Institute, Boston, MA.

This project was developed in collaboration with the Cancer Genome Analysis team at the Broad Institute. Particular thanks to:

Cheng-Zhong Zhang (Matthew Meyerson Lab)
Marcin Imielinski
Gad Getz
Mara Rosenberg
Esther Rheinbay
Gordon Saksena

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 11

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (0) 🔗