Bystro

Bystro Publication

For datasets and scripts used, please visit github.com/bystro-paper

If you use Bystro, please cite Kotlar et al., Genome Biology, 2018.

Web Tutorial

We recommend https://bystro.io for most users.

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

Check out the master branch for the upcoming release

The easiest way is to run from Docker: docker pull akotlar/bystro:latest && docker run akotlar/bystro:latest bystro-annotate.pl

Please read: INSTALL.md for instructions on how to download and use Bystro hg19/hg38/etc databases.

Bystro relies on pre-processors, pluggable via Bystro's YAML config, to normalize variant inputs (handling VCF issues such as padding) and to compute per-site statistics: whether a site is a transition or transversion, the sample MAF, which samples are heterozygous, homozygous, or missing, and the heterozygosity, homozygosity, and missingness, among others.

VCF format: Bystro-Vcf
SNP format: Bystro-SNP
Create your own to support other formats!
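
To give a feel for the per-site statistics a pre-processor reports, here is a minimal Python sketch. It is not Bystro's implementation: the function names, the genotype labels ("ref", "het", "hom", "missing"), the diploid biallelic model, and the choice of called samples as the denominator are all assumptions made for illustration. The alternate-allele frequency is used as a stand-in for the sample MAF.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for a transition (purine<->purine or pyrimidine<->pyrimidine).

    Any other single-base change is a transversion.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def site_summary(genotypes: list[str]) -> dict[str, float]:
    """Summarize one site from hypothetical per-sample genotype labels.

    Assumes diploid samples and one alternate allele; rates other than
    missingness are computed over called (non-missing) samples.
    """
    counts = Counter(genotypes)
    total = len(genotypes)
    called = total - counts["missing"]
    alt_alleles = counts["het"] + 2 * counts["hom"]
    return {
        "heterozygosity": counts["het"] / called if called else 0.0,
        "homozygosity": counts["hom"] / called if called else 0.0,
        "missingness": counts["missing"] / total if total else 0.0,
        "alt_allele_frequency": alt_alleles / (2 * called) if called else 0.0,
    }
```

For example, `is_transition("A", "G")` is true while `is_transition("A", "C")` is false, and `site_summary(["ref", "het", "hom", "missing"])` reports a missingness of 0.25.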

Annotation (Output) Field Descriptions

Please read FIELDS.md

The Bystro configuration file

The config file describes the state of both the database and the annotation. It is required for both building and annotating.
It has several keys:
- tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:
  - sparse: Any bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.
    - This is used for dbSNP and Clinvar records, but many other files can be fit to this format.
    - Mapping fields can be managed by the fieldMap key
  - score: Accepts any wigFix file.
    - Used for phastCons, phyloP
  - cadd:
    - Accepts any CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines, and chrom, chromStart, chromEnd columns, followed by standard CADD fields
    - CADD format: http://cadd.gs.washington.edu
  - gene: A UCSC gene track field (ex: knownGene, refGene, sgdGene).
    - The local_files for this are created using an sql_statement
    - Ex: SELECT * FROM hg38.refGene LEFT JOIN hg38.kgXref ON hg38.kgXref.refseq = hg38.refGene.name
- chromosomes: The allowable chromosomes.
  - Each row of every track must be identified by these chromosomes (during building)
  - Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation)
  - However, Bystro is flexible about the chr prefix
  Ex: For the following config
```
chromosomes:
- chr1
- chr2
- chr3
```
  Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:
  1. We currently follow UCSC conventions for chromosomes, meaning they should be prefixed with chr
  2. Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
  3. Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file chromosome field
  Ex: Clinvar doesn't have a chr prefix, so during building we specify:
```
tracks:
  - name: clinvar
    build_field_transformations:
      chrom: chr .
    fieldMap:
      Chromosome: chrom
```
  Here, fieldMap lets us rename header fields, and build_field_transformations lets us define a prepend operation (chr . can be read as the Perl expression "chr" . $chrom)
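
  The effect of these two keys on a single source row can be sketched in Python. This is an illustration only, not Bystro's code: the helper names and the dict-per-row representation are hypothetical, and the sketch handles only the prepend form ("prefix .") described above.

```python
def apply_field_map(row: dict, field_map: dict) -> dict:
    """Rename header fields; fields not listed in field_map keep their name."""
    return {field_map.get(name, name): value for name, value in row.items()}


def apply_prepend(row: dict, transformations: dict) -> dict:
    """Apply prepend transformations written as '<prefix> .' (e.g. 'chr .').

    Only the prepend form is handled in this sketch; anything else raises.
    """
    out = dict(row)
    for field, spec in transformations.items():
        spec = spec.strip()
        if not spec.endswith("."):
            raise ValueError(f"unsupported transformation: {spec!r}")
        prefix = spec[:-1].strip()
        if field in out:
            out[field] = prefix + str(out[field])
    return out


# Hypothetical Clinvar-like row: rename Chromosome -> chrom, then prepend chr.
row = {"Chromosome": "1", "Start": "100"}
renamed = apply_field_map(row, {"Chromosome": "chrom"})
built = apply_prepend(renamed, {"chrom": "chr ."})
```

  In this sketch the rename runs before the transformation, so the transformation is keyed by the new name (chrom), matching the config above.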
  
  So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.
  
  In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
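
  The normalization described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Bystro's implementation: normalize_chrom is a hypothetical helper, and the allowed list mirrors the chromosomes key from the example config.

```python
from typing import Optional

# Mirrors the `chromosomes` key in the example config above.
ALLOWED_CHROMOSOMES = ["chr1", "chr2", "chr3"]


def normalize_chrom(chrom: str, allowed: list[str]) -> Optional[str]:
    """Prepend 'chr' when it is missing, then accept only configured names.

    Returns the normalized name, or None when the chromosome is not allowed.
    """
    name = chrom if chrom.startswith("chr") else "chr" + chrom
    return name if name in allowed else None
```

  Here both `normalize_chrom("1", ALLOWED_CHROMOSOMES)` and `normalize_chrom("chr1", ALLOWED_CHROMOSOMES)` yield chr1, while 1_rand normalizes to chr1_rand, which is not in the list and is rejected.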

Directories and Files

These describe where the Bystro database and any source files are located.

files_dir : The parent folder within which each track's local_files are located

Bystro automatically checks for local_files at files_dir/trackName/file

Ex: For the config file containing
```
files_dir: /path/to/files/
tracks:
  - name: refSeq
    local_files:
      - hg19.refGene.chr1.gz
      # and more files
```
Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz

database_dir : Each database is held within database_dir, in a folder named after the assembly

Ex: For the config file containing
```
assembly: hg19
database_dir: /path/to/databases/
```
Bystro will look for the database /path/to/databases/hg19
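
Both path rules can be checked with a short Python sketch. It assumes the YAML config has already been parsed into a dict with a top-level tracks list; the helper names are hypothetical and this is not how Bystro itself resolves paths internally.

```python
from pathlib import PurePosixPath


def local_file_paths(config: dict) -> list[str]:
    """Expand each track's local_files to files_dir/trackName/file."""
    files_dir = PurePosixPath(config["files_dir"])
    return [
        str(files_dir / track["name"] / file_name)
        for track in config.get("tracks", [])
        for file_name in track.get("local_files", [])
    ]


def database_path(config: dict) -> str:
    """The database lives at database_dir/assembly."""
    return str(PurePosixPath(config["database_dir"]) / config["assembly"])


# Values taken from the two example configs above.
config = {
    "assembly": "hg19",
    "database_dir": "/path/to/databases/",
    "files_dir": "/path/to/files/",
    "tracks": [{"name": "refSeq", "local_files": ["hg19.refGene.chr1.gz"]}],
}
```

With this config, `local_file_paths(config)` gives /path/to/files/refSeq/hg19.refGene.chr1.gz and `database_path(config)` gives /path/to/databases/hg19.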
