broadinstitute / dsde-deep-learning

Licence: other
DSDE Deep Learning Club

Deep Learning Recipes for DNA reads and short variants.

Setting up your environment

We recommend using Anaconda to manage your Python environments. For CPU-only libraries:

conda env create -n gatk -f ./envs/gatkcondaenv_cpu.yml

To use a GPU, you will need an NVIDIA GPU with CUDA and cuDNN installed; the TensorFlow documentation has good installation instructions. Then create the GPU environment:

conda env create -n gatk -f ./envs/gatkcondaenv_gpu.yml

Training models from example tensors

In the data directory we provide a small dataset of reference and read tensors from the NA12878 sample.

The reference tensors are input for a 1D CNN. Each is a one-hot encoding of 128 base pairs of reference sequence centered at a variant.

The read tensors are input for a 2D CNN. They encode reference and read sequence as well as read metadata. They use the TensorFlow default channel ordering: reads x sequence x channels. You can toggle between TensorFlow and Theano channel ordering with the --channels_last and --channels_first arguments.

Uncompress the example tensors with tar:

cd data
tar -xzvf example_reference_tensors_chr1.tar.gz
tar -xzvf example_read_tensors_chr1_channels_last.tar.gz
cd ..
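To make the reference tensor layout concrete, here is a minimal sketch of one-hot encoding a 128-base window centered at a variant. It is not the project's own code: the A, C, G, T channel order, the all-zero treatment of ambiguous bases, and the helper names one_hot_reference and centered_window are assumptions made for illustration.

```python
import numpy as np

# Assumed base-to-channel order; the project's actual mapping may differ.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def centered_window(contig, variant_index, window_size=128):
    """Slice window_size bases so the variant sits at offset window_size // 2."""
    start = variant_index - window_size // 2
    if start < 0 or start + window_size > len(contig):
        raise ValueError("window does not fit inside the contig")
    return contig[start:start + window_size]


def one_hot_reference(sequence, window_size=128):
    """Return a (window_size, 4) float32 array with a single 1.0 per known base."""
    if len(sequence) != window_size:
        raise ValueError("expected %d bases, got %d" % (window_size, len(sequence)))
    tensor = np.zeros((window_size, len(BASE_TO_INDEX)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_TO_INDEX.get(base)
        if index is not None:  # assumption: ambiguous bases such as N stay all-zero
            tensor[position, index] = 1.0
    return tensor


# Toy contig with a single G at index 100, standing in for a variant site.
contig = "A" * 100 + "G" + "A" * 100
window = centered_window(contig, variant_index=100)
tensor = one_hot_reference(window)
print(tensor.shape)
```

The resulting array has one row per reference position and one column per base, which is the sequence x channels shape a 1D CNN consumes.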

Train a model that predicts variant quality from read tensors and variant annotations:

python recipes.py train_ref_read_anno \
  --data_dir ./data/example_read_tensors_chr1_channels_last/ \
  --tensor_map read_tensor \
  --annotation_set best_practices \
  --id ref_read_anno_model

Train a model that predicts variant quality from read tensors:

python recipes.py train_ref_read \
  --data_dir ./data/example_read_tensors_chr1_channels_last/ \
  --tensor_map read_tensor \
  --id ref_read_model

Train a model that predicts variant quality from reference sequence and annotations:

python recipes.py train_reference_annotation \
  --data_dir ./data/example_reference_tensors_chr1/ \
  --tensor_map reference \
  --annotation_set best_practices \
  --id ref_anno_model

Train a model that predicts variant quality from reference sequence only:

python recipes.py train_reference \
  --data_dir ./data/example_reference_tensors_chr1/ \
  --tensor_map reference \
  --id ref_model

Write tensors with your own data

Create read tensors from a truth VCF, a confident-region BED file, unfiltered variant calls, and aligned reads:

python recipes.py write_tensors \
  --reference_fasta reference.fasta \
  --train_vcf validated_calls.vcf.gz \
  --negative_vcf my_unfiltered_calls.vcf.gz \
  --bed_file validated_calls_confident_region.bed \
  --data_dir ./data/my_read_tensors/ \
  --bam_file my_aligned_reads.bam \
  --tensor_map read_tensor \
  --channels_last \
  --read_limit 128 \
  --window_size 128
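The --channels_last flag in the command above selects the reads x sequence x channels layout. As a rough illustration of what switching layouts means, the sketch below moves the channel axis with NumPy. The channel count of 15 is a hypothetical placeholder (the real count depends on the tensor map), and the helper names are not part of recipes.py.

```python
import numpy as np


def to_channels_first(tensor):
    """Move the trailing channel axis to the front: (R, S, C) -> (C, R, S)."""
    return np.moveaxis(tensor, -1, 0)


def to_channels_last(tensor):
    """Move the leading channel axis to the back: (C, R, S) -> (R, S, C)."""
    return np.moveaxis(tensor, 0, -1)


# read_limit and window_size match the command above; channels is hypothetical.
read_limit, window_size, channels = 128, 128, 15
channels_last = np.zeros((read_limit, window_size, channels), dtype=np.float32)
channels_first = to_channels_first(channels_last)
print(channels_first.shape)  # (15, 128, 128)
```

A model trained on one layout expects the same layout at inference time, so it is simplest to pick one ordering when writing tensors and keep it throughout.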

Create reference tensors from a truth VCF, a confident-region BED file, and unfiltered variant calls:

python recipes.py write_dna_tensors \
  --reference_fasta reference.fasta \
  --train_vcf validated_calls.vcf.gz \
  --negative_vcf my_unfiltered_calls.vcf.gz \
  --bed_file validated_calls_confident_region.bed \
  --data_dir ./data/my_reference_tensors/ \
  --tensor_map reference \
  --window_size 128

You can downsample specific classes with the --downsample_class_label arguments, where class_label names the class to thin out. For example, to write only 10% of the positive SNPs, add --downsample_snps 0.1 to your command line. To keep half of the negative indel examples, use --downsample_not_indels 0.5.
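One common way to implement this kind of per-class downsampling is to keep each example with probability equal to its class rate. The sketch below assumes that approach; the source does not say how recipes.py actually samples, and the keep_example helper and the class label keys are hypothetical.

```python
import random


def keep_example(class_label, downsample_rates, rng):
    """Keep an example with probability equal to its class rate (default 1.0)."""
    rate = downsample_rates.get(class_label, 1.0)
    return rng.random() < rate


# Hypothetical label keys mirroring the two flags mentioned above.
rates = {"snps": 0.1, "not_indels": 0.5}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

kept_snps = sum(keep_example("snps", rates, rng) for _ in range(10000))
kept_other = sum(keep_example("indels", rates, rng) for _ in range(100))
print(kept_snps, kept_other)
```

Classes without an entry in the rate table are always kept, so only the classes you name are thinned out.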

You can also parallelize over the genome via the --chrom, --start_pos, and --end_pos arguments.
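As a sketch of how that parallelization might be scripted, the code below splits a region into fixed-size chunks and builds one write_tensors command per chunk. The flags are the ones shown in this README. The chromosome name "1", the chunk size, the half-open interval convention, and the helper names are assumptions, so check how recipes.py interprets --start_pos and --end_pos before relying on the boundaries.

```python
def split_region(chrom, start_pos, end_pos, chunk_size):
    """Yield (chrom, start, end) chunks that tile [start_pos, end_pos)."""
    for start in range(start_pos, end_pos, chunk_size):
        yield chrom, start, min(start + chunk_size, end_pos)


def write_tensors_command(base_args, chrom, start, end):
    """Build the argument list for one chunk of the genome."""
    return ["python", "recipes.py", "write_tensors", *base_args,
            "--chrom", chrom, "--start_pos", str(start), "--end_pos", str(end)]


# Arguments shared by every chunk, copied from the write_tensors example.
base_args = [
    "--reference_fasta", "reference.fasta",
    "--train_vcf", "validated_calls.vcf.gz",
    "--negative_vcf", "my_unfiltered_calls.vcf.gz",
    "--bed_file", "validated_calls_confident_region.bed",
    "--data_dir", "./data/my_read_tensors/",
    "--bam_file", "my_aligned_reads.bam",
    "--tensor_map", "read_tensor",
]

# Hypothetical region: the first million positions of chromosome "1".
regions = list(split_region("1", 1, 1000001, 250000))
commands = [write_tensors_command(base_args, *region) for region in regions]
for command in commands:
    print(" ".join(command))
```

Each printed command is independent of the others, so they can be handed to separate processes or to a cluster scheduler.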
