All Projects → Sentieon → sentieon-dnaseq

Sentieon / sentieon-dnaseq

Licence: other
Sentieon DNAseq

Programming Languages

shell
77523 projects

Projects that are alternatives of or similar to sentieon-dnaseq

GenomicsDB
Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.
Stars: ✭ 77 (+327.78%)
Mutual labels:  variant-calling, genomics-data
ilus
A handy variant calling pipeline generator for whole genome re-sequencing (WGS) and whole exom sequencing data (WES) analysis. 一个简易且全面的 WGS/WES 分析流程生成器.
Stars: ✭ 64 (+255.56%)
Mutual labels:  bioinformatics-pipeline, variant-calling
TOGA
TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
Stars: ✭ 35 (+94.44%)
Mutual labels:  bioinformatics-pipeline
bystro
Bystro genetic analysis (annotation, filtering, statistics)
Stars: ✭ 31 (+72.22%)
Mutual labels:  bioinformatics-pipeline
dysgu
dysgu-SV is a collection of tools for calling structural variants using short or long reads
Stars: ✭ 47 (+161.11%)
Mutual labels:  variant-calling
Riverbed-Community-Toolkit
Riverbed Community Toolkit is a public toolkit for Riverbed Solutions engineering and integration
Stars: ✭ 16 (-11.11%)
Mutual labels:  acceleration
saffrontree
SaffronTree: Reference free rapid phylogenetic tree construction from raw read data
Stars: ✭ 17 (-5.56%)
Mutual labels:  bioinformatics-pipeline
iGenomics
The first app for Mobile DNA Sequence Alignment and Analysis
Stars: ✭ 33 (+83.33%)
Mutual labels:  dna-sequencing
inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Stars: ✭ 133 (+638.89%)
Mutual labels:  acceleration
Literatures-on-GNN-Acceleration
A reading list for deep graph learning acceleration.
Stars: ✭ 50 (+177.78%)
Mutual labels:  acceleration
ivar
iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
Stars: ✭ 68 (+277.78%)
Mutual labels:  variant-calling
TeamTeri
Genomics using open source tools, running on GCP or AWS
Stars: ✭ 30 (+66.67%)
Mutual labels:  gatk
open-cravat
A modular annotation tool for genomic variants
Stars: ✭ 74 (+311.11%)
Mutual labels:  bioinformatics-pipeline
arcsv
Complex structural variant detection from WGS data
Stars: ✭ 16 (-11.11%)
Mutual labels:  variant-calling
plasmidtron
Assembling the cause of phenotypes and genotypes from NGS data
Stars: ✭ 27 (+50%)
Mutual labels:  bioinformatics-pipeline
rinzler
An autonomous parallel processing engine for the browser.
Stars: ✭ 36 (+100%)
Mutual labels:  acceleration
vdpau-va-driver-vp9
Experimental VP9 codec support for vdpau-va-driver (NVIDIA VDPAU-VAAPI wrapper) and chromium-vaapi
Stars: ✭ 68 (+277.78%)
Mutual labels:  acceleration
pyrpipe
Reproducible bioinformatics pipelines in python. Import any Unix tool/command in python.
Stars: ✭ 53 (+194.44%)
Mutual labels:  bioinformatics-pipeline
germline-DNA
A BioWDL variantcalling pipeline for germline DNA data. Starting with FASTQ files to produce VCF files. Category:Multi-Sample
Stars: ✭ 21 (+16.67%)
Mutual labels:  gatk4
ccs
CCS: Generate Highly Accurate Single-Molecule Consensus Reads (HiFi Reads)
Stars: ✭ 79 (+338.89%)
Mutual labels:  variant-calling

Deprecated. Updated benchmarks available at https://github.com/Sentieon/sentieon-matching-performance

Match the results of GATK germline pipeline with Sentieon

Introduction

This documents describes the capabilities of Sentieon DNAseq pipeline matching different versions of GATK germline pipelines. If you have any additional questions, please visit https://www.sentieon.com/support or contact the technical support at Sentieon Inc. at [email protected].

Input and references

Fastq files of NA12878 were downloaded from FTP site:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/Project_RM8398/

Hg38 and other databases were downloaded from GATK resource bundle.

See here: https://software.broadinstitute.org/gatk/download/bundle

Arguments File
fasta Homo_sapiens_assembly38.fasta
known_Mills Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
known_1000G 1000G_phase1.snps.high_confidence.hg38.vcf.gz
known_dbsnp dbsnp_146.hg38.vcf.gz
calling_intervals_list wgs_calling_regions.hg38.interval_list

Data preprocessing

Step 1. BWA alignment

BWA 0.7.15-r1140:

bwa mem -M -Y -K 10000000 \
  -R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
  $fasta $fastq1 $fastq2 | \
  samtools sort -o sorted.bam
samtools index sorted.bam

Sentieon:

sentieon bwa mem -M -Y -K 10000000 \
  -R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
  $fasta $fastq1 $fastq2 | \
  sentieon util sort -i - \
  -r $fasta -o sorted.bam --sam2bam

Step 2. PCR duplicate removal

GATK3.7/3.8(Picard):

java -jar picard.jar MarkDuplicates \
  I=sorted.bam \
  O=deduplicated.bam \
  M=duplication.metrics \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true

GATK4:

gatk MarkDuplicates \
  -I sorted.bam \
  -O deduplicated.bam \
  -M duplication.metrics \
  --REMOVE_DUPLICATES true \
  --CREATE_INDEX true

Sentieon:

sentieon driver -r $fasta -i sorted.bam \
  --algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
  --algo Dedup --rmdup --score_info score.txt.gz deduped.bam

Step 3. Base Quality Score Recalibration

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -I deduplicated.bam \
  -R $fasta \
  --knownSites $known_Mills \
  --knownSites $known_1000G \
  --knownSites $known_dbsnp \
  -o bqsr.grp
java -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -R $fasta \
  -I deduplicated.bam \
  -BQSR bqsr.grp \
  -o recalibrated.bam

GATK 4:

gatk BaseRecalibrator \
  -I deduplicated.bam \
  -R $fasta \
  --known-sites $known_Mills \
  --known-sites $known_1000G \
  --known-sites $known_dbsnp \
  -O bqsr.grp
gatk ApplyBQSR \
  -R $fasta \
  -I deduplicated.bam \
  --bqsr-recal-file bqsr.grp \
  -O recalibrated.bam

Sentieon*:

sentieon driver -r $fasta \
  -i deduped.bam \
  --algo QualCal \
  -k $known_dbsnp \
  -k $known_1000G \
  -k $known_Mills \
  recal_data.table

*Sentieon variant callers can perform the recalibration on the fly using a pre-recalibration bam plus the recalibration table. Recalibrated bam can be generated by the ReadWriter algo.

# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam

Germline variant caller

Command line to compare GATK and Sentieon DNAseq results:

Output of GATK is used as the baseline.

hap.py \
GATK.vcf.gz \
Sentieon.vcf.gz \
-o output_dir \
-r Homo_sapiens_assembly38.fasta \
--engine=vcfeval \
--engine-vcfeval-template hs38.sdf

GATK 3.7/3.8:

Command line:

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -ERC GVCF \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  --variant output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -o output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  --call_conf 10 \
  --emit_conf 10 \
  -d $known_dbsnp \
  output.vcf.gz

Results:

Type TRUTH QUERY METRIC
TOTAL TP FN TOTAL FP Recall Precision F1_Score
INDEL 848723 848238 485 874360 538 0.999429 0.999385 0.999407
SNP 4001821 4000797 1024 4005753 1033 0.999744 0.999742 0.999743

GATK 4.0

Command line:

GATK 4.0:

gatk HaplotypeCaller \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -ERC GVCF \
  -O output.g.vcf.gz
gatk GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  -V output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -O output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  --call_conf 10 \
  --emit_conf 10 \
  -d $known_dbsnp \
  output.vcf.gz

Results:

Type TRUTH QUERY METRIC
TOTAL TP FN TOTAL FP Recall Precision F1_Score
INDEL 849960 846375 3585 874364 2434 0.995782 0.997216 0.996499
SNP 4003643 3998527 5116 4005750 3319 0.998722 0.999171 0.998947

GATK 4.1

Command line:

GATK 4.1:

gatk HaplotypeCaller \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -ERC GVCF \
  -O output.g.vcf.gz
gatk GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  -V output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -O output.vcf.gz

Sentieon*:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  -d $known_dbsnp \
  --genotype_model multinomial \
  output.vcf.gz

*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.

Results:

Type TRUTH QUERY METRIC
TOTAL TP FN TOTAL FP Recall Precision F1_Score
INDEL 855716 850790 4926 894426 10869 0.994243 0.987848 0.991035
SNP 3999272 3990379 8893 4006624 11826 0.997776 0.997048 0.997412

Runtime

Computing environment:

  • Google Compute Engine
  • n1-standard-32 (32 vCPUs, 120 GB memory)
  • Local SSD Scratch Disk 2x375G
  • centos-7-v20190619

Stage Sentieon GATK3.8 GATK4.0 GATK4.1
Alignment 2:42:44 5:38:35 5:49:39 5:45:39
Dedup 0:06:16 4:04:25 2:11:43 2:06:32
BQSR 0:10:10 4:17:09 1:39:57 1:40:06
HaplotypeCaller 0:41:02 3:21:37 6:56:53 5:37:52
GenotypeGVCFs 0:00:55 2:04:08 2:02:55 2:05:22
Total 3:41:07 19:25:54 18:41:07 17:15:31
Sentieon SpeedUp -- 5.3X 5.1X 4.7X

Benchmark with HG001 30x on AWS

System configuration

The benchmark was performed on two different instances. Both instances have Intel® Xeon® Platinum 8124M CPU @ 3.00GHz with dual stripped NVMe SSD.

Intance vCPU Memory
c5d.9xlarge 36 72GB
c5d.18xlarge 72 144GB

Benchmark results

On both instances, HG001 30x was processed and completed in less than 90 core-hours.

Machine c5d.9xlarge c5d.18xlarge
Stage time (hh:mm) core*hours time(hh:mm) core*hours
Alignment 01:41 60.67 00:54 65.12
LocusCollector 00:01 0.93 00:01 1.5
Dedup 00:03 1.47 00:03 2.48
BQSR 00:05 3.14 00:03 3.56
HC 00:24 14.41 00:13 16.16
GVCFtyper 00:01 0.3 00:01 0.34
Total 02:24 80.92 01:24 89.16

Accuracy Validation with Giab Truthset

Input Data files

For this evaluation, we used both HG001 and HG002 with depth of about 50x from the PrecisionFDA truth challenge. Reference b37 is used for this benchmark.

Sample Reads 1 Reads 2
HG001 (50x) HG001-NA12878-50x_1.fastq.gz HG001-NA12878-50x_2.fastq.gz
HG002 (50x) HG002-NA24385-50x_1.fastq.gz HG002-NA24385-50x_2.fastq.gz

Truth set VCF and high-confidence region

The truthset of HG001 and HG002 can be found at Giab latest release page.

Name File
HG001 VCF HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
HG001 BED HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed
HG002 VCF HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz
HG002 BED HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed

Accuracy Benchmarking Results

Sample Type TP FN FP Recall Precision F1_Score
HG001 INDEL 359926 3112 10133 0.9914 0.9726 0.9819
SNP 2785549 1741 7236 0.9994 0.9974 0.9984
HG002 INDEL 462614 806 1085 0.9983 0.9977 0.9980
SNP 3046197 1640 5339 0.9995 0.9983 0.9989

Using Sentieon DNAscope with machine learning model, we are able to further improve the variant calling accuracy. Please see DNAscope Machine Learning Model for more details.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].