Alternatives and detailed information of sentieon-dnaseq

Deprecated. Updated benchmarks available at https://github.com/Sentieon/sentieon-matching-performance

Match the results of GATK germline pipeline with Sentieon

Introduction

This documents describes the capabilities of Sentieon DNAseq pipeline matching different versions of GATK germline pipelines. If you have any additional questions, please visit https://www.sentieon.com/support or contact the technical support at Sentieon Inc. at [email protected].

Input and references

Fastq files of NA12878 were downloaded from FTP site:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/Project_RM8398/

Hg38 and other databases were downloaded from GATK resource bundle.

See here: https://software.broadinstitute.org/gatk/download/bundle

Arguments	File
fasta	Homo_sapiens_assembly38.fasta
known_Mills	Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
known_1000G	1000G_phase1.snps.high_confidence.hg38.vcf.gz
known_dbsnp	dbsnp_146.hg38.vcf.gz
calling_intervals_list	wgs_calling_regions.hg38.interval_list

Data preprocessing

Step 1. BWA alignment

BWA 0.7.15-r1140:

bwa mem -M -Y -K 10000000 \
  -R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
  $fasta $fastq1 $fastq2 | \
  samtools sort -o sorted.bam
samtools index sorted.bam

Sentieon:

sentieon bwa mem -M -Y -K 10000000 \
  -R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
  $fasta $fastq1 $fastq2 | \
  sentieon util sort -i - \
  -r $fasta -o sorted.bam --sam2bam

Step 2. PCR duplicate removal

GATK3.7/3.8(Picard):

java -jar picard.jar MarkDuplicates \
  I=sorted.bam \
  O=deduplicated.bam \
  M=duplication.metrics \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true

GATK4:

gatk MarkDuplicates \
  -I sorted.bam \
  -O deduplicated.bam \
  -M duplication.metrics \
  --REMOVE_DUPLICATES true \
  --CREATE_INDEX true

Sentieon:

sentieon driver -r $fasta -i sorted.bam \
  --algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
  --algo Dedup --rmdup --score_info score.txt.gz deduped.bam

Step 3. Base Quality Score Recalibration

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -I deduplicated.bam \
  -R $fasta \
  --knownSites $known_Mills \
  --knownSites $known_1000G \
  --knownSites $known_dbsnp \
  -o bqsr.grp
java -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -R $fasta \
  -I deduplicated.bam \
  -BQSR bqsr.grp \
  -o recalibrated.bam

GATK 4:

gatk BaseRecalibrator \
  -I deduplicated.bam \
  -R $fasta \
  --known-sites $known_Mills \
  --known-sites $known_1000G \
  --known-sites $known_dbsnp \
  -O bqsr.grp
gatk ApplyBQSR \
  -R $fasta \
  -I deduplicated.bam \
  --bqsr-recal-file bqsr.grp \
  -O recalibrated.bam

Sentieon*:

sentieon driver -r $fasta \
  -i deduped.bam \
  --algo QualCal \
  -k $known_dbsnp \
  -k $known_1000G \
  -k $known_Mills \
  recal_data.table

*Sentieon variant callers can perform the recalibration on the fly using a pre-recalibration bam plus the recalibration table. Recalibrated bam can be generated by the ReadWriter algo.

# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam

Germline variant caller

Command line to compare GATK and Sentieon DNAseq results:

Output of GATK is used as the baseline.

hap.py \
GATK.vcf.gz \
Sentieon.vcf.gz \
-o output_dir \
-r Homo_sapiens_assembly38.fasta \
--engine=vcfeval \
--engine-vcfeval-template hs38.sdf

GATK 3.7/3.8:

Command line:

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -ERC GVCF \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  --variant output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -o output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  --call_conf 10 \
  --emit_conf 10 \
  -d $known_dbsnp \
  output.vcf.gz

Results:

Type	TRUTH			QUERY		METRIC
Type	TOTAL	TP	FN	TOTAL	FP	Recall	Precision	F1_Score
INDEL	848723	848238	485	874360	538	0.999429	0.999385	0.999407
SNP	4001821	4000797	1024	4005753	1033	0.999744	0.999742	0.999743

GATK 4.0

Command line:

GATK 4.0:

gatk HaplotypeCaller \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -ERC GVCF \
  -O output.g.vcf.gz
gatk GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  -V output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -O output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  --call_conf 10 \
  --emit_conf 10 \
  -d $known_dbsnp \
  output.vcf.gz

Results:

Type	TRUTH			QUERY		METRIC
Type	TOTAL	TP	FN	TOTAL	FP	Recall	Precision	F1_Score
INDEL	849960	846375	3585	874364	2434	0.995782	0.997216	0.996499
SNP	4003643	3998527	5116	4005750	3319	0.998722	0.999171	0.998947

GATK 4.1

Command line:

GATK 4.1:

gatk HaplotypeCaller \
  -R $fasta \
  -L $calling_intervals_list \
  -I recalibrated.bam \
  -ERC GVCF \
  -O output.g.vcf.gz
gatk GenotypeGVCFs \
  -R $fasta \
  -L $calling_intervals_list \
  -V output.g.vcf.gz \
  --dbsnp $known_dbsnp \
  -O output.vcf.gz

Sentieon*:

sentieon driver -r $fasta \
  -i deduped.bam \
  -q recal_data.table \
  --interval $calling_intervals_list \
  --algo Haplotyper \
  --emit_mode gvcf \
  output.g.vcf.gz
sentieon driver -r $fasta \
  --interval $calling_intervals_list \
  --algo GVCFtyper \
  -v output.g.vcf.gz \
  -d $known_dbsnp \
  --genotype_model multinomial \
  output.vcf.gz

*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.

Results:

Type	TRUTH			QUERY		METRIC
Type	TOTAL	TP	FN	TOTAL	FP	Recall	Precision	F1_Score
INDEL	855716	850790	4926	894426	10869	0.994243	0.987848	0.991035
SNP	3999272	3990379	8893	4006624	11826	0.997776	0.997048	0.997412

Runtime

Computing environment:

Google Compute Engine
n1-standard-32 (32 vCPUs, 120 GB memory)
Local SSD Scratch Disk 2x375G
centos-7-v20190619

Stage	Sentieon	GATK3.8	GATK4.0	GATK4.1
Alignment	2:42:44	5:38:35	5:49:39	5:45:39
Dedup	0:06:16	4:04:25	2:11:43	2:06:32
BQSR	0:10:10	4:17:09	1:39:57	1:40:06
HaplotypeCaller	0:41:02	3:21:37	6:56:53	5:37:52
GenotypeGVCFs	0:00:55	2:04:08	2:02:55	2:05:22
Total	3:41:07	19:25:54	18:41:07	17:15:31
Sentieon SpeedUp	--	5.3X	5.1X	4.7X

Benchmark with HG001 30x on AWS

System configuration

The benchmark was performed on two different instances. Both instances have Intel® Xeon® Platinum 8124M CPU @ 3.00GHz with dual stripped NVMe SSD.

Intance	vCPU	Memory
c5d.9xlarge	36	72GB
c5d.18xlarge	72	144GB

Benchmark results

On both instances, HG001 30x was processed and completed in less than 90 core-hours.

Machine	c5d.9xlarge		c5d.18xlarge
Stage	time (hh:mm)	core*hours	time(hh:mm)	core*hours
Alignment	01:41	60.67	00:54	65.12
LocusCollector	00:01	0.93	00:01	1.5
Dedup	00:03	1.47	00:03	2.48
BQSR	00:05	3.14	00:03	3.56
HC	00:24	14.41	00:13	16.16
GVCFtyper	00:01	0.3	00:01	0.34
Total	02:24	80.92	01:24	89.16

Accuracy Validation with Giab Truthset

Input Data files

For this evaluation, we used both HG001 and HG002 with depth of about 50x from the PrecisionFDA truth challenge. Reference b37 is used for this benchmark.

Sample	Reads 1	Reads 2
HG001 (50x)	HG001-NA12878-50x_1.fastq.gz	HG001-NA12878-50x_2.fastq.gz
HG002 (50x)	HG002-NA24385-50x_1.fastq.gz	HG002-NA24385-50x_2.fastq.gz

Truth set VCF and high-confidence region

The truthset of HG001 and HG002 can be found at Giab latest release page.

Name	File
HG001 VCF	HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
HG001 BED	HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed
HG002 VCF	HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz
HG002 BED	HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed

Accuracy Benchmarking Results

Sample	Type	TP	FN	FP	Recall	Precision	F1_Score
HG001	INDEL	359926	3112	10133	0.9914	0.9726	0.9819
HG001	SNP	2785549	1741	7236	0.9994	0.9974	0.9984
HG002	INDEL	462614	806	1085	0.9983	0.9977	0.9980
HG002	SNP	3046197	1640	5339	0.9995	0.9983	0.9989

Using Sentieon DNAscope with machine learning model, we are able to further improve the variant calling accuracy. Please see DNAscope Machine Learning Model for more details.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Sentieon / sentieon-dnaseq

Programming Languages

Labels

Projects that are alternatives of or similar to sentieon-dnaseq

Match the results of GATK germline pipeline with Sentieon

Introduction

Input and references

Data preprocessing

Step 1. BWA alignment

Step 2. PCR duplicate removal

Step 3. Base Quality Score Recalibration

Germline variant caller

GATK 3.7/3.8:

GATK 4.0

GATK 4.1

Runtime

Benchmark with HG001 30x on AWS

System configuration

Benchmark results

Accuracy Validation with Giab Truthset

Input Data files

Truth set VCF and high-confidence region

Accuracy Benchmarking Results