ksahlin / StrobeAlign

Licence: MIT license
Aligns short reads using dynamic seed size with strobemers

strobealign

Strobealign is a fast short-read aligner. It achieves its speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. Strobealign is multithreaded, implements alignment (SAM) and mapping (PAF), and has been benchmarked on single-end (SE) and paired-end (PE) reads of lengths between 100 and 300 bp. A preprint describing v0.4 is available here. The current version is 0.7.1.

See INSTALLATION and USAGE to install and run strobealign. See V0.7 PERFORMANCE for updated accuracy and runtime results, and the release notes for all updates since v0.4, the version described in the preprint.

INSTALLATION

Conda

Strobealign can be installed through conda. Simply run

conda create -n strobealign strobealign

Binaries

Precompiled binaries for Linux and macOS are available from the release page. They are compiled with -O3 -mavx2.

It has been reported that strobealign is even faster when compiled with the flag -march=skylake-avx512 on processors that support AVX-512.
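
Which of the two builds applies depends on the CPU. As a minimal sketch, assuming a Linux system where /proc/cpuinfo lists CPU feature flags, the hypothetical helper below (not part of strobealign) suggests one of the two flags mentioned above. Requiring the five AVX-512 subsets listed in the code is an assumption about what a skylake-avx512 build relies on.

```python
from pathlib import Path

# Assumption: Linux /proc/cpuinfo names of the AVX-512 subsets that a
# -march=skylake-avx512 build is expected to use.
AVX512_FLAGS = {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # The relevant line looks like "flags\t\t: fpu vme ... avx2 ..."
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def suggest_arch_flag(flags):
    """Pick one of the flags named in this README, or None if AVX2 is missing."""
    if AVX512_FLAGS <= flags:
        return "-march=skylake-avx512"
    if "avx2" in flags:
        return "-mavx2"
    return None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print(suggest_arch_flag(cpu_flags(text)))
```

If the helper returns None, the precompiled binaries are unlikely to run on that machine, since they are built with -mavx2.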

From source

If you want to compile from source, you need a recent g++ (tested with version 8 and upwards) and zlib installed. Then do the following:

git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14  main.cpp source/index.cpp source/xxhash.c source/ssw_cpp.cpp source/ssw.c source/pc.cpp source/aln.cpp -lz -lpthread -o strobealign -O3 -mavx2

Zlib linking error

If you have zlib installed, with the zlib.h file in /path/to/zlib/include and the libz.so file in /path/to/zlib/lib, but you get

main.cpp:12:10: fatal error: zlib.h: No such file or directory
 #include <zlib.h>
          ^~~~~~~~
compilation terminated.

add -I/path/to/zlib/include -L/path/to/zlib/lib to the compilation, that is

g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib  main.cpp source/index.cpp source/xxhash.c source/ssw_cpp.cpp source/ssw.c source/pc.cpp source/aln.cpp -lz -lpthread -o strobealign -O3 -mavx2

USAGE

Alignment

For alignment to SAM file:

strobealign ref.fa reads.fa > output.sam

To report secondary alignments, set the parameter -N [INT] to allow a maximum of [INT] secondary alignments.

Mapping

For mapping to PAF file (option -x):

strobealign -x ref.fa reads.fa > output.paf
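
When strobealign is called from a script, the two modes above can be wrapped in a small helper. The sketch below is hypothetical and uses only the options shown in this section (-N and -x). Placing options before the positional arguments follows the -x example; combining -x with -N is not shown here and is left to the caller to verify.

```python
import subprocess


def build_command(ref, reads, secondary=None, paf=False):
    """Build a strobealign argument list using only the options shown above."""
    cmd = ["strobealign"]
    if paf:
        cmd.append("-x")  # mapping mode, PAF output
    if secondary is not None:
        cmd += ["-N", str(secondary)]  # at most this many secondary alignments
    return cmd + [ref, reads]


def run(cmd, out_path):
    """Run the command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Print the command lines instead of running them.
    print(" ".join(build_command("ref.fa", "reads.fa", secondary=5)))
    print(" ".join(build_command("ref.fa", "reads.fa", paf=True)))
```

Calling run(build_command("ref.fa", "reads.fa"), "output.sam") would then correspond to the alignment command above, assuming strobealign is on the PATH.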

V0.7 PERFORMANCE

In the three sections below, we investigate accuracy and runtime metrics for v0.7 on the SIM3 and REPEATS datasets included in the preprint, as well as SNV and small indel calling performance on additional simulated and biological (GIAB) datasets.

For the biological SNV and indel experiments, we used GIAB datasets (HG004; Mother) with 2x150bp reads (subsampled to ~26x coverage) and 2x250bp reads (~17x coverage).

Mapping accuracy and runtime

The figures below show the accuracy (panel A), runtime (panel B), and %-aligned reads (panel C) for the SIM3 (Fig 1) and REPEATS (Fig 2) datasets from the preprint using strobealign v0.7. On all but the 2x100 datasets, strobealign has comparable or higher accuracy than BWA MEM while being substantially faster. On the 2x100 datasets, strobealign has the second highest accuracy after BWA MEM on SIM3 while being substantially faster, and comparable accuracy to minimap2 and BWA MEM on the REPEATS dataset while being twice as fast.

Figure 1. Accuracy (panel A), runtime (panel B), and %-aligned reads (panel C) for the SIM3 dataset.

Figure 2. Accuracy (panel A), runtime (panel B), and %-aligned reads (panel C) for the REPEATS dataset.

Variant calling benchmark (simulated REPEATS)

A small SNV and INDEL calling benchmark with strobealign v0.7 is provided below. We used bcftools to call SNPs and indels on a simulated repetitive genome based on alignments from strobealign, BWA-MEM, and minimap2 (each run with 1 core). The genome is a 16.8Mbp sequence built from 500 concatenated copies of a 40kbp sequence, which is mutated through substitutions (5%) and removal of segments of size 1bp-1kbp (0.5%) along the original 20Mbp string.

Then, 2 million paired-end reads (lengths 100, 150, 200, 250, 300) were simulated from a related genome with a high variation rate: 0.5% SNVs and 0.5% INDELs. The challenge is to find the right location of the reads in the repetitive genome in order to predict the SNVs and INDELs in the related genome. The genome from which the reads are simulated contains about 78k SNVs and about 78k INDELs. The locations of the true SNVs and INDELs are provided by the read simulator. The precision (P), recall (R), and F-score are computed based on the true variants (for details see section Variant calling benchmark method). Results are shown in the table below.

In the experiments strobealign is in general the fastest tool, has the highest SNV precision, and highest precision, recall, and F-score for indels.

Indels are frequent in this dataset (one every 200 bases on average), requiring base-level alignment for most reads. Between 65% and 85% of strobealign's runtime is spent on base-level alignment with the third-party SSW alignment module. The longer the reads, the higher the percentage of time spent on base-level alignment. Speed improvements to base-level alignment libraries would greatly reduce runtime on this dataset.
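
A back-of-the-envelope calculation illustrates why longer reads need base-level alignment more often. The sketch below assumes that indels occur independently at each base with the stated 0.5% rate, which simplifies the actual read simulator; the function names are illustrative only.

```python
def mean_spacing(rate):
    """Average number of bases between events occurring at a per-base rate."""
    return 1.0 / rate


def prob_at_least_one(read_length, rate):
    """Probability that a read of the given length contains at least one event,
    assuming events occur independently at each base (a simplification)."""
    return 1.0 - (1.0 - rate) ** read_length


if __name__ == "__main__":
    indel_rate = 0.005  # 0.5% indels, as stated for the related genome
    print(f"mean spacing: {mean_spacing(indel_rate):.0f} bases")
    for length in (100, 150, 200, 250, 300):
        p = prob_at_least_one(length, indel_rate)
        print(f"{length} bp read: P(at least one indel) = {p:.2f}")
```

Under this simplified model, the chance that a single read overlaps an indel roughly doubles between 100 bp and 300 bp reads, which is in line with the growing share of time spent on base-level alignment.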

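| Read length | Tool | SNVs (P) | SNVs (R) | SNVs (F-score) | Indels (P) | Indels (R) | Indels (F-score) | Alignment time (s) |
|---|---|---|---|---|---|---|---|---|
| 100 | strobealign | 97.9 | 93.5 | 95.6 | 55.6 | 41.1 | 47.2 | 424 |
| 100 | minimap2 | 91.4 | 94.3 | 92.8 | 55.2 | 39.1 | 45.8 | 605 |
| 100 | bwa_mem | 93.7 | 95.9 | 94.8 | 55.3 | 30.0 | 38.9 | 1020 |
| 150 | strobealign | 96.6 | 92.7 | 94.6 | 55.2 | 46.2 | 50.3 | 350 |
| 150 | minimap2 | 89.8 | 94.6 | 92.1 | 54.9 | 44.8 | 49.3 | 902 |
| 150 | bwa_mem | 96.0 | 96.0 | 96.0 | 55.0 | 39.6 | 46.1 | 1010 |
| 200 | strobealign | 97.4 | 94.1 | 95.7 | 55.3 | 45.8 | 50.1 | 487 |
| 200 | minimap2 | 88.1 | 96.7 | 92.2 | 55.0 | 44.7 | 49.3 | 1290 |
| 200 | bwa_mem | 95.2 | 96.5 | 95.8 | 55.1 | 42.3 | 47.8 | 1263 |
| 250 | strobealign | 96.4 | 93.3 | 94.8 | 55.1 | 45.0 | 49.6 | 697 |
| 250 | minimap2 | 87.7 | 94.8 | 91.1 | 54.9 | 43.8 | 48.7 | 998 |
| 250 | bwa_mem | 94.3 | 96.2 | 95.2 | 55.1 | 42.3 | 47.8 | 1593 |
| 300 | strobealign | 95.7 | 92.7 | 94.1 | 55.1 | 44.5 | 49.2 | 1005 |
| 300 | minimap2 | 88.2 | 94.3 | 91.2 | 54.8 | 43.4 | 48.4 | 1046 |
| 300 | bwa_mem | 93.7 | 96.4 | 95.0 | 54.9 | 42.0 | 47.6 | 1988 |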
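
The F-score columns can be recomputed from the P and R columns. The sketch below assumes that the F-score is the harmonic mean of precision and recall (F1); this is not stated explicitly here, but it agrees with the table values to within rounding.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (P, R) pairs copied from the 100 bp strobealign row of the table.
    print(f"SNVs:   {f_score(97.9, 93.5):.2f}")
    print(f"Indels: {f_score(55.6, 41.1):.2f}")
```

Small differences in the last digit are expected, because the table was presumably computed from unrounded precision and recall values.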

Variant calling benchmark (simulated SIM3)

We simulated 2x150 and 2x250 reads at 30x coverage from a human genome with SNV and indel rate according to the SIM3 genome (described in the preprint). We aligned the reads to hg38 without alternative haplotypes as proposed here. We used 16 cores for all aligners.

Results are shown for SNVs and indels separately in Figure 3. For SNVs, predictions with strobealign as the aligner have an F-score on par with most other aligners. BWA has the best performance on this dataset. However, indel predictions have both the highest recall and the highest precision using strobealign. Minimap2 is a close second for calling indels on this dataset, with only 0.1% lower recall and precision than strobealign.

Figure 3. Recall, precision, and F-score for the aligners on the 2x150 and 2x250 datasets from SIM3.

Variant calling benchmark (GIAB)

We used Illumina paired-end reads from the GIAB dataset HG004 (Mother): 2x150bp reads (subsampled to ~26x coverage, using only the reads in 140818_D00360_0047_BHA66FADXX/Project_RM8392) and 2x250bp reads (~17x coverage). We aligned the reads to hg38 without alternative haplotypes as proposed here. We used 16 cores for all aligners. We obtained the "true" SNVs and INDELs from the GIAB gold standard predictions formed from several sequencing technologies. They are provided here.
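
For orientation, the relation between coverage, read length, and number of read pairs can be sketched as follows. The genome size of 3.1 Gbp is an assumption introduced for this example and is not taken from the benchmark itself.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate human genome size


def coverage(n_pairs, read_length, genome_bp=HUMAN_GENOME_BP):
    """Mean coverage from n_pairs paired-end reads of read_length bases each."""
    return 2 * n_pairs * read_length / genome_bp


def pairs_for_coverage(target, read_length, genome_bp=HUMAN_GENOME_BP):
    """Number of read pairs needed to reach the target mean coverage."""
    return target * genome_bp / (2 * read_length)


if __name__ == "__main__":
    for target, length in ((26, 150), (17, 250)):
        pairs = pairs_for_coverage(target, length)
        print(f"2x{length} bp at ~{target}x: about {pairs / 1e6:.0f} million read pairs")
```

Under this assumption, both datasets correspond to a few hundred million reads each, which is the scale at which the 16-thread runtimes are measured.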

Results are shown for SNVs and indels separately in Figure 4. For SNVs, predictions with strobealign as the aligner have the highest F-score of all benchmarked aligners on both datasets. Strobealign's alignments yield the highest precision at the cost of a slightly lower recall. As for indels, predictions have low recall, precision, and F-score with all aligners. This may be because we benchmarked against all gold standard SVs for HG004 that were not SNVs (see the method section below for the evaluation). Overall, predictions using Bowtie2 are the most desirable on these datasets.

Figure 4. Recall, precision, and F-score for the aligners on the 2x150 and 2x250 datasets from HG004.

Runtime

Figure 5 shows the runtime of the aligners using 16 threads on the four larger datasets above. The two SIM3 datasets are denoted SIM150 and SIM250, and the two GIAB datasets are denoted BIO150 and BIO250. Urmap was excluded from the timing benchmark because we could only get it to run with 1 core on our server, as reported here. Strobealign is the fastest aligner across datasets. While urmap could be faster (based on the single-threaded benchmarks), strobealign has substantially better accuracy and downstream SV calling statistics (as seen in previous sections).

Figure 5. Runtime of aligners using 16 threads on two simulated and two biological datasets of about 20-30x coverage of a human genome.

Variant calling benchmark method

For the results, we ran

bcftools mpileup -O z --fasta-ref ref aligned.bam > aligned.vcf.gz
# -O v writes uncompressed VCF, so the output is named .vcf
bcftools call -v -c -O v aligned.vcf.gz > aligned.variants.vcf

# Split into SNVs and INDELs (header lines are kept in both files)
grep -v -E -e "INDEL;" aligned.variants.vcf > aligned.variants.SNV.vcf
grep "#" aligned.variants.vcf > aligned.variants.INDEL.vcf
grep -E -e "INDEL;" aligned.variants.vcf >> aligned.variants.INDEL.vcf

# Separate GIAB SNVs and INDELs
zgrep "#" true.variants.vcf > true.variants.SNV.vcf
zgrep -P "\t[ACGT]\t[ACGT]\t" true.variants.vcf >> true.variants.SNV.vcf
zgrep -v -P "\t[ACGT]\t[ACGT]\t" true.variants.vcf > true.variants.INDEL.vcf

for type in SNV INDEL
do
	# Sort, compress, and index both call sets so that bcftools isec can read them
	bcftools sort -Oz true.variants.$type.vcf -o true.variants.sorted.$type.vcf.gz
	bcftools index true.variants.sorted.$type.vcf.gz
	bcftools sort -Oz aligned.variants.$type.vcf -o aligned.variants.sorted.$type.vcf.gz
	bcftools index aligned.variants.sorted.$type.vcf.gz
	bcftools isec --nfiles 2 -O u true.variants.sorted.$type.vcf.gz aligned.variants.sorted.$type.vcf.gz -p out_$type
done
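
The commands above produce the call sets, but the final step from record counts to precision and recall is not spelled out. The sketch below shows one way it could be done. The helper names are hypothetical, and which files in out_$type hold the shared records is left to the caller, since that mapping is not specified here. The is_snv function mirrors the zgrep pattern above by testing the REF and ALT columns.

```python
import re

SINGLE_BASE = re.compile(r"^[ACGT]$")


def is_snv(vcf_line):
    """True if REF and ALT (columns 4 and 5) are both single bases."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    return bool(SINGLE_BASE.match(ref) and SINGLE_BASE.match(alt))


def count_records(lines):
    """Count VCF records, skipping blank lines and '#' header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def precision_recall(n_shared, n_called, n_true):
    """Precision and recall in percent from record counts.

    n_shared: called variants that are also in the truth set
    n_called: all called variants
    n_true:   all variants in the truth set
    """
    precision = 100.0 * n_shared / n_called if n_called else 0.0
    recall = 100.0 * n_shared / n_true if n_true else 0.0
    return precision, recall


if __name__ == "__main__":
    # Tiny in-memory example instead of real VCF files.
    called = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t10\t.\tA\tG\t50\t.\t.\n",
        "chr1\t20\t.\tAT\tA\t50\t.\tINDEL;\n",
    ]
    snvs = [line for line in called if not line.startswith("#") and is_snv(line)]
    print(count_records(called), len(snvs))
    print(precision_recall(n_shared=8, n_called=10, n_true=16))
```

With real data, the lines would come from the SNV or INDEL VCF files, opened with gzip.open(path, "rt") where they are compressed.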

CREDITS

Kristoffer Sahlin. Flexible seed size enables ultra-fast and accurate read alignment. bioRxiv, 2021. doi:10.1101/2021.06.18.449070. Preprint available here.

VERSION INFO

See release page

LICENCE

MIT license, see LICENSE.
