Licence: other
Long read / genome alignment software

Winnowmap

Winnowmap is a long-read mapping algorithm optimized for mapping ONT and PacBio reads to repetitive reference sequences. Winnowmap development began on top of the minimap2 codebase, and we have since incorporated the following two ideas to improve mapping accuracy within repeats.

  • Winnowmap implements a novel weighted minimizer sampling algorithm (>=v1.0). This optimization was motivated by the need to avoid masking frequently occurring k-mers during the seeding stage while remaining efficient, and to achieve better mapping accuracy in complex repeats (e.g., long tandem repeats) of the human genome. Using weighted minimizers, Winnowmap down-weights frequently occurring k-mers, reducing their chance of being selected as minimizers. This preserves the theoretical guarantee of the minimizer sampling technique, i.e., if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. Users can refer to this paper for more details.

  • We noticed that the highest-scoring alignment doesn't necessarily correspond to the correct placement of reads in repetitive regions of T2T human chromosomes. In the presence of a non-reference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy because the standard pairwise sequence alignment scoring system penalizes true variants. This is sometimes referred to as allelic bias. To address this bias, we introduced and implemented the idea of minimal confidently alignable substrings (>=v2.0). These are minimal-length substrings in a read that align end-to-end to a reference with a mapping quality score above a user-specified threshold. This approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs). Our most recent paper describes this concept and benchmarking results.
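
The second idea above can be illustrated with a deliberately simplified toy in Python. It is not Winnowmap's implementation: "confidently alignable" is replaced here by "occurs exactly once in the reference as an exact match" (a stand-in for a mapping quality threshold), and the way the substrings are combined (simple voting on the implied read start) is an assumption made for illustration. All function names are hypothetical.

```python
from collections import Counter

def occurrences(ref, sub):
    """All start positions of `sub` in `ref` (overlapping matches allowed)."""
    hits, pos = [], ref.find(sub)
    while pos != -1:
        hits.append(pos)
        pos = ref.find(sub, pos + 1)
    return hits

def minimal_unique_substrings(read, ref):
    """For each read offset, find the shortest substring starting there that
    occurs exactly once in ref. Uniqueness is a toy stand-in for MAPQ.
    Returns (read_offset, length, ref_position) triples."""
    found = []
    for i in range(len(read)):
        for j in range(i + 1, len(read) + 1):
            hits = occurrences(ref, read[i:j])
            if not hits:
                break  # a longer substring cannot match either
            if len(hits) == 1:
                found.append((i, j - i, hits[0]))
                break
    return found

def place_read(read, ref):
    """Each confident substring votes for the read start it implies on ref."""
    votes = Counter(pos - off for off, _, pos in minimal_unique_substrings(read, ref))
    return votes.most_common(1)[0][0] if votes else None

# Two repeat copies that differ at a single paralog-specific base (G vs C).
copy1, copy2 = "ACGTACGGTTCA", "ACGTACGCTTCA"
ref = "TTTT" + copy1 + "GGGG" + copy2 + "AAAA"
read = "ACTTACGCTTCA"  # sampled from copy2, with one non-reference allele
print(place_read(read, ref))  # start of copy2 in ref
```

In this toy, the substrings that span the paralog-specific base are the ones that become unique, so they dominate the vote even though the read also carries a variant.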

Compile

Clone the source code from the master branch or download the latest release.

  git clone https://github.com/marbl/Winnowmap.git

Winnowmap compilation requires a C++ compiler with C++11 and OpenMP support, both of which are available by default in GCC >= 4.8.

  cd Winnowmap
  make -j8

After compilation, the winnowmap and meryl executables will be in the bin folder.

Usage

For either mapping long reads or computing whole-genome alignments, Winnowmap requires pre-computing high-frequency k-mers (e.g., the top 0.02% most frequent) in a reference. Winnowmap uses the meryl k-mer counting tool for this purpose.

  • Mapping ONT or PacBio HiFi WGS reads
  meryl count k=15 output merylDB ref.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

  # ONT reads:
  winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont.fq.gz > output.sam
  # or, for PacBio HiFi reads:
  winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam
  • Mapping genome assemblies
  meryl count k=19 output merylDB asm1.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k19.txt

  winnowmap -W repetitive_k19.txt -ax asm20 asm1.fa asm2.fa > output.sam
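
To make the "top 0.02% most frequent" selection concrete, here is a minimal Python sketch of the idea behind the `distinct=0.9998` step: find the count threshold that covers that fraction of distinct k-mers and keep the k-mers above it. This is an illustration only. The function names are hypothetical, and meryl's actual behaviour (canonical k-mers, tie handling, exact threshold rule) may differ.

```python
import math
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer in seq (forward strand only, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def frequent_kmers(counts, distinct=0.9998):
    """Return k-mers whose count exceeds the smallest threshold t such that
    at least a `distinct` fraction of distinct k-mers have count <= t."""
    if not counts:
        return set()
    ordered = sorted(counts.values())
    idx = math.ceil(distinct * len(ordered)) - 1
    idx = max(0, min(len(ordered) - 1, idx))
    threshold = ordered[idx]
    return {kmer for kmer, c in counts.items() if c > threshold}

# Tiny example: "AAA" is the only over-represented 3-mer.
counts = count_kmers("AAAAACGT", 3)
print(frequent_kmers(counts, distinct=0.75))
```

On a real genome the set of distinct k-mers is large enough that a cutoff of 0.9998 keeps only a small tail of highly repeated k-mers.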

For the genome-to-genome use case, it may be useful to visualize the alignments as a dot plot. This Perl script can be used to generate a dot plot from PAF-formatted output. In both use cases, pre-computing repetitive k-mers using meryl is quite fast, e.g., it typically takes 2-3 minutes for the human genome reference.
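
As a rough illustration of what a dot plot needs from PAF output, the following Python sketch turns PAF records into line segments (target coordinate on x, query coordinate on y). It relies only on the 12 mandatory PAF columns; the `Segment` type, the function name, and the example records are hypothetical, and it is not a substitute for the Perl script mentioned above.

```python
from typing import Iterable, List, NamedTuple

class Segment(NamedTuple):
    target: str
    query: str
    x0: int  # target start
    y0: int  # query coordinate paired with target start
    x1: int  # target end
    y1: int  # query coordinate paired with target end

def paf_to_segments(lines: Iterable[str], min_mapq: int = 0) -> List[Segment]:
    """Convert PAF lines into dot-plot segments, skipping low-MAPQ records."""
    segments = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, strand, tname = fields[0], fields[4], fields[5]
        qstart, qend = int(fields[2]), int(fields[3])
        tstart, tend = int(fields[7]), int(fields[8])
        if int(fields[11]) < min_mapq:
            continue
        if strand == "-":
            # Reverse-strand hits run anti-diagonally in the plot.
            qstart, qend = qend, qstart
        segments.append(Segment(tname, qname, tstart, qstart, tend, qend))
    return segments

example = [
    "asm2_ctg1\t1000\t100\t600\t+\tasm1_chr1\t5000\t2000\t2500\t480\t500\t60",
    "asm2_ctg1\t1000\t600\t900\t-\tasm1_chr1\t5000\t3000\t3300\t290\t300\t5",
]
for seg in paf_to_segments(example):
    print(seg)
```

Each segment can then be drawn with any plotting library as a line from (x0, y0) to (x1, y1).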

Benchmarking

When comparing Winnowmap (v1.0) to minimap2 (v2.17-r954), we observed a reduction in the mapping error rate from 0.14% to 0.06% on the recently finished human X chromosome, and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. We show that, by avoiding masking, Winnowmap maintains uniform minimizer density.
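
The reason down-weighting keeps sampling uniform, where masking would leave gaps, is that every window still yields a minimizer. The Python sketch below shows one way to do weighted minimizer selection. It uses a standard weighted-sampling trick (order key = -ln(u) / weight, smallest key wins) that is assumed here for illustration and is not necessarily the formula Winnowmap uses; the function names and the 1/8 down-weight are likewise assumptions.

```python
import hashlib
import math

def kmer_key(kmer, repetitive, down_weight=0.125):
    """Order key for a k-mer; the smallest key in a window is selected."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    # Map the 64-bit hash to a pseudo-uniform value in (0, 1).
    u = (int.from_bytes(digest, "big") + 1) / (2**64 + 1)
    weight = down_weight if kmer in repetitive else 1.0
    # A smaller weight inflates the key, so the k-mer wins less often.
    return -math.log(u) / weight

def weighted_minimizers(seq, k, w, repetitive=frozenset()):
    """Pick one k-mer per window of w consecutive k-mers.
    Returns sorted (position, k-mer) pairs without duplicates."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [kmer_key(km, repetitive) for km in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by leftmost position.
        best = min(range(start, start + w), key=lambda i: (keys[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]

seq = "TTTTTTTTACGGTCATTGCAAGTCCGTAGGGGGG"
print(weighted_minimizers(seq, k=5, w=8, repetitive={"TTTTT", "GGGGG"}))
```

Because the key depends only on the k-mer and the shared repetitive set, two sequences that share a window of w consecutive k-mers (a substring of length w + k - 1) select the same k-mer from it, which is the guarantee that masking gives up.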


Figure: Minimizer sampling density using a human X chromosome as the reference, with the centromere positioned between 58 Mbp and 61 Mbp. ‘Standard’ method refers to the classic minimizer sampling algorithm from Roberts et al., without any masking or modification.
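
A density profile of this kind can be computed from any list of sampled minimizer positions. The Python sketch below bins positions along the reference and reports minimizers per base in each bin; the function name and bin size are hypothetical and the numbers in the example are made up for illustration.

```python
def minimizer_density(positions, seq_len, bin_size):
    """Minimizers per base in consecutive bins of `bin_size` bases.
    The last bin may be shorter and is normalised by its true width."""
    n_bins = (seq_len + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1
    density = []
    for b, c in enumerate(counts):
        width = min(bin_size, seq_len - b * bin_size)
        density.append(c / width)
    return density

# Toy example: five sampled positions on a 30-base sequence, 10-base bins.
print(minimizer_density([0, 5, 9, 10, 25], seq_len=30, bin_size=10))
```

A masked region would show up as a bin with density near zero, whereas uniform sampling keeps all bins at a similar level.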

Publications
