kaist-ina / BWA-MEME

Licence: MIT license
Faster BWA-MEM2 using learned-index

BWA-MEME: BWA-MEM emulated with a machine learning approach

  • BWA-MEME produces results identical to BWA-MEM2 and achieves up to 1.4x higher alignment throughput.
  • Seeding throughput of BWA-MEME is up to 3.32x higher than that of BWA-MEM2.
  • BWA-MEME builds upon BWA-MEM2 and includes performance improvements to the seeding stage.
  • BWA-MEME leverages a learned index for suffix array search.
  • BWA-MEME also provides modes to accommodate servers with different memory sizes.


When to use BWA-MEME

  • Anyone who uses BWA-MEM or BWA-MEM2 on a CPU-only machine (BWA-MEME requires 38GB of memory for the index in minimal mode).
  • Building a high-throughput NGS alignment cluster at a low cost per unit of throughput. CPU-only alignment can be cheaper than using hardware acceleration (GPU, FPGA).
  • Just add the single option "-7" to deploy BWA-MEME instead of BWA-MEM2 (BWA-MEME does not change anything except the speed).

Performance of BWA-MEME

The seeding module of BWA-MEME uses a learned index, which yields up to 3.32x higher seeding throughput than the FM-index used by BWA-MEM2.

End-to-end alignment throughput is up to 1.4x higher than BWA-MEM2.
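
The two figures above can be related with Amdahl's law. The sketch below is only a back-of-the-envelope illustration: it assumes that seeding is the only stage that gets faster and that both "up to" figures come from the same run, neither of which is stated here. The helper name implied_stage_fraction is hypothetical.

```shell
# Hypothetical helper: solve Amdahl's law for the fraction f of baseline
# runtime spent in the accelerated stage.
#   overall = 1 / ((1 - f) + f / stage)  =>  f = (1 - 1/overall) / (1 - 1/stage)
implied_stage_fraction() {
    awk -v stage="$1" -v overall="$2" \
        'BEGIN { printf "%.2f\n", (1 - 1/overall) / (1 - 1/stage) }'
}

# 3.32x seeding speedup and 1.4x end-to-end speedup
implied_stage_fraction 3.32 1.4   # prints 0.41
```

Under those assumptions, seeding would account for roughly 41% of BWA-MEM2 runtime, which is why a 3.32x seeding speedup translates into a smaller end-to-end gain.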


Getting Started

Install Option 1. Bioconda

# Install with conda; this installs bwa-meme and the learned-index training script "build_rmis_dna.sh"
conda install -c conda-forge -c bioconda bwa-meme

# Print the version and mode of the compiled binary executable
# The bwa-meme binary automatically chooses the executable based on the SIMD instructions supported (SSE, AVX2, AVX512 ...)
# Other modes of bwa-meme are available as bwa-meme_mode1 and bwa-meme_mode2
bwa-meme version

Build index of the reference DNA sequence

# Build index (Takes ~1hr for human genome)
# we recommend using at least 8 threads
bwa-meme index -a meme <input.fasta> -t <thread number>

Training P-RMI

# Train P-RMI. This requires the suffix array generated by the index build step
# Takes about 15 minutes for the human genome with a single thread
build_rmis_dna.sh <input.fasta>

Run alignment and compare SAM output with BWA-MEM2

# Perform alignment with BWA-MEME, add -7 option
bwa-meme mem -7 -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_meme.sam>

# Below runs alignment with BWA-MEM2, without -7 option
bwa-meme mem -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_mem2.sam>

# Compare output SAM files
diff <output_mem2.sam> <output_meme.sam>

# To diff large SAM files use https://github.com/unhammer/diff-large-files
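
A plain diff may report a difference in the SAM header even when all alignments match, because the @PG header line typically records the command line (which differs by the -7 option). The following is a minimal sketch that compares two SAM files while ignoring @PG lines. It assumes the @PG line is the only expected difference, and the function name sam_records_equal is hypothetical.

```shell
# Hypothetical helper: succeed if two SAM files are identical once
# @PG header lines are removed from both.
sam_records_equal() {
    tmp_a=$(mktemp) || return 2
    # Filter the first file into a temporary file.
    grep -v '^@PG' "$1" > "$tmp_a"
    # Stream the filtered second file into cmp via standard input.
    grep -v '^@PG' "$2" | cmp -s "$tmp_a" -
    status=$?
    rm -f "$tmp_a"
    return $status
}

# Usage (file names are placeholders):
#   sam_records_equal output_mem2.sam output_meme.sam && echo "identical"
```

Note that this writes a filtered copy of the first file to the temporary directory, so make sure there is enough free space when comparing large SAM files.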

Install Option 2. Build locally

Compile the code

# Compile from source
git clone https://github.com/kaist-ina/BWA-MEME.git BWA-MEME
cd BWA-MEME

# To compile all binary executables, run the command below.
# Set -j to the number of available vCPU cores
# cmake must also be installed, e.g. sudo apt-get install cmake
make -j<num_threads>

# Print the version and mode of the compiled binary executable
# The bwa-meme binary automatically chooses the executable based on the SIMD instructions supported (SSE, AVX2, AVX512 ...)
# Other modes of bwa-meme are available as bwa-meme_mode1 and bwa-meme_mode2
./bwa-meme version

# For bwa-meme with mode 1 or 2 see below

Build index of the reference DNA sequence

# Build index (Takes ~1hr for human genome)
# we recommend using 32 threads
./bwa-meme index -a meme <input.fasta> -t <thread number>

Training P-RMI

Prerequisite for building locally: the training code requires Rust, so please install it first.

# Train P-RMI. This requires the suffix array generated by the index build step
# Takes about 15 minutes for the human genome with a single thread
./build_rmis_dna.sh <input.fasta>

Run alignment and compare SAM output with BWA-MEM2

# Perform alignment with BWA-MEME, add -7 option
./bwa-meme mem -7 -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_meme.sam>

# Below runs alignment with BWA-MEM2, without -7 option
./bwa-meme mem -Y -K 100000000 -t <num_threads> <input.fasta> <input_1.fastq> -o <output_mem2.sam>

# Compare output SAM files
diff <output_mem2.sam> <output_meme.sam>

# To diff large SAM files use https://github.com/unhammer/diff-large-files

Test scripts and executables are available in the BWA-MEME/test folder


Changing memory requirement for index in BWA-MEME

# You can check the MODE value by running the version command
# mode 1: 38GB index size
./bwa-meme_mode1 version
# mode 2: 88GB index size
./bwa-meme_mode2 version
# mode 3: 118GB index size, fastest mode
./bwa-meme version

# If the binary executables do not exist, run the commands below to compile them
make clean
make -j<number of threads>

Notes

  • BWA-MEME requires at least 64 GB of RAM (in minimal mode, the index takes 38GB of memory). For WGS runs on the human genome (>32 threads) with full acceleration, 140-192 GB of RAM is recommended.

  • When deploying BWA-MEME with many threads, the mimalloc library is recommended for better performance (enabled by default).
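
The mode index sizes and RAM notes above can be combined into a simple selection rule. The sketch below picks the fastest binary whose index fits in a given amount of RAM. The function name pick_bwa_meme_binary and the default headroom of 22 GB are assumptions made for illustration: the headroom was chosen only so that mode 3 lines up with the 140 GB lower bound recommended above, and it should be tuned for your thread count and workload.

```shell
# Hypothetical helper: print the binary name for the largest mode whose
# index size (38/88/118 GB) plus headroom fits in the given RAM.
#   $1 = total RAM in GB (integer)
#   $2 = headroom in GB for everything besides the index (assumed default: 22)
pick_bwa_meme_binary() {
    ram_gb="$1"
    headroom_gb="${2:-22}"
    if [ "$ram_gb" -ge $((118 + headroom_gb)) ]; then
        echo "bwa-meme"          # mode 3, fastest
    elif [ "$ram_gb" -ge $((88 + headroom_gb)) ]; then
        echo "bwa-meme_mode2"
    elif [ "$ram_gb" -ge $((38 + headroom_gb)) ]; then
        echo "bwa-meme_mode1"
    else
        echo "not enough memory for any BWA-MEME mode" >&2
        return 1
    fi
}

# Usage on Linux (MemTotal in /proc/meminfo is reported in kB):
#   pick_bwa_meme_binary "$(awk '/MemTotal/ {print int($2/1024/1024)}' /proc/meminfo)"
```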

Building pipeline with Samtools

Credits to @keiranmraine, see issue #10

  • Due to the increased alignment throughput, given enough threads the bottleneck moves from alignment to Samtools sorting. As a result, BWA-MEME might require additional pipeline modifications (it is not a simple drop-in replacement).
  • While an existing pipeline might still be faster (reported in issue #10), CPU time can be wasted.
  • To reduce the wasted CPU time, you might want to use mbuffer in the pipeline or write alignment outputs to a file with fast compression.
  • We will investigate a faster way to combine BWA-MEME and Samtools sorting.
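
One way to apply the mbuffer suggestion above is sketched below. The helper only prints the pipeline so it can be reviewed before running. The function name print_align_sort_pipeline is hypothetical, and the buffer size (4G), sort threads (8), and sort memory per thread (2G) are placeholder values to tune, not recommendations from the BWA-MEME authors.

```shell
# Hypothetical helper: print an alignment | mbuffer | samtools sort pipeline.
#   $1 = reference FASTA, $2 = input FASTQ, $3 = output BAM
#   $4 = alignment threads (assumed default: 32)
print_align_sort_pipeline() {
    ref="$1"
    reads="$2"
    out_bam="$3"
    threads="${4:-32}"
    # mbuffer sits between the aligner and the sorter so that short stalls
    # in samtools sort do not immediately block the alignment threads.
    printf 'bwa-meme mem -7 -Y -K 100000000 -t %s "%s" "%s" | mbuffer -q -m 4G | samtools sort -@ 8 -m 2G -o "%s" -\n' \
        "$threads" "$ref" "$reads" "$out_bam"
}

# Usage (file names are placeholders): review the printed command, then run it.
#   print_align_sort_pipeline ref.fasta reads.fastq out.bam 32
#   print_align_sort_pipeline ref.fasta reads.fastq out.bam 32 | sh
```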

Reference file download

You can download the reference using the command below.

# Download human_g1k_v37.fasta human genome and decompress it
wget -c ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz

# hg38 human reference
wget -c https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta

Download MEME indices and pretrained P-RMI model

# We provide pretrained models and all indices required for alignment (for the hg37 and hg38 human references)
# You can download them from the link below.
https://web.inalab.net/~bwa-meme/

# MEME indices and models must be in the same folder; loading is prefix-based, as in bwa-mem

Citation

If you use BWA-MEME, please cite the following paper

Youngmok Jung, Dongsu Han, BWA-MEME: BWA-MEM emulated with a machine learning approach, Bioinformatics, Volume 38, Issue 9, 1 May 2022, Pages 2404–2413, https://doi.org/10.1093/bioinformatics/btac137

@article{10.1093/bioinformatics/btac137,
    author = {Jung, Youngmok and Han, Dongsu},
    title = "{BWA-MEME: BWA-MEM emulated with a machine learning approach}",
    journal = {Bioinformatics},
    volume = {38},
    number = {9},
    pages = {2404-2413},
    year = {2022},
    month = {03},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btac137},
    url = {https://doi.org/10.1093/bioinformatics/btac137},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/38/9/2404/43480985/btac137.pdf},
}
