lh3 / Wgsim

Reads simulator

Introduction

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.
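
To make "uniform substitution sequencing errors" concrete, here is a minimal sketch in C. It is not code taken from wgsim.c: the function name add_subst_errors, the helper unit_rand, and the use of rand() are all assumptions made for illustration. Each A/C/G/T base is replaced, with a fixed probability, by one of the other three bases chosen with equal chance; no base is ever inserted or deleted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Uniform number in [0, 1) built on rand(); an illustrative stand-in,
 * not the generator wgsim itself uses. */
static double unit_rand(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* Hypothetical helper: apply substitution errors to a read in place.
 * Each A/C/G/T base is replaced, with probability err_rate, by one of
 * the three other bases chosen uniformly. Other characters (e.g. 'N')
 * are left alone. Returns the number of substituted bases. */
int add_subst_errors(char *read, double err_rate)
{
    static const char bases[] = "ACGT";
    int n_err = 0;
    size_t i;

    assert(read != NULL && err_rate >= 0.0 && err_rate <= 1.0);
    for (i = 0; read[i] != '\0'; ++i) {
        const char *hit = strchr(bases, read[i]);
        if (hit == NULL) continue;            /* not A/C/G/T: skip */
        if (unit_rand() < err_rate) {
            /* shift is 1, 2 or 3, so the new base always differs */
            int shift = 1 + (int)(unit_rand() * 3.0);
            read[i] = bases[((hit - bases) + shift) & 3];
            ++n_err;
        }
    }
    return n_err;
}
```

Calling add_subst_errors(read, 0.02) on a NUL-terminated read would mutate about 2% of its bases. Because the read length never changes, this kind of error model cannot produce INDEL sequencing errors.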

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.
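
As an illustration of how the information in a read name could be recovered, the sketch below parses a name in C. The layout it expects, name_start_end_e1:s1:i1_e2:s2:i2_counter/mate (error, substitution and INDEL counts for each mate, then a hexadecimal counter), is my assumption about the output and should be checked against wgsim.c. The struct wgsim_name_t and the function parse_wgsim_name are hypothetical names, not part of the package.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields assumed to be encoded in a wgsim read name (assumed layout). */
typedef struct {
    char chrom[256];        /* reference sequence name */
    unsigned start, end;    /* true fragment coordinates */
    int n_err[2];           /* sequencing errors in mate 1 / mate 2 */
    int n_sub[2];           /* substitution polymorphisms per mate */
    int n_indel[2];         /* INDEL polymorphisms per mate */
    unsigned long long id;  /* hexadecimal counter */
    int mate;               /* value after the '/' suffix */
} wgsim_name_t;

/* Returns 0 on success, -1 if the name does not fit the assumed layout. */
int parse_wgsim_name(const char *name, wgsim_name_t *out)
{
    const char *p;
    int n_us = 0;
    size_t len;

    assert(name != NULL && out != NULL);
    if (*name == '@') ++name;   /* tolerate the FASTQ '@' prefix */

    /* The reference name may contain '_' itself, so search from the
     * right for the 5th underscore; p ends up just after it. */
    for (p = name + strlen(name); p > name; --p)
        if (p[-1] == '_' && ++n_us == 5) break;
    if (n_us < 5) return -1;

    len = (size_t)(p - 1 - name);
    if (len == 0 || len >= sizeof(out->chrom)) return -1;
    memcpy(out->chrom, name, len);
    out->chrom[len] = '\0';

    if (sscanf(p, "%u_%u_%d:%d:%d_%d:%d:%d_%llx/%d",
               &out->start, &out->end,
               &out->n_err[0], &out->n_sub[0], &out->n_indel[0],
               &out->n_err[1], &out->n_sub[1], &out->n_indel[1],
               &out->id, &out->mate) != 10)
        return -1;
    return 0;
}
```

Under that assumption, a mapper's reported position can be compared with the parsed start and end to decide whether an alignment is correct, which is the kind of check wgsim_eval.pl automates.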

Compilation

gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

History

Wgsim was modified from MAQ's read simulator by dropping dependencies on other source code in the MAQ package and incorporating patches from Colin Hercus that allow simulating INDELs longer than 1bp. Wgsim was originally released in the SAMtools software package. I forked it out in 2011 as a standalone project. A few improvements were also added along the way.

Evaluation

Simulation and evaluation

The command line for simulation:

wgsim -Nxxx -1yyy -d0 -S11 -e0 -rzzz hs37m.fa yyy-zzz.fq /dev/null

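where yyy is the read length, zzz is the error rate, and xxx * yyy = 10000000. Here -e0 turns off sequencing errors, so the error rate zzz is applied through the mutation rate option -r. By default, 15% of polymorphisms are INDELs, and their lengths are drawn from a geometric distribution with probability 0.7*0.3^{l-1} for length l.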
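
The two numerical facts above can be sketched in C. This is an illustration under stated assumptions, not an excerpt from wgsim.c: the names indel_len_pmf, draw_indel_len, reads_for_total_bases and unit_rand are hypothetical, and rand() stands in for the real generator. A length is drawn by starting at 1 and extending with probability 0.3, which gives length l the probability 0.7*0.3^{l-1}.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform number in [0, 1) built on rand(); an illustrative stand-in. */
static double unit_rand(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* P(length == l) = (1 - p_ext) * p_ext^(l-1) for l >= 1, else 0. */
double indel_len_pmf(int l, double p_ext)
{
    double p;
    int i;

    assert(p_ext >= 0.0 && p_ext < 1.0);
    if (l < 1) return 0.0;
    p = 1.0 - p_ext;
    for (i = 1; i < l; ++i) p *= p_ext;
    return p;
}

/* Draw one INDEL length: start at 1, keep extending with prob. p_ext. */
int draw_indel_len(double p_ext)
{
    int l = 1;

    assert(p_ext >= 0.0 && p_ext < 1.0);
    while (unit_rand() < p_ext) ++l;
    return l;
}

/* Number of reads xxx such that xxx * read_len == total_bases
 * (integer division; exact when read_len divides total_bases). */
long reads_for_total_bases(long total_bases, int read_len)
{
    assert(total_bases >= 0 && read_len > 0);
    return total_bases / read_len;
}
```

With p_ext = 0.3 the mean INDEL length is 1/0.7, about 1.43bp, so most simulated INDELs are a single base. For the read lengths in the results table, reads_for_total_bases(10000000, yyy) gives the value to pass as -N, for example 100000 reads at 100bp.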

The command line for evaluation:

wgsim_eval.pl unique aln.sam | wgsim_eval.pl alneval -g 20

The value of the '-g' option may need to be changed depending on the mapper.

System

GCC: 4.1.2
CPU: AMD Opteron 8350 @ 2.0GHz
Mem: 128GB

Results

====================================================================================================================
                        100bp               200bp               500bp               1000bp             10000bp
                  ------------------  ------------------  ------------------  ------------------  ------------------
Program  Metrics      2%    5%   10%      2%    5%   10%      2%    5%   10%      2%    5%   10%      2%    5%   10%
--------------------------------------------------------------------------------------------------------------------
BWA-SW   CPU         249   198   136     325   262   163     332   243   232     320   235   215     235   197   189
         Q20%       85.1  63.6  21.4    93.7  88.9  53.5    96.4  95.7  89.2    96.6  96.2  95.1    97.7  98.3  97.7
         err%       0.01  0.06  0.20    0.00  0.01  0.14    0.00  0.01  0.01    0.00  0.00  0.01    0.00  0.00  0.00
         one%       94.6  77.4  35.7    97.5  95.1  67.6    98.6  98.5  93.4    99.0  98.9  98.3    99.7  99.8  99.7
--------------------------------------------------------------------------------------------------------------------
AGILE    CPU                                                 302   484  1060     330   352   607     381   480   919
         Q20%                                               98.6  98.4  98.4    98.4  98.4  98.6    98.2  98.6  99.3
         err%                                               0.66  0.69  2.31    0.34  0.40  0.70    0.10  0.00  0.20
         one%                                                100  99.4     0     100   100   100     100   100   100
====================================================================================================================

  1. AGILE throws "Floating point exception" halfway for 100/200bp reads. The default output is supposed to be PSL, but actually has an additional "score" column. AGILE is reportedly faster than BWA-SW for 1000bp reads. It is slower here possibly because of suboptimal command line options.

  2. Gassst uses over 27GB memory in 20 minutes. The memory then quickly increases to over 40GB. It gets killed.

  3. Lastz complains: "FAILURE: bad fasta character in hs37m.fa ...".

  4. Pash only gives 'unique mapping'. Its unique mapping is better than BWA-SW's Q1 mapping. It is very slow, though, possibly because of suboptimal options.
