All Projects → merenlab → Reads For Assembly

merenlab / Reads For Assembly

Licence: gpl-2.0
Simulate metagenomic short reads from one or more populations.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Reads For Assembly

Webgl Wind
Wind power visualization with WebGL particles
Stars: ✭ 601 (+3435.29%)
Mutual labels:  simulation
Mmcamscanner
Simulation of CamScanner app With Custom Camera and Crop Rect Validation
Stars: ✭ 728 (+4182.35%)
Mutual labels:  simulation
Citybound
A work-in-progress, open-source, multi-player city simulation game.
Stars: ✭ 6,646 (+38994.12%)
Mutual labels:  simulation
Handright
A lightweight Python library for simulating Chinese handwriting
Stars: ✭ 634 (+3629.41%)
Mutual labels:  simulation
Abstreet
Transportation planning and traffic simulation software for creating cities friendlier to walking, biking, and public transit
Stars: ✭ 6,355 (+37282.35%)
Mutual labels:  simulation
Reactphysics3d
Open source C++ physics engine library in 3D
Stars: ✭ 730 (+4194.12%)
Mutual labels:  simulation
Quirk
A drag-and-drop quantum circuit simulator that runs in your browser. A toy for exploring and understanding small quantum circuits.
Stars: ✭ 593 (+3388.24%)
Mutual labels:  simulation
Splatter Paper
Data and analysis for the Splatter paper
Stars: ✭ 17 (+0%)
Mutual labels:  simulation
Attack range
A tool that allows you to create vulnerable instrumented local or cloud environments to simulate attacks against and collect the data into Splunk
Stars: ✭ 693 (+3976.47%)
Mutual labels:  simulation
Openmm
OpenMM is a toolkit for molecular simulation using high performance GPU code.
Stars: ✭ 791 (+4552.94%)
Mutual labels:  simulation
Gridfluidsim3d
A PIC/FLIP fluid simulation based on the methods found in Robert Bridson's "Fluid Simulation for Computer Graphics"
Stars: ✭ 653 (+3741.18%)
Mutual labels:  simulation
Awesome Open Geoscience
Curated from repositories that make our lives as geoscientists, hackers and data wranglers easier or just more awesome
Stars: ✭ 668 (+3829.41%)
Mutual labels:  simulation
Splishsplash
SPlisHSPlasH is an open-source library for the physically-based simulation of fluids.
Stars: ✭ 735 (+4223.53%)
Mutual labels:  simulation
Deepdrive
Deepdrive is a simulator that allows anyone with a PC to push the state-of-the-art in self-driving
Stars: ✭ 628 (+3594.12%)
Mutual labels:  simulation
Spritekit Leaf Simulation
A simulation of leaves blowing in the wind created with SpriteKit.
Stars: ✭ 6 (-64.71%)
Mutual labels:  simulation
Dart
Dynamic Animation and Robotics Toolkit
Stars: ✭ 596 (+3405.88%)
Mutual labels:  simulation
Su2
SU2: An Open-Source Suite for Multiphysics Simulation and Design
Stars: ✭ 731 (+4200%)
Mutual labels:  simulation
Uav coverage
Simulation for planar area coverage by a swarm of UAVs
Stars: ✭ 17 (+0%)
Mutual labels:  simulation
Causal Inference For Data Scientists
Notebooks of Python and R code which illustrates basic causal inference using simulated data
Stars: ✭ 17 (+0%)
Mutual labels:  simulation
Shadow
Shadow is a unique discrete-event network simulator that runs real applications like Tor, and distributed systems of thousands of nodes on a single machine. Shadow combines the accuracy of emulation with the efficiency and control of simulation, achieving the best of both approaches.
Stars: ✭ 769 (+4423.53%)
Mutual labels:  simulation

Simulating Metgenomic Short Reads for Assembly, Mapping, and Binning experiments

Scripts to generate random short reads from a given set of contigs, so they can be assembled back (to test assemblers, or genome binning software, or more).

Using the config files and scripts,

  • You can generate single reads in a FASTA-formatted output file.
  • You can generate paired-end reads in FASTQ-formatted output files. You can define an insert size with a standard deviation.
  • You can generate reads from a single FASTA file with a single contig, single FASTA file with multiple contigs, or you can simulate a complex metagenomic mixture using multiple FASTA files each of with with multiple contigs.
  • You can specify a desired average coverage for each entry, and define an expected error rate for the sequencer.

The tool (both with respect to its functionality, and design) can be improved dramatically, but we only pushed it as much as what we needed from it. If you need something that is not here, please feel free to open an issue.

Note: You will need anvi'o to be installed on your system. Sorry! Find anvi'o installation instructions here: http://merenlab.org/software/anvio

Generating single short reads in a FASTA file

Say, you have one or more FASTA files each of which contains one or more contigs, and you wish to create a single FASTA file that contains enough number of short reads randomly created from these contigs to meet your expected coverage when they are assembled.

The program gen-single-reads allows you to generate those data.

For a little demonstration there is an examles/files directory in the repository that includes 5 FASTA files:

$ ls examples/files/*fa
examples/files/fasta_01.fa
examples/files/fasta_02.fa
examples/files/fasta_03.fa
examples/files/fasta_04.fa
examples/files/fasta_05.fa

Each FASTA file is sampled from a different bacterial genome, and contains 5 contigs that are about 7,500nts long.

Here is an example config file you can use with gen-single-reads:

[general]
output_file = short_reads.fa
short_read_length = 100
error_rate = 0.05
    
[files/fasta_01.fa]
coverage = 100
    
[files/fasta_02.fa]
coverage = 200
    
[files/fasta_03.fa]
coverage = 50
    
[files/fasta_04.fa]
coverage = 150
    
[files/fasta_05.fa]
coverage = 50

For each section except the general section, the config file simply says;

"Generage enough 100nt long short reads from all entries you find in sample/fasta_01.fa, so when they are assembled and short reads mapped back to resulting contigs, the average coverage would be about 100X for each contig in sample/fasta_01.fa. While doing this, introduce some random base errors to meet 0.005 error rate in average. In fact do the same for every other entry in this config file with respect to their required coverage values, and store all resulting reads in short_reads.fa".

Once you have your config file ready, this is how you run it:

$ ./gen-single-reads examples/single-reads-example.ini
fasta_01 w/ 5 contigs : 3,800 reads with 1,870 errors (avg 0.0049) for 10X avg cov.
fasta_02 w/ 5 contigs : 76,000 reads with 37,831 errors (avg 0.0050) for 200X avg cov.
fasta_03 w/ 5 contigs : 19,000 reads with 9,536 errors (avg 0.0050) for 50X avg cov.
fasta_04 w/ 5 contigs : 57,000 reads with 28,489 errors (avg 0.0050) for 150X avg cov.
fasta_05 w/ 5 contigs : 19,000 reads with 9,485 errors (avg 0.0050) for 50X avg cov.
Fasta output ...: short_reads.fa

There it is. So in theory, if you assemble short_reads.fa, you should recover contigs that are about 50, 100, 150 and 200X coverage. Just for fun, I used velvet to assemble what is in the short_reads.fa, then used [bowtie2](http://bowtie- bio.sourceforge.net/bowtie2/index.shtml) to map reads back to resulting contigs, and analyzed the mapping.

Here what I did step by step (excuse me for the crudeness of my demo and directly copy-pasting from my command line):

$ mkdir velvet_output/
$ velveth velvet_output/ 17 short_reads.fa
$ velvetg velvet_output/
$ bowtie2-build velvet_output/contigs.fa contigs_ref
$ bowtie2 -f short_reads.fa -x contigs_ref -S output.sam
209000 reads; of these:
  209000 (100.00%) were unpaired; of these:
    4070 (1.95%) aligned 0 times
    204930 (98.05%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
98.05% overall alignment rate
$ samtools view -bS output.sam > output.bam

I further analyzed the BAM file in an ad hoc manner with in-house scripts. Here is the average coverage of each contig in output.bam file:

average_coverage

So randomly generated and assembled short reads did creaete expected coverage profiles.

Here is the detailed coverage of one of those contigs that is at 50X coverage:

average_coverage

And another example from a contig that is covered about 200X:

average_coverage

Generating paired-end reads in FASTQ R1/R2 files.

Everything is the same, except this time you will use the program gen-paired-end-reads with a config file that looks like this one:

[general]
output_sample_name = test_sample
insert_size = 30
insert_size_std = 1
short_read_length = 100
error_rate = 0.05

[files/fasta_01.fa]
coverage = 10

[files/fasta_02.fa]
coverage = 20

[files/fasta_03.fa]
coverage = 5

[files/fasta_04.fa]
coverage = 15

[files/fasta_05.fa]
coverage = 5

insert_size is different than the real insert size, and describes the gap between the end of read 1 and the beginning of read 2.

Here is an example run:

$ ./gen-paired-end-reads examples/paired-end-reads-example.ini
Read lenth ...............: 100
Insert size ..............: 30
Insert size std ..........: 1.0
fasta_01 w/ 5 contigs ....: 3,800 reads in 1900.0 pairs with 1,986 errors (avg 0.0052) for 10X avg cov.
fasta_02 w/ 5 contigs ....: 7,600 reads in 3800.0 pairs with 3,816 errors (avg 0.0050) for 20X avg cov.
fasta_03 w/ 5 contigs ....: 1,900 reads in 950.0 pairs with 924 errors (avg 0.0049) for 5X avg cov.
fasta_04 w/ 5 contigs ....: 5,700 reads in 2850.0 pairs with 2,844 errors (avg 0.0050) for 15X avg cov.
fasta_05 w/ 5 contigs ....: 1,900 reads in 950.0 pairs with 979 errors (avg 0.0052) for 5X avg cov.
FASTQ R1 .................: test_sample-R1.fastq
FASTQ R2 .................: test_sample-R2.fastq

And this is how these files look like (those A's are made up quality scores :/ it can easily be improved if you need more complex Q-score simulations):

$ head -n 8 test_sample-R*
==> test_sample-R1.fastq <==
@fasta_01:23:B02CBACXX:8:2315:9436:8855 1:N:0:GATCAG
AAAGTGAGCGATACAAAGCTAAAGCCATCAACCAAAAAGCCCGCAACCCGCAAACCAGCTCCGCANACATCGGAGCTATCCAAGGACGATGACTNCCTGT
+source:640612206 slice:0-7600; start:5360; stop:5460; insert_size:32
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@fasta_01:23:B02CBACXX:8:2315:7851:9246 1:N:0:GATCAG
TGATGAGGCCTTCGTGGCCCAAGAGCGGACCAGCCGTAACTACGACCTGGTCGCCGTCTTTTAGGGCCTCGCTCATGGGCACCACGCGGTTTCCCCTATT
+source:640612206 slice:0-7600; start:2274; stop:2374; insert_size:31
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

==> test_sample-R2.fastq <==
@fasta_01:23:B02CBACXX:8:2315:9436:8855 2:N:0:GATCAG
TAAATTCACCAATGACATGGACACTCTTTACGTCGGGCGCCCAAACGCAAATGCGCCAGCCGCGCGTGCGCTGCTGCGTGTCGGGATTCGCGCCGATCTT
+source:640612206 slice:0-7600; start:5492; stop:5592; insert_size:32
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@fasta_01:23:B02CBACXX:8:2315:7851:9246 2:N:0:GATCAG
GCGATACGGCAGATCCTCGGACCGTGCAACAGTACCTGCTGCGCATGGATGACTTTGCCCGGGTGCTTTCACAGGACGGTCAGTTTGTGCCGTTGGCAAA
+source:640612206 slice:0-7600; start:2405; stop:2505; insert_size:31
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Need more from this tool?

It is indeed going to be useful for my purposes, but I would like it to be useful to you as well. So if this is what you need to do benchmarks or to test your stuff, and if its missing something for your purposes, please don't hesitate to get in touch with me.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].