
alexpreynolds / sample

Licence: MIT (LICENSE file); unknown (COPYING file)
Performs memory-efficient reservoir sampling on very large input files delimited by newlines

Programming Languages

C
Makefile

sample

This tool performs reservoir sampling (Vitter, "Random sampling with a reservoir"; cf. http://dx.doi.org/10.1145/3147.3165; see also http://en.wikipedia.org/wiki/Reservoir_sampling) on very large text files that are delimited by newline characters. Sampling can be done with or without replacement. Reservoir sampling usually keeps the sampled elements themselves in memory; this application instead stores a pool of byte offsets to the start of each sampled line, which reduces memory usage and allows much larger sample sizes.

In its current form, this application offers a few advantages over common shuf-based approaches:

  • For small sample sizes k, it runs roughly 2.25-2.75x faster than shuf in informal tests on OS X and Linux hosts.
  • It uses much less memory than the usual reservoir sampling approach that stores a pool of sampled elements; instead, sample stores the start positions of sampled lines (8 bytes per element).
  • Using less memory gives sample an advantage over shuf for whole-genome scale files, helping avoid "shuf: memory exhausted" errors. For instance, at 8 bytes per element, a 2 GB allocation would allow a sample size of up to ~268M random elements (sampling without replacement).
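
The ~268M figure follows from dividing the allocation by the 8 bytes stored per sampled line. A minimal sketch of that arithmetic, assuming "2 GB" means 2 × 2^30 bytes; `max_offsets` is a hypothetical helper for illustration, not a function from sample:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each sampled line costs one 8-byte offset. */
static_assert(sizeof(uint64_t) == 8, "offsets are 8 bytes");

/* Hypothetical helper: how many line offsets fit in `bytes` bytes. */
static uint64_t max_offsets(uint64_t bytes)
{
    return bytes / sizeof(uint64_t);
}
```

With `bytes` set to 2 × 2^30 this gives 2^28 = 268,435,456 offsets.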

The sample tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, using a Mersenne Twister to generate uniformly random values, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use mmap routines to gain random access to data in the regular input file on both passes.
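
The two passes can be sketched roughly as follows. This is an illustration under stated assumptions, not sample's actual source: the input is taken to be already in memory as `buf`, the names `sample_offsets` and `print_sample` are hypothetical, and a small splitmix64 generator stands in for the Mersenne Twister so the block stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* splitmix64: stand-in PRNG (assumption; sample uses a Mersenne Twister). */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform integer in [0, n); rejection avoids modulo bias. */
static uint64_t rand_below(uint64_t *state, uint64_t n)
{
    uint64_t threshold = (0 - n) % n; /* 2^64 mod n */
    uint64_t r;
    do {
        r = next_u64(state);
    } while (r < threshold);
    return r % n;
}

/* Pass 1: reservoir sampling (Algorithm R) over line start offsets.
 * Fills pool[0..k) and returns the number of valid entries,
 * i.e. min(k, number of lines in buf). */
static size_t sample_offsets(const char *buf, size_t len, size_t k,
                             uint64_t seed, size_t *pool)
{
    assert(k > 0);
    uint64_t state = seed;
    size_t seen = 0;  /* lines seen so far */
    size_t start = 0; /* offset of the current line */
    while (start < len) {
        if (seen < k) {
            pool[seen] = start; /* fill the reservoir first */
        } else {
            uint64_t j = rand_below(&state, seen + 1);
            if (j < k)
                pool[j] = start; /* replace with probability k/(seen+1) */
        }
        seen++;
        const char *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL)
            break; /* last line has no trailing newline */
        start = (size_t)(nl - buf) + 1;
    }
    return seen < k ? seen : k;
}

/* Pass 2: write each sampled line, located through its stored offset. */
static void print_sample(const char *buf, size_t len,
                         const size_t *pool, size_t n, FILE *out)
{
    for (size_t i = 0; i < n; i++) {
        const char *line = buf + pool[i];
        const char *nl = memchr(line, '\n', len - pool[i]);
        size_t line_len = nl ? (size_t)(nl - line) + 1 : len - pool[i];
        fwrite(line, 1, line_len, out);
    }
}
```

Only the `pool` array grows with the sample size; the lines themselves are read again from the input when printing.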

The benefit from mmap was significant. For comparison, we also added a --cstdio option that uses standard C I/O routines (fseek(), etc.) instead. Predictably, this performed worse than the mmap-based approach in all tests; even so, its timings were about the same as gshuf on OS X and still averaged a 1.5x improvement over shuf under Linux.
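
To illustrate the kind of random access that mmap gives, here is a minimal POSIX sketch that prints the line starting at a given byte offset. The function name `print_line_at` is hypothetical, and mapping the file on every call is a simplification for brevity; a real tool would presumably map the file once and reuse the mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write the line that starts at byte `offset` of the regular file
 * behind `fd` to `out`. Returns 0 on success, -1 on error. */
static int print_line_at(int fd, size_t offset, FILE *out)
{
    assert(out != NULL);
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
        return -1; /* mmap cannot map an empty file */
    size_t len = (size_t)st.st_size;
    if (offset >= len)
        return -1;

    /* Map the whole file read-only; pages are loaded on demand. */
    char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* Jump straight to the line: no seek or buffered read needed. */
    const char *line = map + offset;
    const char *nl = memchr(line, '\n', len - offset);
    size_t line_len = nl ? (size_t)(nl - line) + 1 : len - offset;
    size_t written = fwrite(line, 1, line_len, out);

    munmap(map, len);
    return written == line_len ? 0 : -1;
}
```

Because mmap needs a mappable file descriptor, this sketch works on regular files but not on pipes.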

The sample tool can be used to sample from any text file delimited by newline characters (BED, SAM, VCF, etc.).

Additionally, the sample tool can be used with the --lines-per-offset option to sample groups of consecutive lines from a text file. This can be useful for sampling from FASTA or FASTQ files, whose records are formatted in two- or four-line groupings.
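
One way to picture multi-line records is to keep an offset only for every Nth line, so that each stored offset marks the start of one record. The sketch below assumes that interpretation; `record_offsets` is a hypothetical name, and the input is taken to be in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store the start offset of every `lines_per_record`-th line of
 * buf[0..len) in out[0..cap). Returns the number of records found,
 * which may exceed cap (only the first cap offsets are stored). */
static size_t record_offsets(const char *buf, size_t len,
                             size_t lines_per_record,
                             size_t *out, size_t cap)
{
    assert(lines_per_record > 0);
    size_t line = 0, records = 0, start = 0;
    while (start < len) {
        if (line % lines_per_record == 0) {
            if (records < cap)
                out[records] = start; /* first line of a record */
            records++;
        }
        line++;
        const char *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL)
            break;
        start = (size_t)(nl - buf) + 1;
    }
    return records;
}
```

For a four-line FASTQ layout, sampling from these offsets and printing four lines per sampled offset keeps each read's header, sequence, separator, and quality lines together.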

One can use the --rng-seed option to reproduce the same sample from a particular file across runs. This can be useful for testing sample distributions, or for sampling paired-end reads in conjunction with --lines-per-offset.

By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed (which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates), the sample will have the same ordering, with a relatively small O(k log k) penalty for a sample of size k.
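
Since the pool holds byte offsets, restoring input order amounts to sorting those offsets in ascending order before printing. A minimal sketch using the C library's qsort; the function names are hypothetical, and the C standard does not itself guarantee qsort's complexity, although typical implementations run in O(k log k):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two offsets without the overflow risk of subtraction. */
static int cmp_offsets(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the k sampled offsets so lines print in file order. */
static void restore_input_order(uint64_t *pool, size_t k)
{
    assert(pool != NULL || k == 0);
    qsort(pool, k, sizeof pool[0], cmp_offsets);
}
```

Ascending offsets also mean the output pass reads the file front to back.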

By default, sample performs sampling without replacement: a sampled element will not be resampled. Using --sample-with-replacement changes this behavior, so that the same element can appear in the sample more than once.
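
The difference can be illustrated with a sketch of sampling with replacement, in which every draw is independent and the same line offset may be chosen repeatedly. This shows the semantics only and is not taken from sample's source: it assumes all `n` line offsets are already known, `draw_with_replacement` is a hypothetical name, and splitmix64 stands in for the Mersenne Twister.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* splitmix64: stand-in PRNG (assumption; sample uses a Mersenne Twister). */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform integer in [0, n); rejection avoids modulo bias. */
static uint64_t rand_below(uint64_t *state, uint64_t n)
{
    uint64_t threshold = (0 - n) % n; /* 2^64 mod n */
    uint64_t r;
    do {
        r = next_u64(state);
    } while (r < threshold);
    return r % n;
}

/* Draw k offsets from offsets[0..n) independently and uniformly,
 * so the same line may be picked more than once. */
static void draw_with_replacement(const size_t *offsets, size_t n,
                                  size_t k, uint64_t seed, size_t *out)
{
    assert(n > 0);
    uint64_t state = seed;
    for (size_t i = 0; i < k; i++)
        out[i] = offsets[rand_below(&state, n)];
}
```

Here `k` may exceed `n`, which is impossible when sampling without replacement.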

By omitting the sample size parameter, the sample tool can shuffle the entire file. This makes it possible to shuffle files that cause memory issues for shuf; however, sample is currently slower than shuf at shuffling whole files that shuf can handle. We recommend using shuf to shuffle an entire file where possible, or, when using sample, setting --sample-size to the line count if it is known ahead of time (e.g., from wc -l or similar).
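
Conceptually, shuffling a whole file through line offsets means permuting the full list of offsets and then printing lines in the permuted order. The sketch below uses a Fisher-Yates shuffle for this. It is an assumption about one reasonable approach, not a description of how sample shuffles internally; `shuffle_offsets` is a hypothetical name and splitmix64 again stands in for the Mersenne Twister.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* splitmix64: stand-in PRNG (assumption; sample uses a Mersenne Twister). */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform integer in [0, n); rejection avoids modulo bias. */
static uint64_t rand_below(uint64_t *state, uint64_t n)
{
    uint64_t threshold = (0 - n) % n; /* 2^64 mod n */
    uint64_t r;
    do {
        r = next_u64(state);
    } while (r < threshold);
    return r % n;
}

/* Fisher-Yates: permute pool[0..n) uniformly at random, in place. */
static void shuffle_offsets(size_t *pool, size_t n, uint64_t seed)
{
    assert(pool != NULL || n == 0);
    uint64_t state = seed;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand_below(&state, i); /* 0 <= j < i */
        size_t tmp = pool[i - 1];
        pool[i - 1] = pool[j];
        pool[j] = tmp;
    }
}
```

The memory cost is still one offset per line rather than the lines themselves.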

One downside at this time is that sample does not process a standard input stream; the input to sample must be a regular file. In contrast, the shuf tool can process a standard input stream.
