hroptatyr / sample

Licence: BSD-3-Clause License
Produce a sample of lines from files.

Programming languages: C, M4, Makefile, Roff

sample

Produce a sample of lines from files. The sample size is either fixed or proportional to the size of the file. Additionally, the header and footer can be included in the sample.

Red tape

  • no dependencies other than a POSIX system and a C99 compiler
  • licensed under the BSD 3-Clause licence

Features

  • proportional sampling of streams and files
  • header and footer can be included in the sample
  • reservoir sampling (fixed sample size) of streams and files
  • stable reservoir sampling (i.e. the order is preserved)
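
As an illustration of the last two features, here is a minimal sketch of order-preserving reservoir sampling in C. It is not taken from sample's source: it uses the textbook Algorithm R, and the names `struct slot`, `rand_below` and `reservoir_sample` are hypothetical. Each kept line remembers its input position, and the reservoir is sorted by that position at the end so the output order matches the input order.

```c
#define _POSIX_C_SOURCE 200809L /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One retained line together with its position in the input. */
struct slot {
	size_t lineno;
	char *text;
};

static int by_lineno(const void *a, const void *b)
{
	const struct slot *x = a, *y = b;
	return (x->lineno > y->lineno) - (x->lineno < y->lineno);
}

/* Hypothetical helper: integer in [0, n) for n > 0.
 * rand() with modulo is biased and limited to RAND_MAX; good enough
 * for a sketch, not for inputs with more than RAND_MAX lines. */
static size_t rand_below(size_t n)
{
	return (size_t)rand() % n;
}

/* Algorithm R: keep at most k lines of IN in RES (room for k slots).
 * Returns the number of lines kept; the caller frees each text.
 * Error handling for strdup() is omitted for brevity. */
size_t reservoir_sample(FILE *in, struct slot *res, size_t k)
{
	char *line = NULL;
	size_t cap = 0, seen = 0, kept = 0;

	while (getline(&line, &cap, in) != -1) {
		if (kept < k) {
			/* fill phase: the first k lines always go in */
			res[kept].lineno = seen;
			res[kept].text = strdup(line);
			kept++;
		} else {
			/* replace a random slot with probability k/(seen+1) */
			size_t j = rand_below(seen + 1);
			if (j < k) {
				free(res[j].text);
				res[j].lineno = seen;
				res[j].text = strdup(line);
			}
		}
		seen++;
	}
	free(line);
	/* restore input order, which makes the sample "stable" */
	qsort(res, kept, sizeof(*res), by_lineno);
	return kept;
}
```

Memory use is bounded by the k retained lines, regardless of the input size, which is the property that makes this approach suitable for streams.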

Motivation

GNU coreutils' shuf -n is practically ubiquitous and, in principle, solves the problem at hand. However, shuf buffers all input and is therefore useless for files that don't fit in memory.

Looking for alternatives, one may come across paulgb's subsample or earino's fast_sample. They usually do the trick and, judging by GitHub stars, everyone seems to agree. However, both tools have shortcomings: first, they try to make sense of the line data semantically, and second, they are slow.

The first issue is such a major problem that their bug trackers are full of reports. subsample needs lines to be UTF-8 strings, and fast_sample wants CSV files whose correctness is checked along the way. This project's tool, sample, on the other hand, does not care about a line's content; all it needs is the line break at the end.

The speed issue is addressed by

  • using the most appropriate programming language for the problem (C)
  • using radix sort
  • using the PCG family to obtain randomness
  • oversampling
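
For readers unfamiliar with the PCG family, the sketch below shows its widely published 32-bit member (PCG-XSH-RR with 64 bits of state), following the minimal C reference implementation. Which PCG variant sample actually uses is not stated here, so treat this as an assumption; the names `pcg32_t`, `pcg32_seed`, `pcg32_next` and `pcg32_below` are chosen for this example only.

```c
#include <stdint.h>

/* Generator state: a 64-bit LCG state plus an odd stream increment. */
typedef struct {
	uint64_t state;
	uint64_t inc;
} pcg32_t;

/* Advance the LCG and return a permuted 32-bit output (XSH-RR). */
uint32_t pcg32_next(pcg32_t *rng)
{
	uint64_t old = rng->state;
	rng->state = old * 6364136223846793005ULL + rng->inc;
	uint32_t xorshifted = (uint32_t)(((old >> 18u) ^ old) >> 27u);
	uint32_t rot = (uint32_t)(old >> 59u);
	/* rotate right by rot bits */
	return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

/* Seed with an initial state and a stream selector. */
void pcg32_seed(pcg32_t *rng, uint64_t initstate, uint64_t initseq)
{
	rng->state = 0u;
	rng->inc = (initseq << 1u) | 1u; /* the increment must be odd */
	pcg32_next(rng);
	rng->state += initstate;
	pcg32_next(rng);
}

/* Unbiased integer in [0, bound) for bound > 0, by rejecting the
 * few low values that would make the modulo uneven. */
uint32_t pcg32_below(pcg32_t *rng, uint32_t bound)
{
	uint32_t threshold = -bound % bound; /* 2^32 mod bound */
	for (;;) {
		uint32_t r = pcg32_next(rng);
		if (r >= threshold)
			return r % bound;
	}
}
```

The generator needs only a multiply, an add and a few shifts per number, which is why it is attractive when one random draw per input line is required.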

Examples

To get 10 random words from the words file:

$ sample -n 10 -H 0 /usr/share/dict/words
...
benzopyrene
calamondins
cephalothorax
copulate
garbology's
Kewadin
Peter's
reassembly
Vienna's
Wagnerism's
...

The -H 0 option suppresses the header, which defaults to 5 lines.

For proportional sampling use -r|--rate:

$ wc -l /usr/share/dict/words
305089
$ sample -r 1% /usr/share/dict/words | wc -l
3080

This is close to the expected 1% (about 3051 lines), bearing in mind that by default the header and footer of the file are printed as well.
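
One simple way to sample at a fixed rate is to decide for every line independently whether to keep it (Bernoulli sampling). The C sketch below illustrates that idea only. It is an assumption that this resembles what sample does internally, and the function name `bernoulli_sample` is hypothetical. With this scheme the number of emitted lines is itself random and merely averages rate × total lines.

```c
#define _POSIX_C_SOURCE 200809L /* for getline() */
#include <stdio.h>
#include <stdlib.h>

/* Copy each line of IN to OUT with probability RATE (0.0 to 1.0).
 * Returns the number of lines written. */
size_t bernoulli_sample(FILE *in, FILE *out, double rate)
{
	char *line = NULL;
	size_t cap = 0, emitted = 0;

	while (getline(&line, &cap, in) != -1) {
		/* u is uniform in [0, 1); rand() is used for brevity only */
		double u = rand() / ((double)RAND_MAX + 1.0);
		if (u < rate) {
			fputs(line, out);
			emitted++;
		}
	}
	free(line);
	return emitted;
}
```

Calling `bernoulli_sample(stdin, stdout, 0.01)` would keep roughly one line in a hundred while holding only the current line in memory.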

Sampling with a rate of 0 prints only the header and footer, which replaces awkward scripts that combine zsh's multios with head and tail to produce the same result.

$ sample -r 0 /usr/share/dict/words
A
AA
AAA
Aachen
aah
...
Zyuganov
Zyuganov's
zyzzyva
zyzzyvas
ZZZ
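
Printing the first and last few lines in a single pass, without knowing the input length in advance, can be done with a small ring buffer. The C sketch below shows one way to do it. It is not sample's actual code, and the function name `head_tail` and its parameters are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write the first NHEAD and the last NTAIL lines of IN to OUT.
 * Lines are never printed twice. Returns the number of lines read,
 * or 0 if the ring buffer could not be allocated. */
size_t head_tail(FILE *in, FILE *out, size_t nhead, size_t ntail)
{
	char **ring = calloc(ntail ? ntail : 1, sizeof(*ring));
	char *line = NULL;
	size_t cap = 0, seen = 0;

	if (ring == NULL)
		return 0;

	while (getline(&line, &cap, in) != -1) {
		if (seen < nhead) {
			fputs(line, out); /* header goes out right away */
		} else if (ntail > 0) {
			/* overwrite the oldest of the last ntail lines */
			size_t i = (seen - nhead) % ntail;
			free(ring[i]);
			ring[i] = strdup(line);
		}
		seen++;
	}

	/* replay the ring from its oldest entry to its newest */
	size_t rest = seen > nhead ? seen - nhead : 0;
	size_t count = rest < ntail ? rest : ntail;
	for (size_t k = 0; k < count; k++)
		fputs(ring[(rest - count + k) % ntail], out);

	for (size_t i = 0; i < ntail; i++)
		free(ring[i]);
	free(ring);
	free(line);
	return seen;
}
```

Because only NTAIL lines are held at any time, this works on pipes and on files far larger than memory.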

Similar projects

In no particular order and without any claim to completeness:
