
sd2k / ttv

Licence: other
A command line tool for splitting files into test, train, and validation sets.

Programming Languages

rust
shell

Projects that are alternatives of or similar to ttv

Gltf Asset Generator
Tool for generating various glTF assets for importer validation
Stars: ✭ 103 (+171.05%)
Mutual labels:  validation, test
Split Folders
🗂 Split folders with files (i.e. images) into training, validation and test (dataset) folders
Stars: ✭ 203 (+434.21%)
Mutual labels:  validation, test
rasa-train-test-gha
A GitHub action to run easily rasa train and rasa test in the CIs.
Stars: ✭ 26 (-31.58%)
Mutual labels:  test, train
oolong
oolong 🍵 : create and administrate validation tests for typical automated content analysis tools.
Stars: ✭ 40 (+5.26%)
Mutual labels:  validation
teth
Testing and deployment framework for Ethereum smart contracts.
Stars: ✭ 31 (-18.42%)
Mutual labels:  test
walrus
🎉 Cli development framework.
Stars: ✭ 17 (-55.26%)
Mutual labels:  test
webargs-starlette
Declarative request parsing and validation for Starlette with webargs
Stars: ✭ 36 (-5.26%)
Mutual labels:  validation
onixcheck
ONIX validation library and commandline tool
Stars: ✭ 20 (-47.37%)
Mutual labels:  validation
openapi-schemas
JSON Schemas for every version of the OpenAPI Specification
Stars: ✭ 22 (-42.11%)
Mutual labels:  validation
laravel-pwned-passwords
Simple Laravel validation rule that allows you to prevent or limit the re-use of passwords that are known to be pwned (unsafe). Based on TroyHunt's Have I Been Pwned (https://haveibeenpwned.com)
Stars: ✭ 67 (+76.32%)
Mutual labels:  validation
alexa-skill-clean-code-template
Alexa Skill Template with clean code (eslint, sonar), testing (unit tests, e2e), multi-language, Alexa Presentation Language (APL) and In-Skill Purchases (ISP) support. Updated to ASK-CLI V2.
Stars: ✭ 34 (-10.53%)
Mutual labels:  test
fixture-monkey
Let Fixture Monkey generate test instances including edge cases automatically
Stars: ✭ 177 (+365.79%)
Mutual labels:  test
xrechnung-schematron
Schematron rules for the German CIUS (XRechnung) of EN16931:2017
Stars: ✭ 19 (-50%)
Mutual labels:  validation
bdd-for-c
A simple BDD library for the C language
Stars: ✭ 90 (+136.84%)
Mutual labels:  test
railrouter-sg
A progressive web app that lets you explore MRT and LRT rail routes in Singapore
Stars: ✭ 29 (-23.68%)
Mutual labels:  train
svelte-form
JSON Schema form for Svelte v3
Stars: ✭ 47 (+23.68%)
Mutual labels:  validation
osgi-test
Testing support for OSGi. Includes JUnit 4 and JUnit 5 support and AssertJ support.
Stars: ✭ 22 (-42.11%)
Mutual labels:  test
excel validator
Python script to validate data in Excel files
Stars: ✭ 14 (-63.16%)
Mutual labels:  validation
validation
Validation on Laravel 5.X|6.X|7.X|8.X
Stars: ✭ 26 (-31.58%)
Mutual labels:  validation
python-client
Python SDK client for Split Software
Stars: ✭ 12 (-68.42%)
Mutual labels:  split


ttv - create train, test, validation sets

ttv is a command line tool for splitting large files into chunks suitable for train/test/validation splits in machine learning. It arose from the need to split files that were too large to fit into memory, and the desire to do so in a clean way.
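To illustrate the idea of splitting a file without loading it into memory, here is a minimal sketch of a streaming, proportion-based split. It is not taken from ttv's source: the names `SplitMix64`, `pick_split` and `split_stream` are hypothetical, and the small hand-rolled random number generator is only there so the sketch needs no external crates.

```rust
use std::io::{self, BufRead, Write};

/// Tiny deterministic PRNG (SplitMix64). Hypothetical helper, not ttv's RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick a split index by walking the cumulative proportions.
fn pick_split(props: &[f64], draw: f64) -> usize {
    assert!(!props.is_empty());
    let mut cumulative = 0.0;
    for (i, p) in props.iter().enumerate() {
        cumulative += p;
        if draw < cumulative {
            return i;
        }
    }
    props.len() - 1 // guard against floating-point rounding
}

/// Stream `input` line by line and write each line to one of `outputs`.
/// Only the current line is held in memory. Returns rows written per split.
fn split_stream<R: BufRead, W: Write>(
    input: R,
    outputs: &mut [W],
    props: &[f64],
    seed: u64,
) -> io::Result<Vec<usize>> {
    assert_eq!(outputs.len(), props.len());
    let mut rng = SplitMix64(seed);
    let mut counts = vec![0; props.len()];
    for line in input.lines() {
        let line = line?;
        let i = pick_split(props, rng.next_f64());
        writeln!(outputs[i], "{}", line)?;
        counts[i] += 1;
    }
    Ok(counts)
}
```

Because each row is assigned independently, the resulting split sizes only approximate the requested proportions. Passing the same seed gives the same assignment for the same input.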

ttv requires a toolchain that supports the Rust 2021 edition (Rust 1.56 or later).

Installation

Build with cargo build --release to get a binary at ./target/release/ttv. Copy the binary into a directory on your PATH to use it.

Usage

Run ttv --help to get help, or infer what you can from one of these examples:

# Split CSV file into two sets of a fixed number of rows
$ ttv split data.csv --rows=train=9000 --rows=test=1000 --uncompressed-input

# Accepts gzipped data (no flag required). Shorthand argument version. As many splits as you like!
$ ttv split data.csv.gz --rows=train=65000,validation=15000,test=15000

# Alternatively, specify proportion-based splits. -u is shorthand for --uncompressed-input
$ ttv split data.csv --prop=train=0.8,test=0.2 -u

# When using proportions, include the total rows to get a progress bar
$ ttv split data.csv --prop=train=0.8,test=0.2 --total-rows=1234 -u

# Accepts data from stdin, compressed or not (an output prefix must be given)
$ cat data.csv | ttv split --rows=test=10000,train=90000 --output-prefix data -u
$ cat data.csv.gz | ttv split --rows=test=10000,train=90000 --output-prefix data

# Using pigz for faster decompression
$ pigz -dc data.csv.gz | ttv split --prop=test=0.1,train=0.9 --chunk-size 5000 --output-prefix data -u

# Split outputs into chunks for faster writing/reading later
$ ttv split data.csv.gz --rows=test=100000,train=900000 --chunk-size 5000

# Write outputs uncompressed
$ ttv split data.csv.gz --prop=test=0.5,train=0.5 --uncompressed-output

# Reproducible splits using seed
$ ttv split data.csv.gz --prop=test=0.5,train=0.5 --chunk-size 1000 --seed 5330
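The split arguments in the examples above all take the form name=value, separated by commas. As a rough sketch of how such a specification could be parsed and checked, assuming nothing about ttv's actual parser: `parse_spec` and `validate_props` are hypothetical names, and the rule that proportions must be positive and sum to 1 is an assumption made for illustration.

```rust
/// Parse a spec such as "train=0.8,test=0.2" into (name, value) pairs,
/// preserving the order in which the splits were given.
/// Hypothetical helper, not ttv's own parser.
fn parse_spec(spec: &str) -> Result<Vec<(String, f64)>, String> {
    spec.split(',')
        .map(|part| -> Result<(String, f64), String> {
            let (name, value) = part
                .split_once('=')
                .ok_or_else(|| format!("expected name=value, got '{}'", part))?;
            let value: f64 = value
                .trim()
                .parse()
                .map_err(|e| format!("bad value for '{}': {}", name, e))?;
            Ok((name.trim().to_string(), value))
        })
        .collect()
}

/// Assumed rule: every proportion is positive and they sum to 1,
/// within a small tolerance for floating-point rounding.
fn validate_props(splits: &[(String, f64)]) -> Result<(), String> {
    if splits.iter().any(|(_, p)| *p <= 0.0) {
        return Err("proportions must be positive".to_string());
    }
    let total: f64 = splits.iter().map(|(_, p)| p).sum();
    if (total - 1.0).abs() > 1e-9 {
        return Err(format!("proportions sum to {}, expected 1", total));
    }
    Ok(())
}
```

The same `parse_spec` shape would also fit row-based specifications such as train=9000,test=1000, with the values read as integers instead of floats.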

Development

You'll need a recent version of the Rust nightly toolchain and Cargo. Then just hack away as normal.
