All Projects → stefan-schroedl → tabulator

stefan-schroedl / tabulator

Licence: MIT license
A set of Unix shell command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, tsv, csv, or flat data file format) with a header line. Provides column reference by name, automatic delimiter and compression detection for per-line transformations, sql-like group-by operation and relational join.

Programming Languages

perl
6916 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to tabulator

Structured Text Tools
A list of command line tools for manipulating structured text data
Stars: ✭ 6,180 (+18076.47%)
Mutual labels:  tsv, delimited-files
vcf2tsv
Genomic VCF to tab-separated values
Stars: ✭ 27 (-20.59%)
Mutual labels:  tsv, tab-separated
Winmerge
WinMerge is an Open Source differencing and merging tool for Windows. WinMerge can compare both folders and files, presenting differences in a visual text format that is easy to understand and handle.
Stars: ✭ 2,358 (+6835.29%)
Mutual labels:  tsv, csv-files
Flatfiles
Reads and writes CSV, fixed-length and other flat file formats with a focus on schema definition, configuration and speed.
Stars: ✭ 275 (+708.82%)
Mutual labels:  tsv, csv-files
Intellij Csv Validator
CSV validator, highlighter and formatter plugin for JetBrains Intellij IDEA, PyCharm, WebStorm, ...
Stars: ✭ 198 (+482.35%)
Mutual labels:  tsv, csv-files
.config
⚙️ Bootstrappable user environment for macOS & Ubuntu
Stars: ✭ 31 (-8.82%)
Mutual labels:  unix
InitKit
Neo-InitWare is a modular, cross-platform reimplementation of the systemd init system. It is experimental.
Stars: ✭ 364 (+970.59%)
Mutual labels:  unix
comma splice
Fixes CSVs with unquoted commas in values
Stars: ✭ 67 (+97.06%)
Mutual labels:  csv-files
onionjuggler
Manage your Onion Services via CLI or TUI on Unix-like operating system with a POSIX compliant shell.
Stars: ✭ 31 (-8.82%)
Mutual labels:  unix
netpoll
Package netpoll implements a network poller based on epoll/kqueue.
Stars: ✭ 38 (+11.76%)
Mutual labels:  unix
shod
mouse-based window manager that can tile windows inside floating containers
Stars: ✭ 126 (+270.59%)
Mutual labels:  unix
GLPT
GLPT :: OpenGL Pascal Toolkit. A multi-platform library for OpenGL and OpenGL ES
Stars: ✭ 26 (-23.53%)
Mutual labels:  unix
subhook.nim
subhook wrapper for Nim https://github.com/Zeex/subhook
Stars: ✭ 15 (-55.88%)
Mutual labels:  unix
minishell
As beautiful as a shell
Stars: ✭ 124 (+264.71%)
Mutual labels:  unix
bonomen
BONOMEN - Hunt for Malware Critical Process Impersonation
Stars: ✭ 42 (+23.53%)
Mutual labels:  unix
unfs3
UNFS3 is a user-space implementation of the NFSv3 server specification.
Stars: ✭ 74 (+117.65%)
Mutual labels:  unix
smooth
The smooth Class Library
Stars: ✭ 23 (-32.35%)
Mutual labels:  unix
ncurses guide
NCurses Examples from the book "Programmer's Guide to NCurses" with improvements and fixes
Stars: ✭ 43 (+26.47%)
Mutual labels:  unix
BSDCoreUtils
BSD coreutils is a port of many utilities from BSD to Linux and macOS.
Stars: ✭ 30 (-11.76%)
Mutual labels:  unix
tush
No description or website provided.
Stars: ✭ 23 (-32.35%)
Mutual labels:  unix

Tabulator: Shell scripts for delimited data files

URL: https://github.com/stefan-schroedl/tabulator

Author: Stefan Schroedl

Date

  • 2015/04/03 release 1.2.1
  • 2015/03/21 release 1.2
  • 2014/10/12 release 1.1.2
  • 2012/01/24 release 1.1.1
  • 2011/11/25 release 1.1.0
  • 2009/06/16 release 1.0.0

Purpose

Unix/Linux comes with several tools, such as cut, paste, join, sort, to process tabular text data files (a.k.a., tab-delimited, csv, tsv, or flat file format). However, they have shortcomings that often requires additional scripting and prevents them from being directly used in one-liners. For example, having to count columns to pass as arguments is cumbersome and error-prone; join needs presorted files, and works with a single key column; there is no direct 'group-by' functionality.

One remedy would be to load the data into a relational database or noSQL system (e.g., Hadoop/Pig) first, but these might not be available or be more time-consuming for short, ad-hoc tasks.

Tabulator is a collection of command-line shell tools for Unix/Linux platforms that build on native shell programs and can be used as filters, but make them more easy and flexible to use. In particular, they

  • allow to reference columns by names rather than position, as indicated by the first line in a file; this makes scripts more readable and robust to changes in the input data format.
  • automate file format recognition (delimiter, compression etc).
  • check file format (e.g., consistent number of columns).
  • offer expanded functionality, such as SQL-like relational join and group-by operations.

Installation

Installation is easy -- just unpack the tarball, add the unpacked directory to your PATH.

License

Tabulator is licensed under the MIT LICENSE, see LICENSE for details.

Documentation

Here is a brief list of the programs together with their main functionality. Each one provides more documentation and examples when called with the -m or -h options. A common assumption is that the first line in input files contains the column names.

  • tblcat: concatenate files of the same data format without header repetition.
  • tblcmd: execute a program on the body of a file (e.g., sort, uniq), without affecting the header.
  • tbldesc: for each column, summarize its type (e.g., char, int, float), percentage of undefined values, min/max/mean/median/std, etc. Can also provide correlation coefficients with a target column.

Example: Suppose file is

    name,house_nr,height,shoe_size
    arthur,42,6.1,11.5
    berta,101,5.5,8
    chris,333,5.9,10
    don,77,5.9,12.5

Then tbldesc file prints:

    summarizing file_desc (4 lines, target column: shoe_size)
    field name     type char% uniq min max avg  std    mse  corr   prob%
    1 name       char 100    4 [arthur; berta; chris; don]
    2 house_nr   int    0    4 42   333 138   114    172   -0.287 71.25
    3 height     float  0    3 5.5  6.1 5.85  0.218  4.89   0.812 18.82
    4 shoe_size  float  0    4 8   12.5 10.5  1.7     0.0   1.0    0.00
  • tblmap: simple line-wise ("map") computation similar to awk.

Example: Compute ratio of columns sales and clients for lines where column region has value us:

tblmap -s'region=="us"' -c'sales_per_client=sales/client' file
  • tblred: compute ("reduce") aggregations (e.g., sum, min, max, avg, etc.) over groups of keys -- similar to the SQL group by operator.

Example:

tblred -k'region' 'sales_ratio=sales/sum(sales)'

computes for each line proportion of column sales to total sales for all lines with the same value of column region.

  • tbljoin: Relational join. In contrast to Unix join, input files don't need to be pre-sorted, and multiple join columns can be specified.

Example: Suppose file1 is

name,street,house
zorro,desert road,5
john,main st,2
arthur,pan-galactic bypass,42
arthur,main st,15

and file2 is

name,street,phone
john,main st,654-321
arthur,main st,121-212
john,round cir,123-456

Then tbljoin file1 file2 gives

name,street,house,phone
arthur,main st,15,121-212
john,main st,2,654-321
  • tblhist: computation and ascii-plotting of the histogram of column values.
  • tblsplit: split a file into several ones based on a column value.

Example: Suppose file is

    continent,country
    americas,us
    americas,mx
    europe,de
    europe,fr

Then tblsplit -rk'continent' file generates two files, file.select.continent=americas:

    country
    us
    mx

and file.select.continent=europe:

    country
    de
    fr
  • tblsort: interface for unix sort.
  • tbltex: formatting for latex tables.
  • tbltranspose: transposition of rows and columns.
  • tbluniq: check for and cut out duplicate columns; also, discover functional value dependencies.
  • tblcolumn: format columns in a more readable way, using the unix 'column' program, and aligning and shortening numbers.
  • tblless: page through formatted column output (calls tblcolumn).

Implementation Notes

  • The scripts have been developed over time to help with various data processing tasks, and were not designed from the outset to be released in one package. Therefore, some scripts are implemented in Python, and some in Perl; and there is a small amount of overlapping functionality.
  • For very large files, it is crucial to be able to process them in streaming mode, i.e., without keeping them entirely in main memory. This is the case for tblmap (since it translates into an cut/awk script) and for tbljoin (using sort and join), For tblred, presort the file first, then run it with option -s.
  • tblmap, tbljoin, and tblcmd work by first translating the command into a standalone shell script.

Limitations

  • There is no special interpretation of block delimiters like ' or "; it is the user's responsibility to ensure that the column delimiter cannot occur within column values.
  • tbljoin, tblred, tblhist, tbluniq, tblcat, and tbltex have some restrictions when run as a filter (repeated reading is necessary in some cases).
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].