r-lib / Vroom

Licence: gpl-3.0

Fast reading of delimited files

🏎💨vroom

The fastest delimited reader for R, 1.48 GB/sec.

But that’s impossible! How can it be so fast?

vroom doesn’t stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, which further improves performance.

package	version	time (sec)	speedup	throughput
vroom	1.3.0	1.11	67.13	1.48 GB/sec
data.table	1.13.0	13.12	5.67	125.19 MB/sec
readr	1.3.1	32.57	2.28	50.41 MB/sec
read.delim	4.0.2	74.37	1.00	22.08 MB/sec
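
As a minimal sketch of the lazy loading described above, the following writes a small temporary file, reads it with vroom, and then touches only one column. The data and variable names are made up for illustration, and the `altrep` argument is assumed to be available in the installed vroom version.

```r
library(vroom)

# Hypothetical example data, written to a temporary file.
path <- tempfile(fileext = ".csv")
vroom_write(data.frame(id = 1:5, name = letters[1:5]), path, delim = ",")

# Reading indexes the file; with Altrep enabled, column values can be
# materialized later, when they are first accessed.
df <- vroom(path, delim = ",", col_types = "ic", altrep = TRUE)

# Accessing a single column works like any other data frame column.
total <- sum(df$id)
total
```

No special accessor is needed here: ordinary code such as `df$id` is what triggers the read.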

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

delimiter guessing*
custom delimiters (including multi-byte* and Unicode* delimiters)
specification of column types (including type guessing)
- numeric types (double, integer, big integer*, number)
- logical types
- datetime types (datetime, date, time)
- categorical types (characters, factors)
column selection, like dplyr::select()*
skipping headers, comments and blank lines
quoted fields
double and backslashed escapes
whitespace trimming
Windows newlines
reading from multiple files or connections*
embedded newlines in headers and fields**
writing delimited files with as-needed quoting
robust to invalid inputs (vroom has been extensively tested with the afl fuzz tester)*

* these are additional features not in readr.

** requires num_threads = 1.
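
A short sketch of two of the features listed above, delimiter guessing and column selection, might look like the following. The file contents and column names are invented for the example, and it assumes vroom exports the `cols()`, `col_integer()` and `col_logical()` helpers in the same way readr does.

```r
library(vroom)

# Hypothetical tab-delimited input with three columns.
path <- tempfile(fileext = ".txt")
writeLines(c("x\ty\tz", "1\ta\tTRUE", "2\tb\tFALSE"), path)

# No `delim` is given, so vroom guesses it from the first lines.
# `col_select` keeps only the named columns, in the style of dplyr::select().
df <- vroom(
  path,
  col_select = c(x, z),
  col_types = cols(x = col_integer(), z = col_logical())
)
names(df)
```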

Installation

Install vroom from CRAN with:

install.packages("vroom")

Alternatively, if you need the development version from GitHub, install it with:

# install.packages("devtools")
devtools::install_dev("vroom")

Usage

See getting started to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 x 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # … with 29 more rows
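
The compact codes above ("i", "f", "_", "l") have longer equivalents in the readr-style interface. The following is a sketch only: it uses a tiny made-up file rather than mtcars.tsv, and it assumes the `cols()`, `col_integer()`, `col_factor()` and `col_skip()` helpers are exported by the installed vroom version.

```r
library(vroom)

# Hypothetical stand-in for a few mtcars columns.
path <- tempfile(fileext = ".tsv")
writeLines(
  c("model\tcyl\tgear\tdisp",
    "Mazda RX4\t6\t4\t160",
    "Datsun 710\t4\t4\t108"),
  path
)

# Columns not named in cols() are guessed; disp is dropped entirely.
df <- vroom(
  path,
  delim = "\t",
  col_types = cols(
    cyl = col_integer(),   # same as "i"
    gear = col_factor(),   # same as "f"
    disp = col_skip()      # same as "_"
  )
)
str(df)
```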

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First, we generate some files to read by splitting the flights dataset from nycflights13 by airline.

library(nycflights13)
# Write one tab-separated file per carrier, named flights_<carrier>.tsv
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files)
#> Rows: 336,776
#> Columns: 19
#> Delimiter: "\t"
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time...
#> dttm [ 1]: time_hour
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
#> 1  2013     1     1      810            810         0     1048           1037
#> 2  2013     1     1     1451           1500        -9     1634           1636
#> 3  2013     1     1     1452           1455        -3     1637           1639
#> # … with 336,773 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
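
When combining files, it can help to record which file each row came from. The sketch below does this with the `id` argument of `vroom()`; the two small files and the column name `source` are invented for the example and stand in for the per-carrier files above.

```r
library(vroom)

# Two hypothetical per-carrier files in a fresh temporary directory.
dir <- tempfile("vroom-demo-")
dir.create(dir)
vroom_write(data.frame(carrier = "AA", flight = 1:2),
            file.path(dir, "flights_AA.tsv"), delim = "\t")
vroom_write(data.frame(carrier = "UA", flight = 3:5),
            file.path(dir, "flights_UA.tsv"), delim = "\t")

files <- sort(list.files(dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE))

# `id = "source"` adds a column holding the originating file name.
combined <- vroom(files, delim = "\t", col_types = "ci", id = "source")
table(basename(combined$source))
```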

Learning more

Getting started with vroom
📽 vroom: Because Life is too short to read slow - Presentation at UseR!2019 (slides)
📹 vroom: Read and write rectangular data quickly - a video tour of the vroom features.

Benchmarks

The speed quoted above is from a real 1.53G dataset with 14,388,451 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.
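
The speedup and throughput columns of the benchmark table can be roughly reproduced from the timings. This is a back-of-the-envelope sketch, not the benchmark code: it assumes "1.53G" means 1.53 GiB and that speedup is measured against the slowest reader.

```r
# Times in seconds, copied from the benchmark table above.
times <- c(vroom = 1.11, data.table = 13.12, readr = 32.57, read.delim = 74.37)

# Assumption: "1.53G" means 1.53 GiB (1024^3 bytes per GiB).
size_bytes <- 1.53 * 1024^3

# Speedup relative to the slowest reader (read.delim).
speedup <- max(times) / times

# Throughput in GB/sec (10^9 bytes per second).
throughput_gb <- size_bytes / times / 1e9

round(speedup, 2)
round(throughput_gb, 2)
```

Because the published times are rounded, the recomputed speedup for vroom comes out slightly below the 67.13 shown in the table.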

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Most users will not need to set these.

VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (tempdir()).
VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().
VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, in R notebooks, when running tests with testthat, and when knitting documents.
VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default: 128 KiB).
VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).
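
One way to set these from within R is base R's `Sys.setenv()`, called before `vroom()` is used. The values below are arbitrary illustrations, not recommendations.

```r
# Environment variable values are always strings.
Sys.setenv(
  VROOM_THREADS = "2",             # use two threads for indexing and parsing
  VROOM_SHOW_PROGRESS = "false",   # never show the progress bar
  VROOM_TEMP_PATH = tempdir()      # same as the documented default
)

# Confirm what the current R session sees.
Sys.getenv(c("VROOM_THREADS", "VROOM_SHOW_PROGRESS"))
```

To make a setting persistent, the same `NAME=value` pairs can instead go in a `.Renviron` file, which R reads at startup.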

There is also a family of variables that control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0), they are automatically turned off and the variables have no effect. Each variable can take one of true, false, TRUE, FALSE, 1, or 0.

VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

VROOM_USE_ALTREP_CHR
VROOM_USE_ALTREP_FCT
VROOM_USE_ALTREP_INT
VROOM_USE_ALTREP_BIG_INT
VROOM_USE_ALTREP_DBL
VROOM_USE_ALTREP_NUM
VROOM_USE_ALTREP_LGL
VROOM_USE_ALTREP_DTTM
VROOM_USE_ALTREP_DATE
VROOM_USE_ALTREP_TIME

RStudio caveats

RStudio’s environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.

Thanks

Gabe Becker, Luke Tierney and Tomas Kalibera for conceiving, implementing and maintaining the Altrep framework.
Romain François, whose Altrepisode package and related blog posts were a great guide for creating new Altrep objects in C++.
Matt Dowle and the rest of the Rdatatable team; data.table::fread() is blazing fast and was great motivation to see how fast we could go!
