hrbrmstr / Ndjson

Licence: other
♨️ Wicked-Fast Streaming 'JSON' ('ndjson') Reader in R

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

ndjson

Wicked-Fast Streaming ‘JSON’ (‘ndjson’) Reader

Description

Streaming ‘JSON’ (‘ndjson’) has one ‘JSON’ record per line, and many modern ‘ndjson’ files contain large numbers of records. These constructs may not be columnar in nature, but it is often useful to read in these files and “flatten” the structure out to enable working with the data in an R ‘data.frame’-like context. Functions are provided that make it possible to read in plain ‘ndjson’ files or compressed (‘gz’) ‘ndjson’ files and either validate the format of the records or create “flat” ‘data.table’ structures from them.

Pretty much an Rcpp/C++14 wrapper for https://github.com/nlohmann/json

The goal is to create a completely “flat” data.frame-like structure from ndjson records in plain text ndjson files or gzip’d ndjson files.
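"Flat" here means that every nested key path becomes one column, with the keys along the path joined by dots. As a rough illustration of that naming idea only (it is not how ndjson works internally), base R's unlist() applies the same convention to a nested list. The record below is made up for the example:

```r
# A nested record written as an R list (mirrors a small JSON object).
record <- list(
  top  = list(`next` = list(final = 1, end = TRUE), another = "yes"),
  more = "no"
)

# unlist() walks the nesting and joins the names along each path with ".".
# It also coerces every value to character, unlike the typed columns
# that ndjson returns.
flat <- unlist(record)
names(flat)
## [1] "top.next.final" "top.next.end"   "top.another"    "more"
```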

Installation guidance for Linux/BSD-ish systems

CRAN has binaries for Windows and macOS. To build the package from source on UNIX-like systems, you need at least g++ 4.9 or clang++. This requirement is imposed by the ndjson library.

The least painful way to do this is to install gcc >= 4.9 (and you should install ccache while you’re at it) and modify ~/.R/Makevars as follows:

# Use whatever version of (g++ >=4.9 or clang++) that you downloaded
VER=-4.9
CC=ccache gcc$(VER)
CXX=ccache g++$(VER)
SHLIB_CXXLD=g++$(VER)
FC=ccache gfortran
F77=ccache gfortran

Why ndjson + Examples

One example of such files is the output from Rapid7 internet-wide scans, such as their HTTPS study. A gzip’d extract of 100,000 records from one of those scans weighs in at about 171MB. The records sometimes contain heavily nested JSON elements, depending on how comprehensive the certificate data and other fields were. A typical record looks like this:

{
  "vhost": "teamchat.buzzpoints.com",
  "host": "52.87.143.83",
  "certsubject": {
    "CN": "teamchat.buzzpoints.com"
  },
  "ip": "52.87.143.83",
  "data": "SFRUUC8xLjEgMjAwIE9LDQpTZXJ2ZXI6IG5naW54LzEuNC42IChVYnVudHUpDQpEYXRlOiBNb24sIDIyIEF1ZyAyMDE2IDE3OjE3OjAwIEdNVA0KQ29udGVudC1UeXBlOiB0ZXh0L2h0bWw7IGNoYXJzZXQ9dXRmLTgNClRyYW5zZmVyLUVuY29kaW5nOiBjaHVua2VkDQpDb25uZWN0aW9uOiBjbG9zZQ0KVmFyeTogQWNjZXB0LUVuY29kaW5nDQpYLVBvd2VyZWQtQnk6IEV4cHJlc3MNClN0cmljdC1UcmFuc3BvcnQtU2VjdXJpdHk6IG1heC1hZ2U9NjMwNzIwMDA7IGluY2x1ZGVTdWJkb21haW5zOyBwcmVsb2FkDQpYLUZyYW1lLU9wdGlvbnM6IERFTlkNClgtQ29udGVudC1UeXBlLU9wdGlvbnM6IG5vc25pZmYNCkNvbnRlbnQtRW5jb2Rpbmc6IGd6aXANCg0KNTVjDQofiwgAAAAAAAADrVdbb5tIFH5ufgVlFanVFsNwZ2u7ap10N6tuE7nOqvtkDcPBngQYFsYu6a/fA/iCE8ci0j5gi5nvfOd+Zhi+vriezP65uVSWMk3GZ8PtH9BofKYow4Rn90oByUgt5UMC5RJAqop8yGGkSqikzspSVVhCy3KkzucpSBCFhovzuaosC4hHqg4GiV3f9v3IsYAZcewRMCJiBZZNTMsKPBITEofxAMU+tAzzmqGAUqwKBiNZrEAd/0/WyCWk0KgipgchGhISJ7ZjYtHANwzDsplrOyGzqG0Rw6CUquOzs7NhyQqey67rd3RN21V1vHV9XqwyyVOYM5HFfDGfKyPlz2/XXwc5LUp4EwETEdxOryYizUUGmXyjnnufzk2z9XsKCdAS8P3c+oi/f13OLq+n57ZBBuaA1MvmBH9vbj99uZrMv13OZldff/+2gSOPd9ECptfXs/nt9MtmxzSXUuZlw/n53PwsgaZsSeUgXP38mQueyXLARLrj34rPbz7O/pjfTC8/X33fUe1QlDGB3paTxtUJTRKIWlSdsNYQupJilUdUwt9QlFxkOxov8GLDZrFtOUEMvmEDlgSYnu+5rmfEAXguJczdO/2EagoxlsiShsk+YK7nYOKAmMyyXcNzAz8w/TCwfZdaoUsj0/SsyAncvROPDZyIIhJrurOTxq4RmBHWbhg74GLJeKFDQjfyIKCmFWFhh07oxbWAd6G+fft+qLdVgWWDNXuybpSyYHWH+L4PIWMmw5o0DDPwTayUmFBKbMNFLa6JHhFncLdrkLsn/dFRi+UquUxgPBXsHuRggrke6u3S2ash1hpVMP9YkXKkrmSs+aqij7c7da1o8O+Kr0cqlrHEKtXqjsc+b982rV/Pivc7npM0UOUcc9Vh0MhzKr9rtx+1uj+o5JjajszV5QiiBY6CraUZTXEOxQVdpGhkB/m6S96iIl7KgocriUXYQS4SEdLkKbxA7dmiC4QMimPINYcfuSi66n/wSC5HEaw5A615eafwjEtOE61kNIEReaektOLpKt0vrEoomre6okeZeGpUWtI8TzhD20SmzXgCE5GIomPlL4ZtW249sjZpbp1/KniVUozkPqM6rxdKPRELoaelRG6N2HaFzyDPFh/WI+s0aTvwnmMMC/ED3WtBgypNjhE2o1ljJ1zaH0cpzXgMJUZ9c8oc2L/ZxH4R2U7TXpgtC5FiZiAspShA4xLSLVEzKX/T9RYzWAixSPC8EKm+hesR9g9P9EywOMyy9C7LovuQ5/c0FFGGJ+QdhwVjcROvvVKOzqtKyX8CHpU0e9geo43herle/Iph2Vqh44EKstRjijUksgFuH/HjgNJ03AqfQ1pM3TOUc8QeZPYZS0lgVvj0pkVsL1rTr4iJc6e9S7RBOGEtYvvQBm4V9A9B0CsCrl25dm9D3cN+ea1p2IrPxNb2K/tECLolvSkErRHpEwnLrKwTWTvG3Yj04SZuRU5E+
Rh3I9Ll1rR6Ru0DUy5xhrKVVA6KuhFTGsOUI9GqtBa9mQGPmgb3mqZpz7a9qnqIgoYXE7bcyG+60vEqx9v1S9eNxyJaA+3603HlMXjX9a5RuUY//gb6Un7PrDzM+ZGJ+NgkrYG+mN+tPMx7L/4a+lJ+QvDAIdhrfTRswC/WYRo4eHpmAYE1+MU62oOzpx9HTtketUocnMtOz2xvwC/2w0f3/b6xasEdHUN92XxHDvFcfMDb8KthxNfbj8UcrxtaImhUX7NwFBxsljnP8LrVrB9sFMAkUcdDHZlqoSeb5qlNvMI8L2mf2nQ6m1uKzT9etvXWQfS3+Yr+DxJBEERWDwAADQowDQoNCg==",
  "port": "443"
}

Timing the read with system.time(df <- stream_in("https-extract.json.gz")) results in:

   user  system elapsed 
 14.822   0.224  15.189 

on a 13" MacBook Pro and produces:

Classes ‘data.table’ and 'data.frame': 100000 obs. of  36 variables:
 $ certsubject.CN                 : chr  "*.tio.ch" "*.starwoodhotels.com" "a.ssl.fastly.net" "a.ssl.fastly.net" ...
 $ data                           : chr  "SFRUUC8xLjEgNDAzIEZvcmJpZGRlbg0KU2VydmVyOiBjbG91ZGZsYXJlLW5naW54DQpEYXRlOiBNb24sIDIyIEF1ZyAyMDE2IDE3OjE2OjE2IEdNVA0KQ29udGVudC1"| __truncated__ "SFRUUC8xLjAgNDAwIEJhZCBSZXF1ZXN0DQpTZXJ2ZXI6IEFrYW1haUdIb3N0DQpNaW1lLVZlcnNpb246IDEuMA0KQ29udGVudC1UeXBlOiB0ZXh0L2h0bWwNCkNvbnR"| __truncated__ "SFRUUC8xLjEgNTAwIERvbWFpbiBOb3QgRm91bmQNClNlcnZlcjogVmFybmlzaA0KUmV0cnktQWZ0ZXI6IDANCmNvbnRlbnQtdHlwZTogdGV4dC9odG1sDQpDYWNoZS1"| __truncated__ "SFRUUC8xLjEgNTAwIERvbWFpbiBOb3QgRm91bmQNClNlcnZlcjogVmFybmlzaA0KUmV0cnktQWZ0ZXI6IDANCmNvbnRlbnQtdHlwZTogdGV4dC9odG1sDQpDYWNoZS1"| __truncated__ ...
 $ host                           : chr  "104.20.28.6" "104.80.186.186" "151.101.255.54" "151.101.158.15" ...
 $ ip                             : chr  "104.20.28.6" "104.80.186.186" "151.101.255.54" "151.101.158.15" ...
 $ port                           : chr  "443" "443" "443" "443" ...
 $ vhost                          : chr  "104.20.28.6" "104.80.186.186" "a.ssl.fastly.net" "a.ssl.fastly.net" ...
 $ certsubject.C                  : chr  NA "US" "US" "US" ...
 $ certsubject.L                  : chr  NA "Stamford" "San Francisco" "San Francisco" ...
 $ certsubject.O                  : chr  NA "STARWOOD HOTELS AND RESORTS WORLDWIDE, INC." "Fastly, Inc." "Fastly, Inc." ...
 $ certsubject.OU                 : chr  NA "IT Solutions" NA NA ...
 $ certsubject.ST                 : chr  NA "Connecticut" "California" "California" ...
 $ certsubject.emailAddress       : chr  NA NA NA NA ...
 $ certsubject.UNDEF              : chr  NA NA NA NA ...
 $ certsubject.businessCategory   : chr  NA NA NA NA ...
 $ certsubject.postalCode         : chr  NA NA NA NA ...
 $ certsubject.serialNumber       : chr  NA NA NA NA ...
 $ certsubject.street             : chr  NA NA NA NA ...
 $ certsubject.SN                 : chr  NA NA NA NA ...
 $ certsubject.unstructuredName   : chr  NA NA NA NA ...
 $ certsubject.ITU-T              : chr  NA NA NA NA ...
 $ certsubject.GN                 : chr  NA NA NA NA ...
 $ certsubject.description        : chr  NA NA NA NA ...
 $ certsubject.subjectAltName     : chr  NA NA NA NA ...
 $ certsubject.name               : chr  NA NA NA NA ...
 $ certsubject.DC                 : chr  NA NA NA NA ...
 $ certsubject.postOfficeBox      : chr  NA NA NA NA ...
 $ certsubject.dnQualifier        : chr  NA NA NA NA ...
 $ certsubject.generationQualifier: chr  NA NA NA NA ...
 $ certsubject.initials           : chr  NA NA NA NA ...
 $ certsubject.pseudonym          : chr  NA NA NA NA ...
 $ certsubject.title              : chr  NA NA NA NA ...
 $ certsubject                    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ certsubject.unstructuredAddress: chr  NA NA NA NA ...
 $ certsubject.UID                : chr  NA NA NA NA ...
 $ certsubject.mail               : chr  NA NA NA NA ...
 $ certsubject.Mail               : chr  NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 

All of the certificate sub-field data elements have been expanded and we have a highly performant data.table to work with. Just go see what you have to do in jsonlite to get a similar output (and how long it will take).

pryr::object_size(df) shows that this data.table consumes 394 MB, which means we can comfortably read in many more extracts on a reasonably configured system, and most (if not all) of the scan data on a well-configured AWS box.

However, if you do end up trying to work with that scan data, it’s highly recommended that you use jq to filter out the fields or records you want into a more compact ndjson file.
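jq remains the better tool for this because it parses each record properly. Where jq is not available, a crude pre-filter can be sketched in base R. The helper name filter_ndjson(), the sample records and the match pattern below are all hypothetical, and matching on raw text depends on how the JSON happens to be spaced:

```r
# Hypothetical helper: keep only the ndjson lines that contain a fixed string,
# reading in chunks so the whole file never has to sit in memory at once.
filter_ndjson <- function(infile, outfile, pattern, chunk_size = 10000L) {
  con_in <- gzfile(infile, open = "rt")   # gzfile() also reads uncompressed text
  on.exit(close(con_in), add = TRUE)
  con_out <- gzfile(outfile, open = "wt") # write a gzip-compressed result
  on.exit(close(con_out), add = TRUE)

  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    hits <- lines[grepl(pattern, lines, fixed = TRUE)]
    writeLines(hits, con_out)
    kept <- kept + length(hits)
  }
  invisible(kept)
}

# Small demonstration with made-up records written to a temporary file.
src <- tempfile(fileext = ".json")
writeLines(c(
  '{"port":"443","ip":"192.0.2.1"}',
  '{"port":"80","ip":"192.0.2.2"}',
  '{"port":"443","ip":"192.0.2.3"}'
), src)
dst <- tempfile(fileext = ".json.gz")
kept <- filter_ndjson(src, dst, '"port":"443"')
```

The resulting compact file can then be passed to stream_in() like any other gzip’d ndjson file.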

What’s inside the tin?

The following functions are implemented:

  • stream_in: Stream in and flatten an ndjson file into a data.table (handles .gz files)
  • validate: Validate the JSON records in an ndjson file (handles .gz files)
  • flatten: Flatten a character vector of individual JSON lines into a data.table

There are no current plans for a stream_out() function since jsonlite::stream_out() does a great job tossing data.frame-like structures out to an ndjson file.
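A round trip between the two packages might look like the following sketch. It assumes both jsonlite and ndjson are installed and skips itself otherwise. The built-in iris data set and the temporary file are stand-ins chosen for the example:

```r
# Sketch: write a data.frame out with jsonlite, then read it back with ndjson.
# Assumes the jsonlite and ndjson packages are installed; skipped otherwise.
if (requireNamespace("jsonlite", quietly = TRUE) &&
    requireNamespace("ndjson", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".json")

  # One JSON record per line goes into the file.
  jsonlite::stream_out(head(iris), file(tmp), verbose = FALSE)

  # Check the records, then read them back as a flat data.table.
  ndjson::validate(tmp)
  back <- ndjson::stream_in(tmp)
  nrow(back)
}
```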

Installation

install.packages("ndjson", repos = "https://cinc.rud.is")
# or
remotes::install_git("https://git.rud.is/hrbrmstr/ndjson.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/ndjson")
# or
remotes::install_gitlab("hrbrmstr/ndjson")
# or
remotes::install_bitbucket("hrbrmstr/ndjson")
# or
remotes::install_github("hrbrmstr/ndjson")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(ndjson)

# current version
packageVersion("ndjson")
## [1] '0.8.0.9000'

library(microbenchmark)

flatten('{"top":{"next":{"final":1,"end":true},"another":"yes"},"more":"no"}')
##    more top.another top.next.end top.next.final
## 1:   no         yes         TRUE              1

f <- system.file("extdata", "test.json", package="ndjson")
gzf <- system.file("extdata", "testgz.json.gz", package="ndjson")

dplyr::glimpse(ndjson::stream_in(f))
## Observations: 100
## Variables: 8
## $ args                      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ headers.Accept            <chr> "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*",…
## $ `headers.Accept-Encoding` <chr> "identity", "identity", "identity", "identity", "identity", "identity", "identity",…
## $ headers.Host              <chr> "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin…
## $ `headers.User-Agent`      <chr> "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)",…
## $ id                        <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ origin                    <chr> "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22"…
## $ url                       <chr> "http://httpbin.org/stream/100", "http://httpbin.org/stream/100", "http://httpbin.o…
dplyr::glimpse(ndjson::stream_in(gzf))
## Observations: 100
## Variables: 8
## $ args                      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ headers.Accept            <chr> "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*",…
## $ `headers.Accept-Encoding` <chr> "identity", "identity", "identity", "identity", "identity", "identity", "identity",…
## $ headers.Host              <chr> "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin…
## $ `headers.User-Agent`      <chr> "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)",…
## $ id                        <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ origin                    <chr> "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22"…
## $ url                       <chr> "http://httpbin.org/stream/100", "http://httpbin.org/stream/100", "http://httpbin.o…

dplyr::glimpse(jsonlite::stream_in(file(f), flatten=TRUE, verbose=FALSE))
## Observations: 100
## Variables: 7
## $ url                       <chr> "http://httpbin.org/stream/100", "http://httpbin.org/stream/100", "http://httpbin.o…
## $ id                        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ origin                    <chr> "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22"…
## $ headers.Host              <chr> "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin…
## $ `headers.Accept-Encoding` <chr> "identity", "identity", "identity", "identity", "identity", "identity", "identity",…
## $ headers.Accept            <chr> "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*",…
## $ `headers.User-Agent`      <chr> "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)",…
dplyr::glimpse(jsonlite::stream_in(gzfile(gzf), flatten=TRUE, verbose=FALSE))
## Observations: 100
## Variables: 7
## $ url                       <chr> "http://httpbin.org/stream/100", "http://httpbin.org/stream/100", "http://httpbin.o…
## $ id                        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ origin                    <chr> "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22", "50.252.233.22"…
## $ headers.Host              <chr> "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin.org", "httpbin…
## $ `headers.Accept-Encoding` <chr> "identity", "identity", "identity", "identity", "identity", "identity", "identity",…
## $ headers.Accept            <chr> "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*", "*/*",…
## $ `headers.User-Agent`      <chr> "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)", "Wget/1.18 (darwin15.5.0)",…

microbenchmark(
    ndjson = { ndjson::stream_in(f) },
  jsonlite = { jsonlite::stream_in(file(f), flatten=TRUE, verbose=FALSE) }
)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval cld
##    ndjson 2.435400 2.508409 2.607311 2.554602 2.611543 6.070535   100  a 
##  jsonlite 4.177671 4.392665 4.530934 4.555521 4.656247 5.029599   100   b

microbenchmark(
    ndjson = { ndjson::stream_in(gzf) },
  jsonlite = { jsonlite::stream_in(gzfile(gzf), flatten=TRUE, verbose=FALSE) }
)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval cld
##    ndjson 2.208561 2.313191 2.371382 2.370058 2.422588 2.622296   100  a 
##  jsonlite 3.417319 3.576970 3.685897 3.664169 3.816465 4.258603   100   b

ndjson Metrics

Lang          # Files   (%)  LoC   (%)  Blank lines   (%)  # Lines   (%)
C++                 3  0.33  338  0.74          105  0.62       55  0.21
C/C++ Header        1  0.11   66  0.14           15  0.09       40  0.16
R                   4  0.44   28  0.06            6  0.04       57  0.22
Rmd                 1  0.11   24  0.05           43  0.25      104  0.41

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
