
reproio / columnify

License: Apache-2.0
Convert record-oriented data to columnar formats.

Programming Languages

go
31211 projects - #10 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to columnify

Bigdata File Viewer
A cross-platform (Windows, Mac, Linux) desktop application to view common big data binary formats like Parquet, ORC, AVRO, etc. Supports local file systems, HDFS, AWS S3, Azure Blob Storage, etc.
Stars: ✭ 86 (+207.14%)
Mutual labels:  avro, bigdata, parquet
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (+1303.57%)
Mutual labels:  avro, parquet
Choetl
ETL Framework for .NET / c# (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
Stars: ✭ 372 (+1228.57%)
Mutual labels:  avro, parquet
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (+107.14%)
Mutual labels:  avro, parquet
centurion
Kotlin Bigdata Toolkit
Stars: ✭ 320 (+1042.86%)
Mutual labels:  bigdata, parquet
Ratatool
A tool for data sampling, data generation, and data diffing
Stars: ✭ 279 (+896.43%)
Mutual labels:  avro, parquet
Gcs Tools
GCS support for avro-tools, parquet-tools and protobuf
Stars: ✭ 57 (+103.57%)
Mutual labels:  avro, parquet
DaFlow
Apache Spark based Data Flow (ETL) framework which supports multiple read/write destinations of different types and also supports multiple categories of transformation rules.
Stars: ✭ 24 (-14.29%)
Mutual labels:  avro, parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (+246.43%)
Mutual labels:  avro, parquet
Avro
Apache Avro is a data serialization system.
Stars: ✭ 2,005 (+7060.71%)
Mutual labels:  avro, bigdata
Bigdata Playground
A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+532.14%)
Mutual labels:  avro, parquet
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+1350%)
Mutual labels:  avro, parquet
parquet-flinktacular
How to use Parquet in Flink
Stars: ✭ 29 (+3.57%)
Mutual labels:  avro, parquet
Vscode Data Preview
Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
Stars: ✭ 245 (+775%)
Mutual labels:  avro, parquet
parquet-extra
A collection of Apache Parquet add-on modules
Stars: ✭ 30 (+7.14%)
Mutual labels:  avro, parquet
amas
Amas is a recursive acronym for “Amas, monitor alert system”.
Stars: ✭ 77 (+175%)
Mutual labels:  bigdata
qs-hadoop
Learning the big data ecosystem
Stars: ✭ 18 (-35.71%)
Mutual labels:  bigdata
srclient
Golang Client for Schema Registry
Stars: ✭ 188 (+571.43%)
Mutual labels:  avro
albis
Albis: High-Performance File Format for Big Data Systems
Stars: ✭ 20 (-28.57%)
Mutual labels:  parquet
terraform-aws-kinesis-firehose
This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.
Stars: ✭ 25 (-10.71%)
Mutual labels:  parquet

columnify


Convert record-oriented data to columnar formats.

Synopsis

Columnar-formatted data is efficient for analytics queries, lightweight, and easy to integrate with data warehouse middleware. Conversion from record-oriented data to columnar formats is typically handled by big data stacks like the Hadoop ecosystem, and there has been no easy way to do it lightly and quickly.

columnify is a simple conversion tool for columnar formats that runs as a single binary written in Go. It supports several input data formats such as JSONL (newline-delimited JSON) and Avro.

How to use

Installation

$ go install github.com/reproio/columnify/cmd/columnify@latest

Usage

$ ./columnify -h
Usage of columnify: columnify [-flags] [input files]
  -output string
        path to output file; default: stdout
  -recordType string
        data type, [avro|csv|jsonl|ltsv|msgpack|tsv] (default "jsonl")
  -schemaFile string
        path to schema file
  -schemaType string
        schema type, [avro|bigquery]

Example

$ cat examples/record/primitives.jsonl
{"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}
{"boolean": true, "int": 2, "long": 2, "float": 2.2, "double": 2.2, "bytes": "bar", "string": "bar"}

$ ./columnify -schemaType avro -schemaFile examples/primitives.avsc -recordType jsonl examples/primitives.jsonl > out.parquet

$ parquet-tools schema out.parquet
message Primitives {
  required boolean boolean;
  required int32 int;
  required int64 long;
  required float float;
  required double double;
  required binary bytes;
  required binary string (UTF8);
}

$ parquet-tools cat -json out.parquet
{"boolean":false,"int":1,"long":1,"float":1.1,"double":1.1,"bytes":"Zm9v","string":"foo"}
{"boolean":true,"int":2,"long":2,"float":2.2,"double":2.2,"bytes":"YmFy","string":"bar"}

Supported formats

Input

  • Avro
  • CSV
  • JSONL (newline-delimited JSON)
  • LTSV
  • Msgpack
  • TSV

Output

  • Parquet

Schema

  • Avro
  • BigQuery

Integration example

  • fluent-plugin-s3 parquet compressor

    • An example is available in examples/fluent-plugin-s3
    • It works as a Compressor for fluent-plugin-s3 that writes Parquet files to a temporary location from chunk data.

Additional tips

Set GOGC to reduce memory usage

columnify might consume a lot of memory depending on the value specified by -parquetRowGroupSize. At minimum, it needs as much memory as the row group size; in practice, it consumes more than double the row group size by default. This is due to Go's garbage collection behavior, and memory usage can be reduced by triggering GC more frequently. To adjust the frequency, set the GOGC environment variable.

SetGCPercent sets the garbage collection target percentage: a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. SetGCPercent returns the previous setting. The initial setting is the value of the GOGC environment variable at startup, or 100 if the variable is not set. A negative percentage disables garbage collection.

https://golang.org/pkg/runtime/debug/#SetGCPercent

Of course, frequent GC increases execution time. Test which GOGC value (percent) works best in your environment.
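
For example, to make GC roughly twice as aggressive as the default (GOGC=100) for a single invocation, reusing the example files above:

$ GOGC=50 ./columnify -schemaType avro -schemaFile examples/primitives.avsc -recordType jsonl examples/primitives.jsonl > out.parquet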

Limitations

Currently there are some limitations on schema/record types:

  • Some logical types like Decimal are unsupported.
  • If using -recordType = avro, nested records that have only one subfield are not supported.
  • If using -recordType = avro, bytes fields are implicitly converted to base64-encoded values.
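
Because of the last point, downstream consumers of Avro-derived Parquet may need to decode bytes fields themselves. A minimal sketch using Go's standard library (the "Zm9v" literal is the encoded value from the example output above):

package main

import (
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	// Decode a base64-encoded bytes value as emitted in the Parquet output.
	decoded, err := base64.StdEncoding.DecodeString("Zm9v")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(decoded)) // prints "foo"
}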

Development

Columnifier reads input file(s), converts the format based on the given parameters, and finally writes output file(s). Format conversion is split into schema conversion and record conversion. Schema conversion accepts the input schema and converts it to the target's via Arrow's schema. Record conversion is similar, but the intermediate representation is simply map[string]interface{}, because an Arrow record isn't available as an intermediate. columnify mostly depends on existing modules, but it contains additional avro and parquet modules to fill in missing features.
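
To illustrate the record-side intermediate form, here's a minimal, illustrative sketch (not columnify's actual code; decodeJSONL is a hypothetical helper) that parses JSONL records into the generic map[string]interface{} representation a writer stage would then consume:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// decodeJSONL parses newline-delimited JSON into schema-agnostic maps,
// the kind of intermediate record representation described above.
func decodeJSONL(input string) ([]map[string]interface{}, error) {
	var records []map[string]interface{}
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		var m map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &m); err != nil {
			return nil, err
		}
		records = append(records, m)
	}
	return records, scanner.Err()
}

func main() {
	records, err := decodeJSONL("{\"int\": 1}\n{\"int\": 2}")
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range records {
		fmt.Println(r)
	}
}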

Release

goreleaser is integrated into GitHub Actions and is triggered when a new tag is created. Create a new release with a semver tag (vx.y.z) on this GitHub repo, and archives for several environments will be attached to the release.
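
For example (v1.2.3 is a hypothetical version):

$ git tag v1.2.3
$ git push origin v1.2.3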
