
maxim2266 / csvplus

Licence: BSD-3-Clause
csvplus extends the standard Go encoding/csv package with a fluent interface, lazy stream operations, indices and joins.

Programming Languages

  • Go: 31211 projects (#10 most used programming language)
  • Makefile: 30231 projects

Projects that are alternatives to, or similar to, csvplus

etlflow
EtlFlow is an ecosystem of functional libraries in Scala, based on ZIO, for writing various tasks and jobs on GCP and AWS.
Stars: ✭ 38 (-43.28%)
Mutual labels:  etl, etl-framework, etl-pipeline
datalake-etl-pipeline
A simplified ETL process for Hadoop using Apache Spark, with a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions and DataFrame transformations.
Stars: ✭ 39 (-41.79%)
Mutual labels:  etl, etl-framework, etl-pipeline
redis-connect-dist
Real-Time Event Streaming & Change Data Capture
Stars: ✭ 21 (-68.66%)
Mutual labels:  etl, etl-framework, etl-pipeline
hamilton
A scalable, general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.
Stars: ✭ 612 (+813.43%)
Mutual labels:  etl, etl-framework, etl-pipeline
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-70.15%)
Mutual labels:  etl, etl-framework, etl-pipeline
DaFlow
An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-64.18%)
Mutual labels:  etl, etl-framework, etl-pipeline
vixtract
www.vixtract.ru
Stars: ✭ 40 (-40.3%)
Mutual labels:  etl, etl-framework, etl-pipeline
Riko
A Python stream processing engine modeled after Yahoo! Pipes
Stars: ✭ 1,571 (+2244.78%)
Mutual labels:  etl, stream-processing
Openkettlewebui
A Kettle-based web platform for scheduling and controlling data processing. It supports both file-based and database repositories, controls Kettle data transformations through the web platform, and can be integrated into existing systems as middleware.
Stars: ✭ 125 (+86.57%)
Mutual labels:  etl, etl-framework
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+88.06%)
Mutual labels:  etl, etl-framework
Bender
Bender - Serverless ETL Framework
Stars: ✭ 171 (+155.22%)
Mutual labels:  etl, etl-framework
Hale
(Spatial) data harmonisation with hale studio (formerly HUMBOLDT Alignment Editor)
Stars: ✭ 84 (+25.37%)
Mutual labels:  etl, etl-framework
Pyetl
A Python ETL framework
Stars: ✭ 33 (-50.75%)
Mutual labels:  etl, etl-framework
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (+114.93%)
Mutual labels:  etl, etl-framework
Etlbox
A lightweight ETL (extract, transform, load) library and data integration toolbox for .NET.
Stars: ✭ 203 (+202.99%)
Mutual labels:  etl, etl-framework
Transformalize
Configurable Extract, Transform, and Load
Stars: ✭ 125 (+86.57%)
Mutual labels:  etl, etl-framework
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (-4.48%)
Mutual labels:  etl, etl-framework
Metl
mito ETL tool
Stars: ✭ 153 (+128.36%)
Mutual labels:  etl, etl-framework
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-70.15%)
Mutual labels:  etl, etl-pipeline
blockchain-etl-streaming
Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (-14.93%)
Mutual labels:  etl, stream-processing

csvplus

GoDoc Go Report Card License: BSD 3-Clause

Package csvplus extends the standard Go encoding/csv package with a fluent interface, lazy stream processing operations, indices and joins.

The library is primarily designed for ETL-like processes. It is mostly useful where the more advanced searching and joining capabilities of a fully-featured SQL database are not required, but the data transformations needed still include SQL-like operations.

License: BSD

Examples

Simple sequential processing:

// Keep only the rows where the name is "Amelia", rename them to "Julia",
// then store the name and surname columns in out.csv.
people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	Map(func(row csvplus.Row) csvplus.Row { row["name"] = "Julia"; return row }).
	ToCsvFile("out.csv", "name", "surname")

if err != nil {
	return err
}
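
The same pipeline can also be directed at a different sink. A minimal sketch, assuming ToJSONFile (listed under "Design principles" below) takes only the destination file name:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

// The same lazy pipeline as above, serialised as JSON instead of .csv.
// Assumption: ToJSONFile accepts just the output file name.
err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	ToJSONFile("out.json")

if err != nil {
	return err
}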

More involved example:

customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
custIndex, err := csvplus.Take(customers).UniqueIndexOn("id")

if err != nil {
	return err
}

products := csvplus.FromFile("stock.csv").SelectColumns("prod_id", "product", "price")
prodIndex, err := csvplus.Take(products).UniqueIndexOn("prod_id")

if err != nil {
	return err
}

orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
// Join each order to its customer (cust_id against the customer index)
// and then to its product (on the shared prod_id column).
iter := csvplus.Take(orders).Join(custIndex, "cust_id").Join(prodIndex)

return iter(func(row csvplus.Row) error {
	// prints lines like:
	//	John Doe bought 38 oranges for £0.03 each on 2016-09-14T08:48:22+01:00
	_, e := fmt.Printf("%s %s bought %s %ss for £%s each on %s\n",
		row["name"], row["surname"], row["qty"], row["product"], row["price"], row["ts"])
	return e
})
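
Note that both indices above are built eagerly and hold the contents of people.csv and stock.csv in memory (see Type Index below), while orders.csv is only read when iter is invoked, one row at a time.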

Design principles

The package functionality is built around operations on the following entities:

  • type Row
  • type DataSource
  • type Index

Type Row

Row represents one row from a DataSource. It is a map from column names to the string values under those columns in the current row. The package expects every column at the source to have a unique name; compared to integer indices, named columns are more convenient when complex transformations are applied to each row during processing.
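
For illustration, a Row can be handled like an ordinary Go map keyed by column name; a minimal sketch (the string-to-number conversion is an assumption about typical usage, not part of the package API):

// A Row maps column names to the string values of the current record;
// constructing one as a map literal mirrors the csvplus.Like example above.
row := csvplus.Row{"name": "Amelia", "surname": "Pond", "qty": "38"}

// All values are strings, so numeric columns must be parsed explicitly.
qty, err := strconv.Atoi(row["qty"])

if err != nil {
	return err
}

fmt.Println(row["name"], row["surname"], qty)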

Type DataSource

Type DataSource represents any source of zero or more rows, such as a .csv file. A DataSource is a function that, when invoked, feeds the given callback with the data from its source, one Row at a time. The type also has a number of operations defined on it that make it easy to compose transformations of the DataSource, forming a so-called fluent interface. All these operations are 'lazy': they are not performed immediately; instead, each of them returns a new DataSource.
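
Because every operation returns a new DataSource and nothing is executed until the DataSource function is invoked, a pipeline can be built up in named steps; the first example above could equally be written as:

// Each step below only describes a transformation; no file I/O happens yet.
people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

amelias := csvplus.Take(people).Filter(csvplus.Like(csvplus.Row{"name": "Amelia"}))

renamed := amelias.Map(func(row csvplus.Row) csvplus.Row {
	row["name"] = "Julia"
	return row
})

// Only this terminal operation invokes the chain and reads people.csv.
if err := renamed.ToCsvFile("out.csv", "name", "surname"); err != nil {
	return err
}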

There are also a number of convenience operations that actually invoke the DataSource function to produce a specific kind of output (see the sketch after this list):

  • IndexOn to build an index on the specified column(s);
  • UniqueIndexOn to build a unique index on the specified column(s);
  • ToCsv to serialise the DataSource to the given io.Writer in .csv format;
  • ToCsvFile to store the DataSource in the specified file in .csv format;
  • ToJSON to serialise the DataSource to the given io.Writer in JSON format;
  • ToJSONFile to store the DataSource in the specified file in JSON format;
  • ToRows to convert the DataSource to a slice of Rows.
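
For instance, a minimal sketch of materialising a DataSource in memory with ToRows, assuming it returns the collected rows along with an error:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

// Assumption: ToRows returns ([]csvplus.Row, error); the resulting slice
// holds the entire file, so this is only suitable for data that fits in memory.
rows, err := csvplus.Take(people).ToRows()

if err != nil {
	return err
}

fmt.Println("loaded", len(rows), "rows")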

Type Index

Index is a sorted collection of rows. The sorting is performed on the columns specified when the index is created, and iterating over an index yields a sorted sequence of rows. An Index can be joined with a DataSource. The type has operations for finding rows and creating sub-indices in O(log(n)) time; another useful operation is resolving duplicates. Building an index takes O(n*log(n)) time. Note that building an Index requires the entire dataset to be read into memory, so some care should be taken when indexing huge datasets. An index can also be stored to, or loaded from, a disk file.
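
As an illustration, a non-unique index built with IndexOn can be joined in exactly the same way as the unique indices in the example above; the file and column names below are the same hypothetical ones used earlier:

// Orders indexed by customer id: a customer may place many orders,
// so IndexOn is used here rather than UniqueIndexOn.
orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
ordersByCustomer, err := csvplus.Take(orders).IndexOn("cust_id")

if err != nil {
	return err
}

// The whole of orders.csv is now sorted and held in memory; joining each
// customer against the index yields one row per matching order.
customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
iter := csvplus.Take(customers).Join(ordersByCustomer, "id")

return iter(func(row csvplus.Row) error {
	_, e := fmt.Printf("%s %s ordered %s of product %s\n",
		row["name"], row["surname"], row["qty"], row["prod_id"])
	return e
})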

For more details see the documentation.

Project status

The project is in a usable state usually called "beta". Tested on Linux Mint 18.3 using Go version 1.10.2.
