
maxim2266 / csvplus

Licence: BSD-3-Clause
csvplus extends the standard Go encoding/csv package with a fluent interface, lazy stream operations, indices and joins.

Programming Languages

  • Go: 31211 projects (#10 most used programming language)
  • Makefile: 30231 projects

Projects that are alternatives to, or similar to, csvplus

etlflow
EtlFlow is an ecosystem of functional libraries in Scala, based on ZIO, for writing various tasks and jobs on GCP and AWS.
Stars: ✭ 38 (-43.28%)
Mutual labels:  etl, etl-framework, etl-pipeline
datalake-etl-pipeline
A simplified ETL process for Hadoop using Apache Spark, with a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions and DataFrame transformations.
Stars: ✭ 39 (-41.79%)
Mutual labels:  etl, etl-framework, etl-pipeline
redis-connect-dist
Real-Time Event Streaming & Change Data Capture
Stars: ✭ 21 (-68.66%)
Mutual labels:  etl, etl-framework, etl-pipeline
hamilton
A scalable, general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.
Stars: ✭ 612 (+813.43%)
Mutual labels:  etl, etl-framework, etl-pipeline
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-70.15%)
Mutual labels:  etl, etl-framework, etl-pipeline
DaFlow
An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-64.18%)
Mutual labels:  etl, etl-framework, etl-pipeline
vixtract
www.vixtract.ru
Stars: ✭ 40 (-40.3%)
Mutual labels:  etl, etl-framework, etl-pipeline
Riko
A Python stream processing engine modeled after Yahoo! Pipes
Stars: ✭ 1,571 (+2244.78%)
Mutual labels:  etl, stream-processing
Openkettlewebui
A Kettle-based web platform for scheduling and controlling data processing. It supports both file-based and database repositories, controls Kettle data transformations through the web platform, and can be integrated into existing systems as middleware.
Stars: ✭ 125 (+86.57%)
Mutual labels:  etl, etl-framework
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+88.06%)
Mutual labels:  etl, etl-framework
Bender
Bender - Serverless ETL Framework
Stars: ✭ 171 (+155.22%)
Mutual labels:  etl, etl-framework
Hale
(Spatial) data harmonisation with hale studio (formerly HUMBOLDT Alignment Editor)
Stars: ✭ 84 (+25.37%)
Mutual labels:  etl, etl-framework
Pyetl
A Python ETL framework
Stars: ✭ 33 (-50.75%)
Mutual labels:  etl, etl-framework
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (+114.93%)
Mutual labels:  etl, etl-framework
Etlbox
A lightweight ETL (extract, transform, load) library and data integration toolbox for .NET.
Stars: ✭ 203 (+202.99%)
Mutual labels:  etl, etl-framework
Transformalize
Configurable Extract, Transform, and Load
Stars: ✭ 125 (+86.57%)
Mutual labels:  etl, etl-framework
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (-4.48%)
Mutual labels:  etl, etl-framework
Metl
mito ETL tool
Stars: ✭ 153 (+128.36%)
Mutual labels:  etl, etl-framework
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-70.15%)
Mutual labels:  etl, etl-pipeline
blockchain-etl-streaming
Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (-14.93%)
Mutual labels:  etl, stream-processing

csvplus

GoDoc Go Report Card License: BSD 3-Clause

Package csvplus extends the standard Go encoding/csv package with a fluent interface, lazy stream processing operations, indices and joins.

The library is primarily designed for ETL-like processes. It is mostly useful where the more advanced searching and joining capabilities of a fully-featured SQL database are not required, but the data transformations needed still include SQL-like operations.

License: BSD

Examples

Simple sequential processing:

// Keep only the rows where the name is "Amelia", rename them to "Julia",
// then store the name and surname columns in out.csv.
people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	Map(func(row csvplus.Row) csvplus.Row { row["name"] = "Julia"; return row }).
	ToCsvFile("out.csv", "name", "surname")

if err != nil {
	return err
}
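
The same pipeline can also be directed at a different sink. A minimal sketch, assuming ToJSONFile (listed under "Design principles" below) takes only the destination file name:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

// The same lazy pipeline as above, serialised as JSON instead of .csv.
// Assumption: ToJSONFile accepts just the output file name.
err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	ToJSONFile("out.json")

if err != nil {
	return err
}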

More involved example:

customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
custIndex, err := csvplus.Take(customers).UniqueIndexOn("id")

if err != nil {
	return err
}

products := csvplus.FromFile("stock.csv").SelectColumns("prod_id", "product", "price")
prodIndex, err := csvplus.Take(products).UniqueIndexOn("prod_id")

if err != nil {
	return err
}

orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
// Join each order to its customer (cust_id against the customer index)
// and then to its product (on the shared prod_id column).
iter := csvplus.Take(orders).Join(custIndex, "cust_id").Join(prodIndex)

return iter(func(row csvplus.Row) error {
	// prints lines like:
	//	John Doe bought 38 oranges for £0.03 each on 2016-09-14T08:48:22+01:00
	_, e := fmt.Printf("%s %s bought %s %ss for £%s each on %s\n",
		row["name"], row["surname"], row["qty"], row["product"], row["price"], row["ts"])
	return e
})
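
Note that both indices above are built eagerly and hold the contents of people.csv and stock.csv in memory (see Type Index below), while orders.csv is only read when iter is invoked, one row at a time.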

Design principles

The package functionality is built around operations on the following entities:

  • type Row
  • type DataSource
  • type Index

Type Row

Row represents one row from a DataSource. It is a map from column names to the string values under those columns in the current row. The package expects every column at the source to have a unique name; compared to integer indices, named columns are more convenient when complex transformations are applied to each row during processing.
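
For illustration, a Row can be handled like an ordinary Go map keyed by column name; a minimal sketch (the string-to-number conversion is an assumption about typical usage, not part of the package API):

// A Row maps column names to the string values of the current record;
// constructing one as a map literal mirrors the csvplus.Like example above.
row := csvplus.Row{"name": "Amelia", "surname": "Pond", "qty": "38"}

// All values are strings, so numeric columns must be parsed explicitly.
qty, err := strconv.Atoi(row["qty"])

if err != nil {
	return err
}

fmt.Println(row["name"], row["surname"], qty)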

Type DataSource

Type DataSource represents any source of zero or more rows, such as a .csv file. A DataSource is a function that, when invoked, feeds the given callback with the data from its source, one Row at a time. The type also has a number of operations defined on it that make it easy to compose transformations of the DataSource, forming a so-called fluent interface. All these operations are 'lazy': they are not performed immediately; instead, each of them returns a new DataSource.
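
Because every operation returns a new DataSource and nothing is executed until the DataSource function is invoked, a pipeline can be built up in named steps; the first example above could equally be written as:

// Each step below only describes a transformation; no file I/O happens yet.
people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

amelias := csvplus.Take(people).Filter(csvplus.Like(csvplus.Row{"name": "Amelia"}))

renamed := amelias.Map(func(row csvplus.Row) csvplus.Row {
	row["name"] = "Julia"
	return row
})

// Only this terminal operation invokes the chain and reads people.csv.
if err := renamed.ToCsvFile("out.csv", "name", "surname"); err != nil {
	return err
}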

There are also a number of convenience operations that actually invoke the DataSource function to produce a specific kind of output (see the sketch after this list):

  • IndexOn to build an index on the specified column(s);
  • UniqueIndexOn to build a unique index on the specified column(s);
  • ToCsv to serialise the DataSource to the given io.Writer in .csv format;
  • ToCsvFile to store the DataSource in the specified file in .csv format;
  • ToJSON to serialise the DataSource to the given io.Writer in JSON format;
  • ToJSONFile to store the DataSource in the specified file in JSON format;
  • ToRows to convert the DataSource to a slice of Rows.
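
For instance, a minimal sketch of materialising a DataSource in memory with ToRows, assuming it returns the collected rows along with an error:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

// Assumption: ToRows returns ([]csvplus.Row, error); the resulting slice
// holds the entire file, so this is only suitable for data that fits in memory.
rows, err := csvplus.Take(people).ToRows()

if err != nil {
	return err
}

fmt.Println("loaded", len(rows), "rows")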

Type Index

Index is a sorted collection of rows. The sorting is performed on the columns specified when the index is created, and iterating over an index yields a sorted sequence of rows. An Index can be joined with a DataSource. The type has operations for finding rows and creating sub-indices in O(log(n)) time; another useful operation is resolving duplicates. Building an index takes O(n*log(n)) time. Note that building an Index requires the entire dataset to be read into memory, so some care should be taken when indexing huge datasets. An index can also be stored to, or loaded from, a disk file.
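
As an illustration, a non-unique index built with IndexOn can be joined in exactly the same way as the unique indices in the example above; the file and column names below are the same hypothetical ones used earlier:

// Orders indexed by customer id: a customer may place many orders,
// so IndexOn is used here rather than UniqueIndexOn.
orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
ordersByCustomer, err := csvplus.Take(orders).IndexOn("cust_id")

if err != nil {
	return err
}

// The whole of orders.csv is now sorted and held in memory; joining each
// customer against the index yields one row per matching order.
customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
iter := csvplus.Take(customers).Join(ordersByCustomer, "id")

return iter(func(row csvplus.Row) error {
	_, e := fmt.Printf("%s %s ordered %s of product %s\n",
		row["name"], row["surname"], row["qty"], row["prod_id"])
	return e
})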

For more details see the documentation.

Project status

The project is in a usable state usually called "beta". Tested on Linux Mint 18.3 using Go version 1.10.2.
