jjmontesl / cubetl

License: MIT license
CubETL - Framework and tool for data ETL (Extract, Transform and Load) in Python (PERSONAL PROJECT / SELDOM MAINTAINED)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to cubetl

qwery
A SQL-like language for performing ETL transformations.
Stars: ✭ 28 (+33.33%)
Mutual labels:  csv, etl, etl-framework
Pyetl
python ETL framework
Stars: ✭ 33 (+57.14%)
Mutual labels:  csv, etl, etl-framework
Choetl
ETL Framework for .NET / c# (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
Stars: ✭ 372 (+1671.43%)
Mutual labels:  csv, etl, etl-framework
redis-connect-dist
Real-Time Event Streaming & Change Data Capture
Stars: ✭ 21 (+0%)
Mutual labels:  csv, etl, etl-framework
DaFlow
Apache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (+14.29%)
Mutual labels:  csv, etl, etl-framework
Csv2db
The CSV to database command line loader
Stars: ✭ 102 (+385.71%)
Mutual labels:  csv, etl
Etl.net
Mass processing data with a complete ETL for .net developers
Stars: ✭ 129 (+514.29%)
Mutual labels:  csv, etl
vixtract
www.vixtract.ru
Stars: ✭ 40 (+90.48%)
Mutual labels:  etl, etl-framework
BETL-old
BETL. Meta data driven ETL generation using T-SQL
Stars: ✭ 17 (-19.05%)
Mutual labels:  etl, etl-framework
Ethereum Etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (+4452.38%)
Mutual labels:  csv, etl
DIRECT
DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-4.76%)
Mutual labels:  etl, etl-framework
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various different tasks, jobs on GCP and AWS.
Stars: ✭ 38 (+80.95%)
Mutual labels:  etl, etl-framework
Etl with python
ETL with Python - Taught at DWH course 2017 (TAU)
Stars: ✭ 68 (+223.81%)
Mutual labels:  csv, etl
Ether sql
A python library to push ethereum blockchain data into an sql database.
Stars: ✭ 41 (+95.24%)
Mutual labels:  csv, etl
Omniparser
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
Stars: ✭ 148 (+604.76%)
Mutual labels:  csv, etl
link-move
A model-driven dynamically-configurable framework to acquire data from external sources and save it to your database.
Stars: ✭ 32 (+52.38%)
Mutual labels:  etl, etl-framework
csvplus
csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream operations, indices and joins.
Stars: ✭ 67 (+219.05%)
Mutual labels:  etl, etl-framework
flock
Flock: A Low-Cost Streaming Query Engine on FaaS Platforms
Stars: ✭ 232 (+1004.76%)
Mutual labels:  etl, olap
OpenKettleWebUI
A Kettle-based web scheduling and control platform for data processing. Supports file repositories and database repositories, controls Kettle data transformations through the web platform, and can be integrated into existing systems as middleware.
Stars: ✭ 138 (+557.14%)
Mutual labels:  etl, etl-framework
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+85.71%)
Mutual labels:  etl, etl-framework

CubETL

CubETL is a tool for data ETL (Extract, Transform and Load).

CubETL provides a mechanism to run data items through a processing pipeline. It takes care of initializing only the components used by the process, manages the data flow across the process graph, and handles logging and cleanup. It can be used as a command-line tool or from Python code.
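Conceptually, a pipeline is just a chain of steps that each consume and produce data items. The following plain-Python sketch (illustrative only, not CubETL syntax) shows the idea of items flowing through such a chain:

# Illustrative only: data items (dicts) flowing through a chain of processing steps.
def read_rows():
    yield {"name": "alice", "amount": "10"}
    yield {"name": "bob", "amount": "5"}

def to_int(items, field):
    # Transform step: convert one field to an integer
    for item in items:
        item[field] = int(item[field])
        yield item

def printer(items):
    # Output step: print each item and pass it along
    for item in items:
        print(item)
        yield item

# Equivalent to a small process graph: read -> convert -> print
for _ in printer(to_int(read_rows(), "amount")):
    pass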

Features:

  • Consumes and produces CSV, XML, JSON...
  • SQL support (querying, inserting/updating, schema creation, schema loading)
  • OLAP support:
    • Star-schema generation and data loading
    • SQL-to-OLAP schema generator
    • Cubes OLAP Server model export
    • SDMX and PC-Axis read support
  • Text templating, GeoIP, HTTP requests...
  • Insert / upsert for memory tables, SQL tables and OLAP entities
  • Extensible
  • Caching

CubETL provides out-of-the-box components that handle common data formats, and it includes SQL and OLAP modules that understand SQL and OLAP schemas and map data between them. This allows CubETL to insert OLAP facts across multiple tables in a single store operation, automatically performing (and caching) the appropriate dimension lookups.
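For reference, the manual pattern that this store operation automates looks roughly like the following (a plain Python/sqlite3 sketch, illustrative only; the table and column names are made up and this is not CubETL code):

# Illustrative only: lookup-and-insert for a star schema, with a dimension cache.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
""")

_dim_cache = {}  # cache dimension lookups so repeated values hit the database only once

def product_key(name):
    """Return the surrogate key for a product, inserting the dimension row if missing."""
    if name in _dim_cache:
        return _dim_cache[name]
    row = conn.execute("SELECT id FROM dim_product WHERE name = ?", (name,)).fetchone()
    key = row[0] if row else conn.execute(
        "INSERT INTO dim_product (name) VALUES (?)", (name,)).lastrowid
    _dim_cache[name] = key
    return key

def store_fact(item):
    """Store one fact row, resolving its dimension reference first."""
    conn.execute("INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)",
                 (product_key(item["product"]), item["amount"]))

for item in [{"product": "apple", "amount": 3.0}, {"product": "apple", "amount": 1.5}]:
    store_fact(item)
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())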

CubETL can also inspect an existing relational database and generate an OLAP schema from it, and vice versa. It can also produce configuration for the Databrewery Cubes server. Altogether, this allows for a quick analytical inspection of an arbitrary database (see the examples below).

Note: This project is at a development stage and has been tested in only a few environments. Documentation is incomplete and you will hit issues. Please use the issue tracker for bugs, questions, suggestions and contributions.

Download / Install

While CubETL is in development, no pip packages are provided; it should be installed using python setup.py develop. Using a virtualenv or a Docker instance is recommended.

git clone https://github.com/jjmontesl/cubetl.git
cd cubetl

# Using a virtualenv is usually recommended:
python3 -m venv env
. env/bin/activate

# Install dependencies (this command is for Ubuntu)
sudo apt-get install python3-dev libxml2-dev libxslt1-dev zlib1g-dev

# Install CubETL (in development mode, so you can make changes to the source)
python setup.py develop

# Test the executable
cubetl -h

Install on docker

First, install Docker. Then:

docker-compose build

Test the executable (quick):

docker-compose run cubetls /usr/local/bin/cubetl -h

Optionally, run tests:

# Run pytest tests (this is currently quite slow: 20-30 minutes):
docker-compose run cubetls pytest -v

# Run one pytest:
docker-compose run cubetls python -m pytest -v -k test_config_new

Usage

CubETL provides a command-line tool, cubetl:

cubetl [-dd] [-q] [-h] [-p property=value] [-m attribute=value] [config.py ...] <start-node>

    -p   set a context property
    -m   set an attribute for the start item
    -d   debug mode (can be used twice for extra debug)
    -q   quiet mode (bypass print nodes)
    -l   list config nodes ('cubetl.config.list' as start-node)
    -h   show this help and exit
    -v   print version and exit

Each CubETL configuration can contain one or more process nodes. You must specify the list of configuration files (.py files), followed by the process nodes you want to run.

Note that, when running a complete ETL process, you need to remove print nodes or use the -q command-line option to suppress printing to standard output, which will otherwise heavily slow down the process.
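For example (the property name below is illustrative; myprocess.py and myprocess.process refer to the example config created in the section further down):

# Run the 'myprocess.process' node quietly, setting a context property
cubetl -q -p myproperty=somevalue myprocess.py myprocess.process

# List the nodes defined by a configuration file (see the -l option above)
cubetl -l myprocess.py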

You can also use CubETL directly from Python code.

Examples

Visualizing a SQL database

CubETL can inspect a SQL database and generate a CubETL OLAP schema and SQL mappings for it. The resulting schema can then be visualized using CubesViewer:

# Inspect database and generate a cubes model and config
cubetl cubetl.sql.db2sql cubetl.olap.sql2olap cubetl.cubes.olap2cubes \
    -p db2sql.db_url=sqlite:///mydb.sqlite3 \
    -p olap2cubes.cubes_model=mydb.cubes-model.json \
    -p olap2cubes.cubes_config=mydb.cubes-config.ini

# Run cubes server (in background with &)
pip install https://github.com/DataBrewery/cubes/archive/master.zip click flask --upgrade
slicer serve mydb.cubes-config.ini &

# Run a local cubesviewer HTTP server (also opens a browser)
# NOTE: not yet available, please download and use CubesViewer manually!
pip install cubesviewer-utils
cvutils cv

This will open a browser showing a local CubesViewer instance connected to the previously launched Cubes server. Alternatively, you can download CubesViewer and load the HTML application locally.

The CubETL project contains an example database that you can use to test this (see the "Generate OLAP schema from SQL database and visualize" example below).

You can control the schema generation process with options. Check the documentation below for further information.

Creating a new ETL process config

Create a new directory for your ETL process and inside it run:

cubetl cubetl.config.new -p config.name=myprocess

This will create a new file myprocess.py, which you can use as a template for your new ETL process.

The created example config includes an entry node called myprocess.process, which simply prints a message to the console. You can test your ETL process using:

cubetl myprocess.py myprocess.process

See the example ETL processes below for more examples, and the documentation section for information about how to define ETL processes.

Example ETL processes

Example ETL processes included with the project:

To run these examples you'll need the examples directory of the cubetl project, which is not included in the PyPI pip download. You can get it by cloning the cubetl project repository (git clone https://github.com/jjmontesl/cubetl.git) or by downloading the packaged project.

Running from Python

In order to configure and/or run a process from client code, use:

import cubetl

# Create Cubetl context
ctx = cubetl.cubetl()

# Add components or include a configuration file...
ctx.add('your_app.node_name', ...)

# Launch process
result = ctx.run("your_app.node_name")

See the examples/python directory for a full working example.

Documentation

Guide:

  • Sequential formats (text files, CSV, JSON...)
  • Tables
  • Process flow (unions, chain forks, repeats...)
  • SQL
  • OLAP
  • Script nodes, custom functions, custom nodes and custom components

Config Library:

Doc:

Examples:

  • See "Example ETL processes" above

Support

If you have questions, problems or suggestions, please use the GitHub issue tracker.

If you are using or trying CubETL, please tweet #cubetl.

Source

Github source repository: https://github.com/jjmontesl/cubetl

Authors

CubETL is written and maintained by Jose Juan Montes.

See AUTHORS file for more information.

License

CubETL is published under MIT license. For full license see the LICENSE file.
