
basin-etl / basin

Licence: other
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser


Projects that are alternatives of or similar to basin

Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+1388%)
Mutual labels:  spark, hadoop, etl
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-36%)
Mutual labels:  pipeline, etl, pyspark
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+988%)
Mutual labels:  spark, pipeline, etl
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+2432%)
Mutual labels:  spark, etl, pyspark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+344%)
Mutual labels:  spark, hadoop, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+56%)
Mutual labels:  hadoop, etl, pyspark
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+1524%)
Mutual labels:  spark, hadoop, pyspark
Dataspherestudio
DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+4680%)
Mutual labels:  spark, hadoop, etl
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+216%)
Mutual labels:  spark, pipeline, etl
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+500%)
Mutual labels:  spark, hadoop, pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-32%)
Mutual labels:  pipeline, etl, pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+36%)
Mutual labels:  hadoop, pyspark
fastdata-cluster
Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN and HDFS with Vagrant and VirtualBox)
Stars: ✭ 20 (-20%)
Mutual labels:  spark, hadoop
swordfish
An open-source distributed workflow scheduling tool that also supports streaming tasks.
Stars: ✭ 35 (+40%)
Mutual labels:  spark, hadoop
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (+100%)
Mutual labels:  spark, pyspark
TIL
Today I Learned
Stars: ✭ 43 (+72%)
Mutual labels:  hadoop, pipeline
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (+4%)
Mutual labels:  spark, pyspark
spark-util
low-level helpers for Apache Spark libraries and tests
Stars: ✭ 16 (-36%)
Mutual labels:  spark, hadoop
etl
M-Lab ingestion pipeline
Stars: ✭ 15 (-40%)
Mutual labels:  pipeline, etl
Addax
Addax is an open-source universal ETL tool that supports most RDBMS and NoSQL databases, helping you transfer data from any one place to another.
Stars: ✭ 615 (+2360%)
Mutual labels:  hadoop, etl

Basin

Extract, transform, and load using visual programming, with Spark jobs that can run in any environment

Create and debug from your browser, and export to pure Python code!

Basin screenshot

Features

  • Up and running as simple as docker pull

  • Create complex pipelines and flows using drag and drop

  • Debug and preview step by step

  • Integrated dataview grid viewer for easier debugging

  • Auto-generates comments so you don't have to

  • Export to beautiful, pure python code

  • Build artifacts for AWS Glue deployment (Work in progress)

Install

Install from dockerhub

$ docker pull zalmane/basin:latest

Create data folder

$ mkdir data

This is the folder that will hold all input and output files.

Run image

Run the image, mapping the data directory to your local environment. This is where input and output files go (extract and load).

docker run --rm -d -v $PWD/data:/opt/basin/data --name basin_server -p 3000:3000 zalmane/basin:latest

That's it. Point your browser to http://localhost:3000 and you're done!

Notes:

  • Metadata is stored in the browser's IndexedDB.

Install from source

Install dev environment with docker

docker-compose up

This will set up two containers: basin-client and basin-server.

That's it. Point your browser to http://localhost:8860 and you're done!

To run npm commands in the basin-client container use:

docker exec basin-client npm <command>

To update changes in py files (block templates, lib), use:

docker exec basin-client npm run build-py

Getting started

Creating sources

A source defines the information needed to parse and import a dataset. Sources are referenced when using an Extract block. The source defines the following information:

  • type of file (delimited, fixed width, JSON, Parquet)
  • a regular expression used to identify the file, matched against the file name
  • information about headers and footers
  • type-specific metadata (for CSV this includes the delimiter, etc.)
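Since the match pattern is applied to incoming file names, its behavior can be sketched with a plain regular expression. The pattern and file names below are hypothetical examples, not Basin defaults:

```python
import re

# Hypothetical match pattern for a source that ingests daily
# transaction extracts, e.g. "transactions_2020-06-01.csv".
SOURCE_PATTERN = re.compile(r"^transactions_\d{4}-\d{2}-\d{2}\.csv$")

def matches_source(filename: str) -> bool:
    """Return True if the file name belongs to this source."""
    return SOURCE_PATTERN.match(filename) is not None

print(matches_source("transactions_2020-06-01.csv"))  # True
print(matches_source("customers_2020-06-01.csv"))     # False
```

Anchoring the pattern with `^` and `$` avoids accidentally matching files that merely contain the source name somewhere in their path.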

Creating a flow

Running and debugging a flow

Exporting to Python code
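The export target is plain Python, so a generated script can run outside the editor. As a rough illustration of the extract → transform → load shape such a script takes (this sketch uses the stdlib csv module purely so it is self-contained; actual Basin exports target Spark/PySpark and will look different):

```python
import csv

# Illustrative extract -> transform -> load functions in the shape an
# exported pipeline takes. The column name "amount" and the filter
# condition are hypothetical examples.
def extract(path):
    """Read a delimited file into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Example step: keep only rows with a positive "amount" column."""
    return [r for r in rows if float(r["amount"]) > 0]

def load(rows, path):
    """Write the transformed rows back out as a delimited file."""
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

Chaining them as `load(transform(extract("data/input.csv")), "data/output.csv")` mirrors the flow the editor builds visually, one function per block.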

Configuration

Extending

Creating new block types

Each block type consists of:

  • Descriptor json
  • code template
  • optional code library template
  • Properties panel
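The descriptor declares a block's identity and the properties the editor should expose for it. The exact schema is not documented here, so the following is only an assumed illustration of the kind of fields a descriptor might carry, written as a Python dict; every field name is hypothetical:

```python
# Hypothetical block descriptor -- field names are illustrative,
# not Basin's actual schema.
descriptor = {
    "type": "filter_rows",        # unique block type id
    "label": "Filter rows",       # name shown in the editor palette
    "category": "transform",      # palette grouping
    "properties": [               # inputs rendered in the properties panel
        {"name": "condition", "kind": "expression", "required": True},
    ],
    "template": "filter_rows.py.tpl",  # code template used on export
}
```

The `properties` entries would drive both the properties panel UI and the variables substituted into the code template.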

Descriptor

Code template

Code library template

Properties panel

License

This program is free software: you can redistribute it and/or modify it under the terms of the Server Side Public License, version 1, as published by MongoDB, Inc. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Server Side Public License for more details. You should have received a copy of the Server Side Public License along with this program. If not, see http://www.mongodb.com/licensing/server-side-public-license

Copyright © 2018-2020 G.M.M Ltd.
