
basin-etl / basin

Licence: other
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser


Projects that are alternatives of or similar to basin

Wedatasphere
WeDataSphere is a financial-grade, one-stop open-source suite for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+1388%)
Mutual labels:  spark, hadoop, etl
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-36%)
Mutual labels:  pipeline, etl, pyspark
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+988%)
Mutual labels:  spark, pipeline, etl
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+2432%)
Mutual labels:  spark, etl, pyspark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+344%)
Mutual labels:  spark, hadoop, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+56%)
Mutual labels:  hadoop, etl, pyspark
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+1524%)
Mutual labels:  spark, hadoop, pyspark
Dataspherestudio
DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+4680%)
Mutual labels:  spark, hadoop, etl
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+216%)
Mutual labels:  spark, pipeline, etl
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+500%)
Mutual labels:  spark, hadoop, pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-32%)
Mutual labels:  pipeline, etl, pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+36%)
Mutual labels:  hadoop, pyspark
fastdata-cluster
Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN and HDFS with Vagrant and VirtualBox)
Stars: ✭ 20 (-20%)
Mutual labels:  spark, hadoop
swordfish
An open-source distributed workflow scheduling tool that also supports streaming tasks.
Stars: ✭ 35 (+40%)
Mutual labels:  spark, hadoop
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (+100%)
Mutual labels:  spark, pyspark
TIL
Today I Learned
Stars: ✭ 43 (+72%)
Mutual labels:  hadoop, pipeline
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (+4%)
Mutual labels:  spark, pyspark
spark-util
low-level helpers for Apache Spark libraries and tests
Stars: ✭ 16 (-36%)
Mutual labels:  spark, hadoop
etl
M-Lab ingestion pipeline
Stars: ✭ 15 (-40%)
Mutual labels:  pipeline, etl
Addax
Addax is an open-source universal ETL tool that supports most RDBMS and NoSQL databases, helping you transfer data from any one place to another.
Stars: ✭ 615 (+2360%)
Mutual labels:  hadoop, etl

Basin

Extract, transform, and load using visual programming, with Spark jobs that can run in any environment

Create and debug from your browser, and export to pure Python code!

Basin screenshot

Features

  • Up and running as simple as docker pull

  • Create complex pipelines and flows using drag and drop

  • Debug and preview step by step

  • Integrated dataview grid viewer for easier debugging

  • Auto-generates comments so you don't have to

  • Export to beautiful, pure python code

  • Build artifacts for AWS Glue deployment (Work in progress)

Install

Install from dockerhub

$ docker pull zalmane/basin:latest

Create data folder

$ mkdir data

This is the folder that will hold all input and output files.

Run image

Run the image, mapping the data directory to your local environment. This is where input and output files go (extract and load).

docker run --rm -d -v $PWD/data:/opt/basin/data --name basin_server -p 3000:3000 zalmane/basin:latest

That's it. Point your browser to http://localhost:3000 and you're done!

Notes:

  • Metadata is stored in the browser's IndexedDB.

Install from source

Install dev environment with docker

docker-compose up

This will set up two containers: basin-client and basin-server.

That's it. Point your browser to http://localhost:8860 and you're done!

To run npm commands in the basin-client container use:

docker exec basin-client npm <command>

To update changes in py files (block templates, lib), use:

docker exec basin-client npm run build-py

Getting started

Creating sources

A source defines the information needed to parse and import a dataset. Sources are referenced when using an Extract block. The source defines the following information:

  • type of file (delimited, fixed width, JSON, Parquet)
  • a regular expression used to identify the file, matched against the file name
  • information about headers and footers
  • type-specific metadata (for CSV this includes the delimiter, etc.)
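Since the match pattern is applied to incoming file names, its behavior can be sketched with a plain regular expression. The pattern and file names below are hypothetical examples, not Basin defaults:

```python
import re

# Hypothetical match pattern for a source that ingests daily
# transaction extracts, e.g. "transactions_2020-06-01.csv".
SOURCE_PATTERN = re.compile(r"^transactions_\d{4}-\d{2}-\d{2}\.csv$")

def matches_source(filename: str) -> bool:
    """Return True if the file name belongs to this source."""
    return SOURCE_PATTERN.match(filename) is not None

print(matches_source("transactions_2020-06-01.csv"))  # True
print(matches_source("customers_2020-06-01.csv"))     # False
```

Anchoring the pattern with `^` and `$` avoids accidentally matching files that merely contain the source name somewhere in their path.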

Creating a flow

Running and debugging a flow

Exporting to Python code
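The export target is plain Python, so a generated script can run outside the editor. As a rough illustration of the extract → transform → load shape such a script takes (this sketch uses the stdlib csv module purely so it is self-contained; actual Basin exports target Spark/PySpark and will look different):

```python
import csv

# Illustrative extract -> transform -> load functions in the shape an
# exported pipeline takes. The column name "amount" and the filter
# condition are hypothetical examples.
def extract(path):
    """Read a delimited file into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Example step: keep only rows with a positive "amount" column."""
    return [r for r in rows if float(r["amount"]) > 0]

def load(rows, path):
    """Write the transformed rows back out as a delimited file."""
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

Chaining them as `load(transform(extract("data/input.csv")), "data/output.csv")` mirrors the flow the editor builds visually, one function per block.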

Configuration

Extending

Creating new block types

Each block type consists of:

  • Descriptor json
  • code template
  • optional code library template
  • Properties panel
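The descriptor declares a block's identity and the properties the editor should expose for it. The exact schema is not documented here, so the following is only an assumed illustration of the kind of fields a descriptor might carry, written as a Python dict; every field name is hypothetical:

```python
# Hypothetical block descriptor -- field names are illustrative,
# not Basin's actual schema.
descriptor = {
    "type": "filter_rows",        # unique block type id
    "label": "Filter rows",       # name shown in the editor palette
    "category": "transform",      # palette grouping
    "properties": [               # inputs rendered in the properties panel
        {"name": "condition", "kind": "expression", "required": True},
    ],
    "template": "filter_rows.py.tpl",  # code template used on export
}
```

The `properties` entries would drive both the properties panel UI and the variables substituted into the code template.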

Descriptor

Code template

Code library template

Properties panel

License

This program is free software: you can redistribute it and/or modify it under the terms of the Server Side Public License, version 1, as published by MongoDB, Inc. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Server Side Public License for more details. You should have received a copy of the Server Side Public License along with this program. If not, see http://www.mongodb.com/licensing/server-side-public-license

Copyright © 2018-2020 G.M.M Ltd.
