
starlake-ai / starlake

License: Apache-2.0
Starlake is a Spark-based, on-premise and cloud ELT/ETL framework for batch and stream processing.

Programming Languages

Scala, Java, HCL, Shell, Python, Mustache

Projects that are alternatives of or similar to starlake

dbd
dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.
Stars: ✭ 30 (+87.5%)
Mutual labels:  bigquery, etl, snowflake, redshift
carto-spatial-extension
A set of UDFs and Procedures to extend BigQuery, Snowflake, Redshift and Postgres with Spatial Analytics capabilities
Stars: ✭ 131 (+718.75%)
Mutual labels:  bigquery, snowflake, redshift
growthbook
Open Source Feature Flagging and A/B Testing Platform
Stars: ✭ 2,342 (+14537.5%)
Mutual labels:  bigquery, snowflake, redshift
Tbls
tbls is a CI-friendly tool for documenting a database, written in Go.
Stars: ✭ 940 (+5775%)
Mutual labels:  bigquery, snowflake, redshift
astro
Astro allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Stars: ✭ 79 (+393.75%)
Mutual labels:  bigquery, etl, snowflake
Sql Runner
Run templatable playbooks of SQL scripts in series and parallel on Redshift, PostgreSQL, BigQuery and Snowflake
Stars: ✭ 68 (+325%)
Mutual labels:  bigquery, snowflake, redshift
dbt-ml-preprocessing
A SQL port of python's scikit-learn preprocessing module, provided as cross-database dbt macros.
Stars: ✭ 128 (+700%)
Mutual labels:  bigquery, snowflake, redshift
tellery
Tellery lets you build metrics using SQL and bring them to your team. As easy as using a document. As powerful as a data modeling tool.
Stars: ✭ 219 (+1268.75%)
Mutual labels:  bigquery, snowflake, redshift
Locopy
locopy: Loading/Unloading to Redshift and Snowflake using Python.
Stars: ✭ 73 (+356.25%)
Mutual labels:  etl, snowflake, redshift
BQconvert
BigQuery Schema Conversion Tool
Stars: ✭ 20 (+25%)
Mutual labels:  bigquery, redshift
pre-commit-dbt
🎣 List of `pre-commit` hooks to ensure the quality of your `dbt` projects.
Stars: ✭ 149 (+831.25%)
Mutual labels:  bigquery, snowflake
go-bqloader
bqloader is a simple ETL framework to load data from Cloud Storage into BigQuery.
Stars: ✭ 16 (+0%)
Mutual labels:  bigquery, etl
Sqlpad
Web-based SQL editor run in your own private cloud. Supports MySQL, Postgres, SQL Server, Vertica, Crate, ClickHouse, Trino, Presto, SAP HANA, Cassandra, Snowflake, BigQuery, SQLite, and more with ODBC
Stars: ✭ 4,113 (+25606.25%)
Mutual labels:  bigquery, snowflake
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Stars: ✭ 20,147 (+125818.75%)
Mutual labels:  bigquery, redshift
Ddlparse
Parse DDL and convert it to BigQuery JSON schema and DDL statements
Stars: ✭ 52 (+225%)
Mutual labels:  bigquery, redshift
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+231.25%)
Mutual labels:  bigquery, etl
Bitcoin Etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 174 (+987.5%)
Mutual labels:  bigquery, etl
bigquery-kafka-connect
☁️ nodejs kafka connect connector for Google BigQuery
Stars: ✭ 17 (+6.25%)
Mutual labels:  bigquery, etl
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various different tasks, jobs on GCP and AWS.
Stars: ✭ 38 (+137.5%)
Mutual labels:  bigquery, etl
Ethereum Etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (+5875%)
Mutual labels:  bigquery, etl


About Starlake

Complete documentation available here

Introduction

The purpose of this project is to efficiently ingest data from various sources in different formats and make it available for analytics. Ingestion is usually done by hand-writing custom parsers that transform input files into datasets of records.

This project aims to automate that parsing task by making data ingestion purely declarative.

A typical workflow looks like this:

  • Export your data as a set of DSV (delimiter-separated values) or JSON files
  • Describe each DSV/JSON file with a schema written in YAML
  • Configure the ingestion process
  • Watch your data become available as Hive tables in your datalake
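To make the idea of declarative ingestion concrete, here is a minimal Python sketch. It is not Starlake's API, and it does not reproduce Starlake's YAML schema format: the `SCHEMA` dictionary, its keys, the `CONVERTERS` table, and the `parse_dsv` function are all hypothetical names chosen for illustration. The point is only that the schema is plain data and a single generic function turns delimited text into typed records, so no parser has to be written per file.

```python
import csv
import io
from datetime import date

# Hypothetical schema: in a declarative tool this would live in a YAML file.
# The layout below is an assumption for illustration, not Starlake's format.
SCHEMA = {
    "separator": ";",
    "attributes": [
        {"name": "id", "type": "int"},
        {"name": "signup", "type": "date"},
        {"name": "name", "type": "string"},
    ],
}

# Converters from raw text to typed values, keyed by the declared type.
CONVERTERS = {
    "int": int,
    "date": date.fromisoformat,
    "string": str,
}


def parse_dsv(text, schema):
    """Parse delimiter-separated text into typed records driven by the schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=schema["separator"])
    records = []
    for row in reader:
        record = {}
        for attr in schema["attributes"]:
            convert = CONVERTERS[attr["type"]]
            record[attr["name"]] = convert(row[attr["name"]])
        records.append(record)
    return records


sample = "id;signup;name\n1;2021-03-04;Ada\n2;2021-05-06;Grace\n"
records = parse_dsv(sample, SCHEMA)
```

Supporting a new file layout in this sketch means editing `SCHEMA` only; the parsing code stays the same.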

The main advantages of the Starlake Data Pipeline project are:

  • Eliminates manual coding for data ingestion
  • Assigns metadata to each dataset
  • Exposes data ingestion metrics and history
  • Transforms text files into strongly typed records
  • Supports semantic types
  • Enforces privacy on specific fields (GDPR, known in French as RGPD)
  • Is very simple to administer
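Two of these points, semantic types and field-level privacy, can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the `SEMANTIC_TYPES` table, its patterns, and the `matches_semantic_type` and `apply_privacy` functions are hypothetical and are not taken from Starlake. Replacing a value with its SHA-256 digest is just one possible privacy strategy, picked here because it needs only the standard library.

```python
import hashlib
import re

# Hypothetical semantic types: each name maps to a pattern a value must match.
SEMANTIC_TYPES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zipcode": re.compile(r"\d{5}"),
}


def matches_semantic_type(value, type_name):
    """Return True when the whole value matches the pattern of the named type."""
    return SEMANTIC_TYPES[type_name].fullmatch(value) is not None


def apply_privacy(record, private_fields):
    """Return a copy of the record with the listed fields replaced by SHA-256 digests."""
    masked = dict(record)
    for field in private_fields:
        digest = hashlib.sha256(record[field].encode("utf-8")).hexdigest()
        masked[field] = digest
    return masked


record = {"email": "ada@example.com", "zipcode": "75001"}
masked = apply_privacy(record, ["email"])
```

In this sketch a semantic type is a named pattern rather than a storage type, so a five-digit string can be validated as a zip code while still being stored as text.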

How it works

Starlake Data Pipeline automates the loading and parsing of files and their ingestion into a Hadoop Datalake where datasets become available as Hive tables.

Complete Starlake Data Pipeline

  1. Landing Area: Files are first stored in the local file system.
  2. Staging Area: Files associated with a schema are imported into the datalake.
  3. Working Area: Staged files are parsed against their schema; records are rejected or accepted, and made available in parquet/orc/... files as Hive tables.
  4. Business Area: Tables in the working area may be joined to provide a holistic view of the data through the definition of an AutoJob.
  5. Data visualization: parquet/orc/... tables may be exposed in warehouses or Elasticsearch indexes.
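The accept/reject step of the working area can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Starlake's implementation: `validate`, `split_records`, and the shape of the schema (a mapping from field name to a Python type) are hypothetical, and keeping the error messages alongside each rejected record is a design choice of this sketch.

```python
def validate(record, schema):
    """Return a list of error messages; an empty list means the record is accepted."""
    errors = []
    for name, expected_type in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        try:
            expected_type(record[name])
        except (TypeError, ValueError):
            errors.append(f"invalid {expected_type.__name__} for field: {name}")
    return errors


def split_records(records, schema):
    """Route each staged record to the accepted or the rejected list."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record, schema)
        if errors:
            # Keep the raw record with the reasons it failed, for later inspection.
            rejected.append({"record": record, "errors": errors})
        else:
            # Accepted records are stored with their values cast to the declared types.
            accepted.append({name: cast(record[name]) for name, cast in schema.items()})
    return accepted, rejected


# Hypothetical schema and staged rows, all values still raw text.
schema = {"id": int, "amount": float}
staged = [
    {"id": "1", "amount": "9.50"},
    {"id": "x", "amount": "3.00"},
    {"id": "3"},
]
accepted, rejected = split_records(staged, schema)
```

Only the first staged row is accepted here; the second fails the `int` cast and the third lacks the `amount` field, so both are routed to the rejected list with an explanation.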