
License: GPL-3.0
Smart Automation Tool for building modern Data Lakes and Data Pipelines


Smart Data Lake


Smart Data Lake Builder is a data lake automation framework that makes loading and transforming data a breeze. It is implemented in Scala and builds on top of open-source big data technologies like Apache Hadoop and Apache Spark, including connectors for diverse data sources (HadoopFS, Hive, DeltaLake, JDBC, Splunk, Webservice, SFTP, JMS, Excel, Access) and file formats.

A Data Lake

  • is a central raw data store for analytics
  • facilitates cheap raw storage to handle growing volumes of data
  • enables top-notch artificial intelligence (AI) and machine learning (ML) for data-driven enterprises

The Smart Data Lake adds

  • a layered data architecture that provides not only raw data, but also prepared, secured, high-quality data organized by business entities and ready to use for analytical use cases, also called «Smart Data». This is comparable to the Databricks Lakehouse architecture; in fact, Smart Data Lake Builder is a very good choice for automating a Lakehouse, including on Databricks.
  • a declarative, configuration-driven approach to creating data pipelines. Metadata about data pipelines allows for efficient operations, maintenance and more business self-service.

Benefits of Smart Data Lake Builder

  • Cheaper implementation of data lakes
  • Increased productivity of data scientists
  • Higher level of self-service
  • Decreased operations and maintenance costs
  • Fully open source, no vendor lock-in

When should you consider using Smart Data Lake Builder?

Some common use cases include:

  • Building Data Lakes, drastically increasing productivity and usability
  • Data Apps - building complex data processing apps
  • DWH automation - reading and writing to relational databases via SQL
  • Data migration - Efficiently create one-time data pipelines
  • Data Catalog / Data Lineage - Generated automatically from metadata

See Features for a comprehensive list of Smart Data Lake Builder features.

How it works

The following diagram shows the core concepts:

[Diagram: Smart Data Lake Builder core concepts]

Data object

A data object defines the location and format of data. Some data objects require a connection to access remote data (e.g. a database connection).
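As an illustrative sketch, data objects are declared in HOCON configuration; the type names and attributes below follow the style of the Smart Data Lake Builder reference documentation, but names and paths are hypothetical and should be checked against the Reference:

```hocon
dataObjects {
  # a CSV file on the (Hadoop) filesystem; name and path are illustrative
  stg-airports {
    type = CsvFileDataObject
    path = "stg-airports"
  }
  # a Hive table as a second data object (illustrative)
  btl-airports {
    type = HiveTableDataObject
    table = { db = "default", name = "btl_airports" }
  }
}
```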

Action

The "data processors" are called actions. An action requires at least one input and one output data object. It reads the data from the input data object, processes it and writes it to the output data object. Many actions are predefined, e.g. transforming data from JSON to CSV, but you can also define your own custom transformer actions.
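Assuming two data objects stg-airports and btl-airports have been defined, a copy-style action could be sketched as follows (illustrative; attribute names follow the style of the reference documentation):

```hocon
actions {
  # copy data from one data object to another; the feed label groups
  # this action into a subgraph for execution (names are illustrative)
  copy-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = btl-airports
    metadata { feed = compute }
  }
}
```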

Feed

Actions connect different data objects and implicitly define a directed acyclic graph, as they model the dependencies needed to fill a data object. This automatically generated, arbitrarily complex data flow can be divided into feeds (subgraphs) for execution and monitoring.
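Feeds are typically selected when launching a run. The following is a hypothetical command-line sketch (jar name, config path and launcher class should be checked against the Getting Started and command-line documentation):

```shell
# run only the actions whose feed matches "compute" (illustrative paths)
java -cp smartdatalake.jar:config \
  io.smartdatalake.app.LocalSmartDataLakeBuilder \
  --feed-sel compute
```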

Configuration

All metadata, i.e. connections, data objects and actions, is defined in a central configuration file, usually called application.conf. The file format used is HOCON, which makes it easy to edit.
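Putting the pieces together, a minimal application.conf might look like the following sketch. All names, URLs and types are illustrative assumptions in the style of the reference documentation; HOCON additionally supports includes and substitutions for larger setups:

```hocon
connections {
  my-jdbc {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost/mydb"   # hypothetical database
    driver = org.postgresql.Driver
  }
}
dataObjects {
  src-customers {
    type = JdbcTableDataObject
    connectionId = my-jdbc          # references the connection above
    table = { name = "customers" }
  }
  stg-customers {
    type = CsvFileDataObject
    path = "stg-customers"
  }
}
actions {
  extract-customers {
    type = CopyAction
    inputId = src-customers
    outputId = stg-customers
    metadata { feed = extract }
  }
}
```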

Getting Started

To see how all this works in action, head over to the Getting Started page.

Major Contributors

SBB
www.sbb.ch: Provided the previously developed software as the foundation of the open source project

ELCA
www.elca.ch: Carried out the comprehensive revision and published it as an open source project

Additional Documentation

Getting Started
Reference
Architecture
Testing
Glossary
Troubleshooting
FAQ
Contributing
Running in the Public Cloud
