apache / Gobblin

Licence: apache-2.0

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Programming Languages

Java, Shell, Python, JavaScript, CSS, HTML

Projects that are alternatives of or similar to Gobblin

sqstorage

An easy-to-use, quick way to organize your inventory, storages and storage areas

Stars: ✭ 18 (-99.1%)

Mutual labels: management, apache

roxy-wi

Web interface for managing Haproxy, Nginx, Apache and Keepalived servers

Stars: ✭ 1,109 (-44.72%)

Mutual labels: management, apache

Poiji

🍬 A tiny library converting excel rows to a list of Java objects based on Apache POI

Stars: ✭ 255 (-87.29%)

Mutual labels: data, apache

Datacompy

Pandas and Spark DataFrame comparison for humans

Stars: ✭ 147 (-92.67%)

Mutual labels: data

Azkarra Streams

🚀 Azkarra is a lightweight java framework to make it easy to develop, deploy and manage cloud-native streaming microservices based on Apache Kafka Streams.

Stars: ✭ 146 (-92.72%)

Mutual labels: data

Ifarm

Back-office management system with separated front end and back end; back end: SpringBoot + Shiro + MyBatis + Redis; front end: Vue + ElementUI + Axios

Stars: ✭ 151 (-92.47%)

Mutual labels: management

Seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

Stars: ✭ 13,380 (+567%)

Mutual labels: replication

Trytond

Mirror of trytond

Stars: ✭ 145 (-92.77%)

Mutual labels: management

Correios

A client library for Brazilian Correios APIs and services (SIGEP & SRO).

Stars: ✭ 153 (-92.37%)

Mutual labels: apache

Tera

An Internet-Scale Database.

Stars: ✭ 1,846 (-7.98%)

Mutual labels: data

Audioowl

Fast and simple music and audio analysis using RNN in Python 🕵️‍♀️ 🥁

Stars: ✭ 151 (-92.47%)

Mutual labels: data

Novosga

Customer service queue management system adaptable to large and small organizations.

Stars: ✭ 149 (-92.57%)

Mutual labels: management

Immudb

immudb - world’s fastest immutable database, built on a zero trust model

Stars: ✭ 3,743 (+86.59%)

Mutual labels: replication

App Dirs Rs

Put your Rust app's data in the right place on every platform

Stars: ✭ 147 (-92.67%)

Mutual labels: data

Holiday Cn

📅🇨🇳 Chinese statutory public holiday data, scraped automatically every day from State Council announcements

Stars: ✭ 157 (-92.17%)

Mutual labels: data

Elephant Shed

PostgreSQL Management Appliance

Stars: ✭ 146 (-92.72%)

Mutual labels: management

Anaconda Project

Tool for encapsulating, running, and reproducing data science projects

Stars: ✭ 153 (-92.37%)

Mutual labels: data

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (-92.52%)

Mutual labels: apache

Pyfunctional

Python library for creating data pipelines with chain functional programming

Stars: ✭ 1,943 (-3.14%)

Mutual labels: data

Geodev Hackerlabs

A place to learn how to build geo apps with the ArcGIS Platform.

Stars: ✭ 151 (-92.47%)

Mutual labels: data

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
Data Organization within the lake (e.g. compaction, partitioning, deduplication)
Lifecycle Management of data within the lake (e.g. data retention)
Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)
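
The lifecycle management capability listed above (data retention) can be pictured with a minimal sketch. This is not Gobblin's retention API: the class RetentionSketch, the DatasetVersion type, the paths and the seven-day window are all hypothetical names and values chosen for illustration. The sketch only shows the core idea of a time-based policy, which is to pick the dataset versions older than a cutoff as deletion candidates.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RetentionSketch {

    /** Hypothetical description of one dataset version; not a Gobblin type. */
    public static final class DatasetVersion {
        public final String path;
        public final Instant createdAt;

        public DatasetVersion(String path, Instant createdAt) {
            this.path = path;
            this.createdAt = createdAt;
        }
    }

    /** Returns the versions created strictly before (now - retention). */
    public static List<DatasetVersion> selectExpired(
            List<DatasetVersion> versions, Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        List<DatasetVersion> expired = new ArrayList<>();
        for (DatasetVersion v : versions) {
            if (v.createdAt.isBefore(cutoff)) {
                expired.add(v);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-31T00:00:00Z");
        List<DatasetVersion> versions = Arrays.asList(
                new DatasetVersion("/data/events/2024-01-01", Instant.parse("2024-01-01T00:00:00Z")),
                new DatasetVersion("/data/events/2024-01-30", Instant.parse("2024-01-30T00:00:00Z")));
        // A real job would delete these paths; the sketch only reports them.
        for (DatasetVersion v : selectExpired(versions, Duration.ofDays(7), now)) {
            System.out.println("would delete: " + v.path);
        }
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the selection deterministic and easy to test.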

Highlights

Battle-tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
Supports stream and batch execution modes
Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
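
State management for incremental processing, mentioned in the highlights above, generally means remembering how far the previous run got and resuming from that point. The following is a minimal sketch of that idea, assuming numeric offsets and an in-memory watermark. WatermarkSketch and runOnce are hypothetical names, not Gobblin classes, and a real deployment would keep the watermark in a durable state store rather than in a field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WatermarkSketch {

    // Highest offset that a previous run has already processed (assumed in-memory here).
    private long watermark;

    public WatermarkSketch(long initialWatermark) {
        this.watermark = initialWatermark;
    }

    public long getWatermark() {
        return watermark;
    }

    /** Pulls only offsets above the stored watermark, then advances the watermark. */
    public List<Long> runOnce(List<Long> sourceOffsets) {
        List<Long> pulled = new ArrayList<>();
        long max = watermark;
        for (long offset : sourceOffsets) {
            if (offset > watermark) {
                pulled.add(offset);
                max = Math.max(max, offset);
            }
        }
        // Advance the watermark only after the whole batch has been collected.
        watermark = max;
        return pulled;
    }

    public static void main(String[] args) {
        WatermarkSketch job = new WatermarkSketch(0L);
        System.out.println("first run:  " + job.runOnce(Arrays.asList(1L, 2L, 3L)));
        System.out.println("second run: " + job.runOnce(Arrays.asList(1L, 2L, 3L, 4L)));
        System.out.println("watermark:  " + job.getWatermark());
    }
}
```

The second run sees the same source plus one new record and pulls only the new one, which is the property that makes repeated runs incremental rather than full reloads.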

Common Patterns used in production

Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive etc.
A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.

Requirements

Java >= 1.8

If building the distribution with tests turned on:

Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

Extract the archive file to your local directory.
Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

Extract the archive file to your local directory.
Either skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
Or run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the build/gobblin-distribution/distributions directory.
