apache / Gobblin

Licence: apache-2.0
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Programming Languages

Java, Shell, Python, JavaScript, CSS, HTML

Projects that are alternatives of or similar to Gobblin

sqstorage
An easy-to-use, quick way to organize your inventory, storages and storage areas
Stars: ✭ 18 (-99.1%)
Mutual labels:  management, apache
roxy-wi
Web interface for managing Haproxy, Nginx, Apache and Keepalived servers
Stars: ✭ 1,109 (-44.72%)
Mutual labels:  management, apache
Poiji
🍬 A tiny library converting excel rows to a list of Java objects based on Apache POI
Stars: ✭ 255 (-87.29%)
Mutual labels:  data, apache
Datacompy
Pandas and Spark DataFrame comparison for humans
Stars: ✭ 147 (-92.67%)
Mutual labels:  data
Azkarra Streams
🚀 Azkarra is a lightweight java framework to make it easy to develop, deploy and manage cloud-native streaming microservices based on Apache Kafka Streams.
Stars: ✭ 146 (-92.72%)
Mutual labels:  data
Ifarm
Admin management system with separated front end and back end. Back end: SpringBoot + Shiro + MyBatis + Redis; front end: Vue + ElementUI + Axios
Stars: ✭ 151 (-92.47%)
Mutual labels:  management
Seaweedfs
SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
Stars: ✭ 13,380 (+567%)
Mutual labels:  replication
Trytond
Mirror of trytond
Stars: ✭ 145 (-92.77%)
Mutual labels:  management
Correios
A client library for Brazilian Correios APIs and services (SIGEP & SRO).
Stars: ✭ 153 (-92.37%)
Mutual labels:  apache
Tera
An Internet-Scale Database.
Stars: ✭ 1,846 (-7.98%)
Mutual labels:  data
Audioowl
Fast and simple music and audio analysis using RNN in Python 🕵️‍♀️ 🥁
Stars: ✭ 151 (-92.47%)
Mutual labels:  data
Novosga
Customer service management system adaptable to large and small organizations.
Stars: ✭ 149 (-92.57%)
Mutual labels:  management
Immudb
immudb - world’s fastest immutable database, built on a zero trust model
Stars: ✭ 3,743 (+86.59%)
Mutual labels:  replication
App Dirs Rs
Put your Rust app's data in the right place on every platform
Stars: ✭ 147 (-92.67%)
Mutual labels:  data
Holiday Cn
📅🇨🇳 Chinese statutory holiday data, scraped automatically every day from State Council announcements
Stars: ✭ 157 (-92.17%)
Mutual labels:  data
Elephant Shed
PostgreSQL Management Appliance
Stars: ✭ 146 (-92.72%)
Mutual labels:  management
Anaconda Project
Tool for encapsulating, running, and reproducing data science projects
Stars: ✭ 153 (-92.37%)
Mutual labels:  data
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-92.52%)
Mutual labels:  apache
Pyfunctional
Python library for creating data pipelines with chain functional programming
Stars: ✭ 1,943 (-3.14%)
Mutual labels:  data
Geodev Hackerlabs
A place to learn how to build geo apps with the ArcGIS Platform.
Stars: ✭ 151 (-92.47%)
Mutual labels:  data

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)
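
To make the lifecycle-management capability more concrete, here is a minimal Python sketch of a time-based retention policy. It is not Gobblin's API or configuration format: the function name select_expired, the (path, modified_time) pairs, the example paths, and the 30-day window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def select_expired(versions, now, max_age_days=30):
    """Return the paths of dataset versions older than max_age_days.

    versions: iterable of (path, modified_time) pairs. This shape is
    hypothetical and is not a Gobblin data structure.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [path for path, modified in versions if modified < cutoff]


# Illustrative data only; the paths are made up.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
versions = [
    ("/data/events/2023-12-01", datetime(2023, 12, 1, tzinfo=timezone.utc)),
    ("/data/events/2024-01-15", datetime(2024, 1, 15, tzinfo=timezone.utc)),
]
expired = select_expired(versions, now)
print(expired)
```

A real retention job would then delete or archive the selected versions; the sketch stops at selection so the policy itself stays easy to test.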

Highlights

  • Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal and Verizon.
  • Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
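
The "state management for incremental processing" highlight can be illustrated with a small watermark sketch in Python. This is a generic pattern, not Gobblin's implementation: the function pull_incremental, the "watermark" state key, and the "offset" record field are hypothetical names introduced for this example.

```python
def pull_incremental(records, state, key="offset"):
    """Return records newer than the stored watermark, plus the updated state.

    state is a plain dict persisted between runs (an assumption of this
    sketch); an empty dict means no previous run.
    """
    low = state.get("watermark", -1)
    new = [r for r in records if r[key] > low]
    # If nothing new arrived, keep the previous watermark.
    high = max((r[key] for r in new), default=low)
    return new, {"watermark": high}


# First run: no prior state, so every record is pulled.
source = [{"offset": 0, "v": "a"}, {"offset": 1, "v": "b"}]
batch1, state = pull_incremental(source, {})

# Second run: only the record added since the last watermark is pulled.
source.append({"offset": 2, "v": "c"})
batch2, state = pull_incremental(source, state)
```

Carrying the watermark forward between runs is what lets each run process only the data that arrived since the previous one.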

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrate external vendor APIs (e.g. Salesforce, Dynamics etc.) with data stores (HDFS, Couchbase etc.)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. To skip tests and build the distribution, run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
  3. Alternatively, to run tests and build the distribution (requires Maven), run ./gradlew build. The distribution will be created in the build/gobblin-distribution/distributions directory.
