
InterestingLab / Waterdrop

Licence: apache-2.0
Production-ready data integration product.


seatunnel


SeaTunnel was formerly named Waterdrop and was renamed SeaTunnel on October 12, 2021.


SeaTunnel is an easy-to-use, high-performance distributed data integration platform that supports real-time synchronization of massive data. It can stably and efficiently synchronize tens of billions of records per day, and it is used in production at nearly 100 companies.

Why do we need SeaTunnel

SeaTunnel aims to solve the problems commonly encountered when synchronizing massive data:

  • Data loss and duplication
  • Task accumulation and delay
  • Low throughput
  • Long lead time before going into production
  • Lack of application running status monitoring

SeaTunnel use scenarios

  • Mass data synchronization
  • Mass data integration
  • ETL with massive data
  • Mass data aggregation
  • Multi-source data processing

Features of SeaTunnel  

  • Easy to use, flexible configuration, low code development
  • Real-time streaming
  • Offline multi-source data analysis
  • High-performance, massive data processing capabilities
  • Modular and plug-in mechanism, easy to extend
  • Supports data processing and aggregation with SQL
  • Supports Spark Structured Streaming
  • Supports Spark 2.x

Workflow of SeaTunnel


Input[Data Source Input] -> Filter[Data Processing] -> Output[Result Output]  

The data processing pipeline is made up of multiple filters, which can be combined to meet a variety of data processing needs. If you are used to SQL, you can also build a data processing pipeline directly in SQL, which is simple and efficient. The list of filters supported by SeaTunnel is still being expanded. You can also develop your own data processing plug-ins, because the whole system is easy to extend.
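
The Input -> Filter -> Output flow described above can be illustrated with a small, framework-independent Python sketch. This is not SeaTunnel's API: the function names (`fake_input`, `split_filter`, `uppercase_filter`, `run_pipeline`, `stdout_output`) and the record layout are assumptions chosen only to show how a chain of filters sits between an input and an output.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Filter = Callable[[Record], Record]


def fake_input() -> List[Record]:
    """Hypothetical input stage: produce a few raw records."""
    return [{"raw": "alice,beijing"}, {"raw": "bob,shanghai"}]


def split_filter(record: Record) -> Record:
    """Hypothetical filter: split the 'raw' field into 'name' and 'city'."""
    name, city = record["raw"].split(",", 1)
    return {**record, "name": name, "city": city}


def uppercase_filter(record: Record) -> Record:
    """Hypothetical filter: upper-case the 'city' field."""
    return {**record, "city": record["city"].upper()}


def run_pipeline(source: Iterable[Record], filters: List[Filter]) -> List[Record]:
    """Apply every filter, in order, to each record from the source."""
    results = []
    for record in source:
        for apply_filter in filters:
            record = apply_filter(record)
        results.append(record)
    return results


def stdout_output(records: Iterable[Record]) -> None:
    """Hypothetical output stage: print each processed record."""
    for record in records:
        print(record)


processed = run_pipeline(fake_input(), [split_filter, uppercase_filter])
stdout_output(processed)
```

Because each filter takes a record and returns a new one, filters can be reordered or added without touching the input or output stages, which is the property the pipeline description relies on.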

Plugins supported by SeaTunnel  

  • Input plugins: Fake, File, Hdfs, Kafka, S3, Socket, and self-developed Input plugins

  • Filter plugins: Add, Checksum, Convert, Date, Drop, Grok, Json, Kv, Lowercase, Remove, Rename, Repartition, Replace, Sample, Split, Sql, Table, Truncate, Uppercase, Uuid, and self-developed Filter plugins

  • Output plugins: Elasticsearch, File, Hdfs, Jdbc, Kafka, Mysql, S3, Stdout, and self-developed Output plugins
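
To make the idea of built-in and self-developed plugins concrete, here is a minimal Python sketch of a name-based plugin registry. It does not reproduce SeaTunnel's actual plugin interface; `FILTER_PLUGINS`, `register_filter`, `apply_filters`, and the `mask_email` plugin are all hypothetical names used for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
FilterPlugin = Callable[[Record], Record]

# Hypothetical registry mapping a plugin name to its implementation.
FILTER_PLUGINS: Dict[str, FilterPlugin] = {}


def register_filter(name: str) -> Callable[[FilterPlugin], FilterPlugin]:
    """Return a decorator that registers a filter under the given name."""
    def decorator(func: FilterPlugin) -> FilterPlugin:
        FILTER_PLUGINS[name] = func
        return func
    return decorator


@register_filter("lowercase")
def lowercase(record: Record) -> Record:
    """Stand-in for a built-in filter: lower-case every field value."""
    return {key: value.lower() for key, value in record.items()}


@register_filter("mask_email")
def mask_email(record: Record) -> Record:
    """Stand-in for a self-developed filter: hide most of the email user part."""
    masked = dict(record)
    if "email" in masked:
        user, _, domain = masked["email"].partition("@")
        masked["email"] = user[:1] + "***@" + domain
    return masked


def apply_filters(record: Record, names: List[str]) -> Record:
    """Look up each named plugin in the registry and apply it in order."""
    for name in names:
        record = FILTER_PLUGINS[name](record)
    return record


result = apply_filters({"email": "Alice@Example.com"}, ["lowercase", "mask_email"])
print(result)
```

In this sketch a self-developed filter is registered in exactly the same way as a built-in one, so a pipeline definition only needs to refer to plugins by name.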

Environmental dependency

  1. Java runtime environment, Java 8 or later

  2. To run SeaTunnel in a cluster environment, either of the following Spark cluster environments can be used:

  • Spark on Yarn
  • Spark Standalone

If the data volume is small, or you only want to verify functionality, you can also start SeaTunnel in local mode without a cluster environment, because SeaTunnel supports standalone operation. Note: SeaTunnel 2.0 supports running on both Spark and Flink.

Downloads

Download the ready-to-run release package: https://github.com/InterestingLab/SeaTunnel/releases

Quick start

Quick start: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/quick-start

Detailed documentation on SeaTunnel: https://interestinglab.github.io/seatunnel-docs/#/

Application practice cases

  • Weibo, Value-added Business Department Data Platform

Weibo uses an internally customized version of SeaTunnel, together with its sub-project Guardian for monitoring SeaTunnel on Yarn tasks, to run hundreds of real-time streaming computing tasks.

  • Sina, Big Data Operation Analysis Platform

Sina Data Operation Analysis Platform uses SeaTunnel to perform real-time and offline analysis of operation and maintenance data for Sina News, CDN and other services, and writes the results to ClickHouse.

  • Sogou, Sogou Qiqian System

Sogou Qiqian System uses SeaTunnel as an ETL tool to help build a real-time data warehouse system.

  • Qutoutiao, Qutoutiao Data Center

Qutoutiao Data Center uses SeaTunnel to support MySQL-to-Hive offline ETL tasks and real-time Hive-to-ClickHouse backfill, covering most of its offline and real-time task needs.

  • Yixia Technology, Yizhibo Data Platform

  • Yonghui Superstores Founders' Alliance-Yonghui Yunchuang Technology, Member E-commerce Data Analysis Platform

SeaTunnel provides real-time streaming and offline SQL computing of e-commerce user behavior data for Yonghui Life, a new retail brand of Yonghui Yunchuang Technology.

  • Shuidichou, Data Platform

Shuidichou uses SeaTunnel for real-time streaming and regular offline batch processing on Yarn, processing 3 to 4 TB of data per day on average and then writing the data to ClickHouse.

For more use cases, please refer to: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/case_study/

Contribute ideas and code

Submit issues and advice: https://github.com/InterestingLab/SeaTunnel/issues

Contribute code: https://github.com/InterestingLab/SeaTunnel/pulls

Developer

Thanks to all developers: https://github.com/InterestingLab/SeaTunnel/graphs/contributors
