All Projects → delta-io → kafka-delta-ingest

delta-io / kafka-delta-ingest

Licence: Apache-2.0 license
A highly efficient daemon for streaming data from Kafka into Delta Lake

Programming Languages

rust
11053 projects
shell
77523 projects

Projects that are alternatives of or similar to kafka-delta-ingest

Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (-62.59%)
Mutual labels:  delta, deltalake
DeltaUI
SwiftUI + CoreData user interface for DeltaCore & Friends.
Stars: ✭ 61 (-56.12%)
Mutual labels:  delta
deltaq
Fast and portable delta encoding for .NET in 100% safe, managed code.
Stars: ✭ 26 (-81.29%)
Mutual labels:  delta
smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Stars: ✭ 79 (-43.17%)
Mutual labels:  deltalake
dipa
dipa makes it easy to efficiently delta encode large Rust data structures.
Stars: ✭ 243 (+74.82%)
Mutual labels:  delta
Delta
A syntax-highlighting pager for git, diff, and grep output
Stars: ✭ 11,555 (+8212.95%)
Mutual labels:  delta
Jsondiffpatch
Diff & patch JavaScript objects
Stars: ✭ 3,951 (+2742.45%)
Mutual labels:  delta
modmarg
Calculating Marginal Effects and Levels with Errors Using the Delta Method
Stars: ✭ 15 (-89.21%)
Mutual labels:  delta
ferryd
Fast, safe and reliable transit for the delivery of software updates to users.
Stars: ✭ 43 (-69.06%)
Mutual labels:  delta
graph-crdt
Commutative graphs made for real-time, offline-tolerant replication
Stars: ✭ 47 (-66.19%)
Mutual labels:  delta
ADE9078-3PhaseWattmeter
An Isolated design for a demo board using the Analog Devices ADE9078 3 phase AC wattmeter. Design allows both WYE (STAR) and Delta (TRIANGLE) distributions to be measured along with Blondel and non-Blondel measurement schemes. The project includes a SPI based Arduino style library.
Stars: ✭ 24 (-82.73%)
Mutual labels:  delta
tradingconv
Convert trading history of cryptocurrency platforms
Stars: ✭ 24 (-82.73%)
Mutual labels:  delta
delta-lake-internals
The Internals of Delta Lake
Stars: ✭ 108 (-22.3%)
Mutual labels:  deltalake
PysparkCheatsheet
PySpark Cheatsheet
Stars: ✭ 25 (-82.01%)
Mutual labels:  deltalake

kafka-delta-ingest

The kafka-delta-ingest project aims to build a highly efficient daemon for streaming data through Apache Kafka into Delta Lake.

This project is currently highly experimental and evolving in tandem with the delta-rs bindings.

Features

  • Multiple worker processes per stream

  • Basic transformations within message

  • Statsd metric output

See the design doc for more details.

Example

The repository includes an example for trying out the application locally with some fake web request data.

The included docker-compose.yml contains kafka and localstack services you can run kafka-delta-ingest against locally.

Starting Worker Processes

  1. Launch test services - docker-compose up setup

  2. Compile: cargo build

  3. Run kafka-delta-ingest against the web_requests example topic and table (customize arguments as desired):

export AWS_ENDPOINT_URL=http://0.0.0.0:4566
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test

RUST_LOG=debug cargo run ingest web_requests ./tests/data/web_requests \
  -l 60 \
  -a web_requests \
  -t 'date: substr(meta.producer.timestamp, `0`, `10`)' \
      'meta.kafka.offset: kafka.offset' \
      'meta.kafka.partition: kafka.partition' \
      'meta.kafka.topic: kafka.topic' \
  -o earliest

Notes: * The AWS_* environment variables are for S3 and are required by the delta-rs library. ** Above, AWS_ENDPOINT_URL points to localstack. * The Kafka broker is assumed to be at localhost:9092, use -k to override. * To clean data from previous local runs, execute ./bin/clean-example-data.sh. You’ll need to do this if you destroy your Kafka container between runs since your delta log directory will be out of sync with Kafka offsets.

Example Data

A tarball containing 100K line-delimited JSON messages is included in tests/json/web_requests-100K.json.tar.gz. Running ./bin/extract-example-json.sh will unpack this to the expected location.

Pretty-printed example from the file
{
  "meta": {
    "producer": {
      "timestamp": "2021-03-24T15:06:17.321710+00:00"
    }
  },
  "method": "DELETE",
  "session_id": "7c28bcf9-be26-4d0b-931a-3374ab4bb458",
  "status": 204,
  "url": "http://www.youku.com",
  "uuid": "831c6afa-375c-4988-b248-096f9ed101f8"
}

After extracting the example data, you’ll need to play this onto the web_requests topic of the local Kafka container.

Note
URLs sampled for the test data are sourced from Wikipedia’s list of most popular websites - https://en.wikipedia.org/wiki/List_of_most_popular_websites.
Inspect example output
  • List data files - ls tests/data/web_requests/date=2021-03-24

  • List delta log files - ls tests/data/web_requests/_delta_log

  • Show some parquet data (using parquet-tools)

    • parquet-tools show tests/data/web_requests/date=2021-03-24/<some file written by your example>

Kafka SSL

In case you have Kafka topics secured by SSL client certificates, you can specify these secrets as environment variables.

For the cert chain include the PEM content as an environment variable named KAFKA_DELTA_INGEST_CERT. For the cert private key include the PEM content as an environment variable named KAFKA_DELTA_INGEST_KEY.

These will be set as the ssl.certificate.pem and ssl.key.pem Kafka settings respectively.

Make sure to provide the additional option:

-K security.protocol=SSL

when invoking the cli command as well.

Developing

Make sure the docker-compose setup has been ran, and execute cargo test to run unit and integration tests.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].