AbsaOSS / Abris

Licence: apache-2.0
Avro SerDe for Apache Spark structured APIs.

Projects that are alternatives of or similar to Abris

Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+460%)
Mutual labels:  kafka, spark, avro
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-29.23%)
Mutual labels:  kafka, spark
Kaufmann ex
Kafka backed service library.
Stars: ✭ 86 (-33.85%)
Mutual labels:  kafka, avro
Bigdata Notes
A getting-started guide to big data ⭐
Stars: ✭ 10,991 (+8354.62%)
Mutual labels:  kafka, spark
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-55.38%)
Mutual labels:  spark, avro
Thingsboard
Open-source IoT Platform - Device management, data collection, processing and visualization.
Stars: ✭ 10,526 (+7996.92%)
Mutual labels:  kafka, spark
Logisland
Scalable stream processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time-series analysis. A large set of valuable, ready-to-use processors, data sources and sinks is available.
Stars: ✭ 97 (-25.38%)
Mutual labels:  kafka, spark
Delta Architecture
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Stars: ✭ 43 (-66.92%)
Mutual labels:  kafka, spark
Schema Registry
A CLI and Go client for Kafka Schema Registry
Stars: ✭ 105 (-19.23%)
Mutual labels:  kafka, avro
Seldon Server
Machine Learning Platform and Recommendation Engine built on Kubernetes
Stars: ✭ 1,435 (+1003.85%)
Mutual labels:  kafka, spark
Flink Learning
Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink fundamentals, concepts, principles, hands-on practice, performance tuning, and source-code analysis. Includes learning examples for Flink Connectors, Metrics, Libraries, the DataStream API, and the Table API & SQL, as well as large production case studies (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Big Data Real-Time Computing Engine Flink in Practice and Performance Optimization".
Stars: ✭ 11,378 (+8652.31%)
Mutual labels:  kafka, spark
Model Serving Tutorial
Code and presentation for Strata Model Serving tutorial
Stars: ✭ 57 (-56.15%)
Mutual labels:  kafka, spark
Awesome Recommendation Engine
The purpose of this tiny project is to put things together with the know-how I learned from the Big Data Expert course at formacionhadoop.com. The idea is to show how to play with Apache Spark Streaming, Kafka, Mongo, and Spark machine learning algorithms.
Stars: ✭ 47 (-63.85%)
Mutual labels:  kafka, spark
Open Bank Mark
A bank simulation application, written mainly in Clojure, which can be used for end-to-end testing and to show some graphs.
Stars: ✭ 81 (-37.69%)
Mutual labels:  kafka, avro
Examples
Demo applications and code examples for Confluent Platform and Apache Kafka
Stars: ✭ 571 (+339.23%)
Mutual labels:  kafka, avro
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (-25.38%)
Mutual labels:  spark, avro
Example Spark Kafka
Apache Spark and Apache Kafka integration example
Stars: ✭ 120 (-7.69%)
Mutual labels:  kafka, spark
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-71.54%)
Mutual labels:  kafka, spark
Go Kafka Avro
A library providing a consumer/producer to work with Kafka, Avro and Schema Registry
Stars: ✭ 39 (-70%)
Mutual labels:  kafka, avro
Bigdata Notebook
Stars: ✭ 100 (-23.08%)
Mutual labels:  kafka, spark

ABRiS - Avro Bridge for Spark

  • Pain free Spark/Avro integration.

  • Seamlessly convert your Avro records from anywhere (e.g. Kafka, Parquet, HDFS, etc) into Spark Rows.

  • Convert your Dataframes into Avro records without even specifying a schema.

  • Seamlessly integrate with Confluent platform, including Schema Registry with all available naming strategies and schema evolution.

  • Convert back and forth with Spark's built-in Avro support (since Spark 2.4).

Coordinates for Maven POM dependency

Scala 2.11: Abris on Maven Central
Scala 2.12: Abris on Maven Central
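
For example, with sbt the dependency is typically declared as below; the group and artifact ids follow ABRiS's published coordinates, while the version shown is only illustrative, so pick the latest 4.x release from Maven Central:

// %% appends the Scala binary version, resolving to abris_2.11 or abris_2.12
libraryDependencies += "za.co.absa" %% "abris" % "4.2.0"  // illustrative version, check Maven Central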

Supported Spark versions

On Spark 3.0.x and 2.4.x, ABRiS should work without any further requirements.

On Spark 2.3.x you must declare a dependency on org.apache.avro:avro:1.8.0 or higher. (Spark 2.3.x ships with Avro 1.7.x, which you must override because ABRiS requires Avro 1.8.0+.)
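
With sbt, for instance, this could be done with a dependency override (1.8.2 below is just one Avro 1.8.0+ version):

// Force a newer Avro than the 1.7.x pulled in by Spark 2.3.x
dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"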

Older Versions

This is documentation for Abris version 4. Documentation for version 3 is located in branch-3.2.

Usage

In its most basic form, the ABRiS API is almost identical to Spark's built-in Avro support, but it provides additional functionality: chiefly Schema Registry support and seamless integration with the Confluent Avro data format.

The API consists of two Spark SQL expressions (to_avro and from_avro) and a fluent configurator (AbrisConfig).

Using the configurator you can choose from four basic config types:

  • toSimpleAvro, toConfluentAvro, fromSimpleAvro and fromConfluentAvro

and then configure what you want to do, mainly how to obtain the Avro schema.

Example of usage:

// Reader config: download the latest schema registered under the topic name strategy for "topic123"
val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry("http://localhost:8081")

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
val deserialized = dataFrame.select(from_avro(col("value"), abrisConfig) as 'data)
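
For the opposite direction, a minimal writer-side sketch along the same lines; downloadSchemaByLatestVersion is assumed here to be the toConfluentAvro counterpart of the reader method above, so check AbrisConfig for the exact names available in your version:

// Writer config: serialize the column "data" to Confluent Avro using the latest registered schema (sketch)
val toAvroConfig = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion   // assumed counterpart of downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry("http://localhost:8081")

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.to_avro
val serialized = dataFrame.select(to_avro(col("data"), toAvroConfig) as 'value)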

Detailed instructions for many use cases are provided in separate documents.

Full runnable examples can be found in the za.co.absa.abris.examples package. You can also take a look at unit tests in package za.co.absa.abris.avro.sql.

IMPORTANT: Spark dependencies have provided scope in the pom.xml, so when running the examples, please make sure that you either instruct your IDE to include dependencies with provided scope, or change the scope directly.

Confluent Avro format

The format of Avro binary data is defined in the Avro specification. The Confluent format extends it by prepending the schema id to the actual record. The Confluent expressions in this library expect this format: they add the id after the Avro data is generated and strip it before the data is parsed.
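
For illustration, a minimal sketch (plain Scala, not part of the ABRiS API) of reading the schema id out of a Confluent-framed record, assuming the standard wire format of one magic byte followed by a 4-byte big-endian schema id and then the Avro payload:

import java.nio.ByteBuffer

// Confluent framing: [0] magic byte (0), [1..4] schema id (big-endian int), [5..] Avro binary payload
def schemaIdOf(confluentRecord: Array[Byte]): Int = {
  require(confluentRecord.length >= 5 && confluentRecord(0) == 0, "not a Confluent-framed Avro record")
  ByteBuffer.wrap(confluentRecord, 1, 4).getInt
}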

You can find more about Confluent and Schema Registry in Confluent documentation.

Schema Registry security and other additional settings

The only mandatory Schema Registry client setting is the URL, but if you need to provide more, the configurator allows you to pass a whole map.

For example you may want to provide basic.auth.user.info and basic.auth.credentials.source, which are required for user authentication. You can do it this way:

val registryConfig = Map(
  AbrisConfig.SCHEMA_REGISTRY_URL -> "http://localhost:8081",
  "basic.auth.credentials.source" -> "USER_INFO",
  "basic.auth.user.info" -> "srkey:srvalue"
)

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry(registryConfig) // use the map instead of just url

Other Features

Generating Avro schema from Spark data frame column

There is a helper method that allows you to generate an Avro schema automatically from a Spark column. Assuming you have a DataFrame containing a column "input", you can generate a schema for the data in that column like this:

val schema = AvroSchemaUtils.toAvroSchema(dataFrame, "input")
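
A minimal sketch of assembling such a column and generating its schema; the field names are illustrative and the AvroSchemaUtils import path may differ between ABRiS versions:

import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils  // import path assumed, verify for your version

// Wrap the fields to be serialized into a single struct column named "input"
val withInput = dataFrame.select(struct(col("name"), col("amount")) as "input")
val schema = AvroSchemaUtils.toAvroSchema(withInput, "input")
println(schema.toString(true))  // pretty-printed Avro schema JSON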

Using schema manager to directly download or register schema

You can use SchemaManager directly to perform operations against the Schema Registry. Its configuration is identical to that of the Schema Registry client; the SchemaManager is just a wrapper around the client, providing helpful methods and abstractions.

val schemaRegistryClientConfig = Map( ...configuration... )
val schemaManager = SchemaManagerFactory.create(schemaRegistryClientConfig)

// Downloading schema:
val schema = schemaManager.getSchemaById(42)

// Registering schema:
val schemaString = "{...avro schema json...}"
val subject = SchemaSubject.usingTopicNameStrategy("fooTopic")
val schemaId = schemaManager.register(subject, schemaString)

// and more, check SchemaManager's methods

Data Conversions

This library also provides convenient methods to convert between Avro and Spark schemas.

If you have an Avro schema which you want to convert into a Spark SQL one - to generate your Dataframes, for instance - you can do as follows:

val avroSchema: Schema = AvroSchemaUtils.load("path_to_avro_schema")
val sqlSchema: StructType = SparkAvroConversions.toSqlType(avroSchema) 

You can also do the inverse operation by running:

import org.apache.spark.sql.types._
val sqlSchema = StructType(Seq(StructField("name", StringType), StructField("amount", DoubleType)))  // illustrative fields
val avroSchema = SparkAvroConversions.toAvroSchema(sqlSchema, avro_schema_name, avro_schema_namespace)

Multiple schemas in one topic

The naming strategies RecordName and TopicRecordName allow one topic to receive different payloads, i.e. payloads containing different schemas that do not have to be compatible, as explained in the Confluent documentation.

When you read such data from Kafka, it will be stored as a binary column in a DataFrame, but once you convert it to Spark types it cannot live in a single DataFrame, because all rows in a DataFrame must have the same schema.

So if you have multiple incompatible types of Avro data in a DataFrame, you must first split them into several DataFrames, one for each schema. Then you can use ABRiS to convert the Avro data, as in the sketch below.
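
A minimal sketch of such a split, assuming a Kafka source DataFrame named kafkaDf, an illustrative schema id, and the standard Confluent framing (one magic byte followed by a 4-byte big-endian schema id); downloadReaderSchemaById is assumed to be the id-based counterpart of the configurator method shown earlier, so check AbrisConfig for the exact name in your version:

import org.apache.spark.sql.functions.{col, conv, hex, substring}
import za.co.absa.abris.avro.functions.from_avro

// Extract the schema id from bytes 2-5 of the Confluent-framed value (byte 1 is the magic byte)
val withSchemaId = kafkaDf.withColumn(
  "schemaId", conv(hex(substring(col("value"), 2, 4)), 16, 10).cast("int"))

// One DataFrame per schema id (42 is illustrative), then decode each subset with its own reader config
val payloads42 = withSchemaId.filter(col("schemaId") === 42)
val config42 = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaById(42)   // assumed id-based variant, verify against AbrisConfig
  .usingSchemaRegistry("http://localhost:8081")
val decoded42 = payloads42.select(from_avro(col("value"), config42) as 'data)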

Avro Fixed type

Fixed is an alternative way of encoding binary data in Avro. Unlike the bytes type, the fixed type doesn't store the length of the data in the payload, but in the Avro schema itself.

The corresponding data type in Spark is BinaryType, but the inferred schema will always use the bytes type for this kind of data. If you want to use the fixed type, you must provide the appropriate Avro schema.
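
For reference, a minimal sketch of an Avro schema that declares a fixed field (record and field names are illustrative); parsing it gives you a schema you can pass to the reader or writer configuration:

// A 16-byte fixed field; with the bytes type there would be no "size" attribute and the length would travel in the payload
val fixedSchemaJson =
  """{
    |  "type": "record", "name": "FileInfo",
    |  "fields": [
    |    {"name": "md5", "type": {"type": "fixed", "name": "Md5Sum", "size": 16}}
    |  ]
    |}""".stripMargin
val fixedSchema = new org.apache.avro.Schema.Parser().parse(fixedSchemaJson)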


Copyright 2018 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.