lensesio / fast-avro-write

License: Apache-2.0
Writing an Avro file is not always as fast as you might want. This library writes to an Avro file considerably faster.

fast-avro-write

A small library that parallelizes writes to an Avro file, achieving much better throughput.

How to use it:

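import org.apache.avro.file.CodecFactory
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
// FastDataFileWriterBuilder is provided by this library.

// schema, out, parallelization and records are assumed to be defined by the caller.
val datumWriter = new GenericDatumWriter[GenericRecord](schema)
val builder = FastDataFileWriterBuilder(datumWriter, out, schema)
    .withCodec(CodecFactory.snappyCodec())
    .withFlushOnEveryBlock(false)
    .withParallelization(parallelization)

// Use 4 MiB for both the encoder buffer and the block size.
builder.encoderFactory.configureBufferSize(4 * 1048576)
builder.encoderFactory.configureBlockSize(4 * 1048576)

val fileWriter = builder.build()
fileWriter.write(records)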

This writes all the records to the file. If the record count exceeds a threshold, the write is parallelized. The threshold is configurable: the write method takes a threshold parameter that has a default value. Simple!
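
To make the threshold idea concrete, here is a minimal conceptual sketch of threshold-based dispatch. It is not the library's implementation: encode is a hypothetical stand-in for Avro serialization, and the names ThresholdWriteSketch, write and the default threshold of 1000 are assumptions chosen for illustration only.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ThresholdWriteSketch {
  // Hypothetical stand-in for Avro serialization: one record becomes one UTF-8 line.
  def encode(record: String): Array[Byte] =
    (record + "\n").getBytes(StandardCharsets.UTF_8)

  // Encodes sequentially at or below the threshold, in parallel chunks above it.
  // Chunk results are appended in their original order, so the output bytes
  // are the same on both paths.
  def write(records: IndexedSeq[String],
            out: ByteArrayOutputStream,
            parallelization: Int,
            threshold: Int = 1000): Unit = {
    if (records.size <= threshold || parallelization <= 1) {
      records.foreach(r => out.write(encode(r)))
    } else {
      // Split the input into at most `parallelization` contiguous chunks.
      val chunkSize = math.max(1, math.ceil(records.size.toDouble / parallelization).toInt)
      val chunks = records.grouped(chunkSize).toVector
      // Encode every chunk into its own buffer on the global execution context.
      val encoded = chunks.map { chunk =>
        Future {
          val buffer = new ByteArrayOutputStream()
          chunk.foreach(r => buffer.write(encode(r)))
          buffer.toByteArray
        }
      }
      // Wait for all chunks, then write them out in order from a single thread.
      val all = Await.result(Future.sequence(encoded), 1.minute)
      all.foreach(bytes => out.write(bytes))
    }
  }
}
```

In this sketch only the encoding work runs in parallel; the final write to the output stays single-threaded so that record order is preserved.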

Blog article

http://www.landoop.com/blog/2017/05/fast-avro-write/

Release History

0.2 - [2017-09-18] Upgrade to Avro 1.8.2

0.1 - [2017-04-02] Initial release

Performance

The benchmarks were run on a machine with 8GB of RAM, an i7-4650U CPU and an SSD. Here is the class from which the GenericRecords are created:

case class StockQuote(symbol: String,
                      timestamp: Long,
                      ask: Double,
                      askSize: Int,
                      bid: Double,
                      bidSize: Int,
                      dayHigh: Double,
                      dayLow: Double,
                      lastTradeSize: Int,
                      lastTradeTime: Long,
                      open: Double,
                      previousClose: Double,
                      price: Double,
                      priceAvg50: Double,
                      priceAvg200: Double,
                      volume: Long,
                      yearHigh: Double,
                      yearLow: Double,
                      f1:String="value",
                      f2:String="value",
                      f3:String="value",
                      f4:String="value",
                      f5:String="value",
                      f6:String="value",
                      f7:String="value",
                      f8:String="value",
                      f9:String="value",
                      f10:String="value",
                      f11:String="value",
                      f12:String="value",
                      f13:String="value",
                      f14:String="value",
                      f15:String="value",
                      f16:String="value",
                      f17:String="value",
                      f18:String="value",
                      f19:String="value",
                      f20:String="value",
                      f21:String="value",
                      f22:String="value",
                      f23:String="value",
                      f24:String="value",
                      f25:String="value",
                      f26:String="value",
                      f27:String="value",
                      f28:String="value",
                      f29:String="value",
                      f30:String="value",
                      f31:String="value",
                      f32:String="value",
                      f33:String="value",
                      f34:String="value",
                      f35:String="value",
                      f36:String="value",
                      f37:String="value",
                      f38:String="value",
                      f39:String="value",
                      f40:String="value",
                      f41:String="value",
                      f42:String="value",
                      f43:String="value",
                      f44:String="value",
                      f45:String="value",
                      f46:String="value",
                      f47:String="value",
                      f48:String="value",
                      f49:String="value",
                      f50:String="value",
                      f51:String="value",
                      f52:String="value",
                      f53:String="value",
                      f54:String="value",
                      f55:String="value",
                      f56:String="value",
                      f57:String="value",
                      f58:String="value",
                      f59:String="value",
                      f60:String="value"
                     )
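
The conversion from the case class to a GenericRecord is not shown here. As a rough sketch only, assuming Apache Avro is on the classpath, a trimmed three-field version of the class could be mapped by hand as follows. The Quote class, the hand-written schema and the toRecord helper are hypothetical and are not taken from the benchmark code.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}

object StockQuoteRecordSketch {
  // Trimmed-down, hypothetical version of the benchmark class (three fields only).
  final case class Quote(symbol: String, timestamp: Long, price: Double)

  // Hand-written schema matching the trimmed class.
  val schema: Schema = SchemaBuilder.record("StockQuote").fields()
    .requiredString("symbol")
    .requiredLong("timestamp")
    .requiredDouble("price")
    .endRecord()

  // Copies each case-class field into the Avro field of the same name.
  // Primitives are boxed explicitly because the builder takes java.lang.Object values.
  def toRecord(q: Quote): GenericRecord =
    new GenericRecordBuilder(schema)
      .set("symbol", q.symbol)
      .set("timestamp", Long.box(q.timestamp))
      .set("price", Double.box(q.price))
      .build()
}
```

A sequence of such records is the kind of input that would be passed to fileWriter.write(records).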

For each record count, 10 runs were made sequentially and the min and max values were retained. All values are in milliseconds. For the fast writes, different parallelization factors were used; see p in the header.

| Record Count | Standard Min | Standard Max | Fast Min (p=8) | Fast Max (p=8) | Fast Min (p=4) | Fast Max (p=4) | Fast Min (p=6) | Fast Max (p=6) |
|---|---|---|---|---|---|---|---|---|
| 100K | 490 | 530 | 286 | 365 | 306 | 562 | 284 | 316 |
| 200K | 981 | 1097 | 570 | 692 | 545 | 783 | 586 | 777 |
| 500K | 2534 | 2755 | 1443 | 1575 | 1313 | 1607 | 1365 | 1402 |
| 1M | 5079 | 5322 | 2853 | 2948 | 2571 | 2820 | 2816 | 2984 |
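
The measurement procedure described above (a fixed number of sequential runs, keeping only the min and max in milliseconds) can be reproduced with a small harness. This is a sketch under assumptions, not the code used to produce the numbers; MinMaxTiming and minMaxMillis are hypothetical names.

```scala
object MinMaxTiming {
  // Runs `body` `runs` times sequentially and returns the
  // (min, max) elapsed wall-clock time in whole milliseconds.
  def minMaxMillis(runs: Int)(body: => Unit): (Long, Long) = {
    require(runs > 0, "runs must be positive")
    val timings = (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1000000L
    }
    (timings.min, timings.max)
  }
}
```

For example, MinMaxTiming.minMaxMillis(10) { fileWriter.write(records) } would time ten writes, provided each run targets a fresh output so that earlier runs do not affect later ones.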