
Netflix / Iceberg

Licence: apache-2.0
Iceberg is a table format for large, slow-moving tabular data

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Iceberg

Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+3.31%)
Mutual labels:  spark, hadoop, avro, parquet
Bigdata Playground
A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (-54.96%)
Mutual labels:  hadoop, avro, parquet
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Stars: ✭ 97 (-75.32%)
Mutual labels:  spark, avro, parquet
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+317.81%)
Mutual labels:  spark, hadoop, parquet
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-85.24%)
Mutual labels:  spark, avro, parquet
DaFlow
Apache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (-93.89%)
Mutual labels:  hadoop, avro, parquet
kafka-compose
🎼 Docker compose files for various kafka stacks
Stars: ✭ 32 (-91.86%)
Mutual labels:  spark, avro
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large for a browser to render the heatmap directly, the heatmap-drawing step is moved offline. Apache Spark computes the data in parallel and then renders the heatmap; leaflet.js loads an OpenStreetMap layer plus the heatmap layer for a responsive interactive result. With the current Spark-based rendering, the parallel computation is actually slower than a single machine, possibly because Spark is not well suited to this kind of computation or the algorithm is poorly designed. The Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-96.69%)
Mutual labels:  spark, hadoop
Oap
Optimized Analytics Package for Spark* Platform
Stars: ✭ 343 (-12.72%)
Mutual labels:  spark, parquet
confluent-spark-avro
Spark UDFs to deserialize Avro messages with schemas stored in Schema Registry.
Stars: ✭ 18 (-95.42%)
Mutual labels:  spark, avro
experiments
Code examples for my blog posts
Stars: ✭ 21 (-94.66%)
Mutual labels:  spark, parquet
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-71.76%)
Mutual labels:  spark, hadoop
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-93.64%)
Mutual labels:  spark, hadoop
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (-90.08%)
Mutual labels:  spark, hadoop
spark-util
low-level helpers for Apache Spark libraries and tests
Stars: ✭ 16 (-95.93%)
Mutual labels:  spark, hadoop
BigData-News
基于Spark2.2新闻网大数据实时系统项目
Stars: ✭ 36 (-90.84%)
Mutual labels:  spark, hadoop
swordfish
Open-source distribute workflow schedule tools, also support streaming task.
Stars: ✭ 35 (-91.09%)
Mutual labels:  spark, hadoop
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-96.44%)
Mutual labels:  spark, hadoop
Spline
Data Lineage Tracking And Visualization Solution
Stars: ✭ 306 (-22.14%)
Mutual labels:  spark, hadoop
Elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
Stars: ✭ 298 (-24.17%)
Mutual labels:  spark, hadoop

Iceberg has moved! It has been donated to the Apache Software Foundation.

Please use the new Apache mailing lists, site, and repository:

Iceberg is a new table format for storing large, slow-moving tabular data. It is designed to improve on the de-facto standard table layout built into Hive, Presto, and Spark.

Status

Iceberg is under active development at Netflix.

The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on integrating Iceberg into Spark and Presto.

The Iceberg format specification is being actively updated and is open for comment. Until the specification is complete and released, it carries no compatibility guarantees. The spec is currently evolving as the Java reference implementation changes.

Java API javadocs are available for the 0.3.0 (latest) release.

Collaboration

We welcome collaboration on both the Iceberg library and specification. The draft spec is open for comments.

For other discussion, please use the Iceberg mailing list or open issues on the Iceberg GitHub page.

Building

Iceberg is built using Gradle 4.4.
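Assuming the repository includes the standard Gradle wrapper (the usual convention for Gradle projects), a typical build looks like:

```shell
# Build all modules and run the test suite
# (the wrapper downloads the pinned Gradle 4.4 if it is not already present)
./gradlew build

# Build without running tests
./gradlew build -x test
```

Using the wrapper rather than a locally installed Gradle keeps the build reproducible against the version the project pins.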

Iceberg table support is organized in library modules:

  • iceberg-common contains utility classes used in other modules
  • iceberg-api contains the public Iceberg API
  • iceberg-core contains implementations of the Iceberg API and support for Avro data files; this is the module processing engines should depend on
  • iceberg-parquet is an optional module for working with tables backed by Parquet files
  • iceberg-orc is an optional module for working with tables backed by ORC files (experimental)
  • iceberg-hive is an implementation of Iceberg tables backed by the Hive Metastore, using its Thrift client

Iceberg also has modules for adding Iceberg support to processing engines:

  • iceberg-spark is an implementation of Spark's Datasource V2 API for Iceberg (use iceberg-runtime for a shaded version)
  • iceberg-data is a client library used to read Iceberg tables from JVM applications
  • iceberg-pig is an implementation of Pig's LoadFunc API for Iceberg
  • iceberg-presto-runtime generates a shaded runtime jar that is used by Presto to integrate with Iceberg tables

Compatibility

Iceberg's Spark integration is compatible with the following Spark versions:

Iceberg version   Spark version
0.2.0+            2.3.0
0.3.0+            2.3.2

About Iceberg

Overview

Iceberg tracks individual data files in a table instead of directories. This allows writers to create data files in-place and only adds files to the table in an explicit commit.

Table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic operation. The table metadata file tracks the table schema, partitioning config, other properties, and snapshots of the table contents. Each snapshot is a complete set of data files in the table at some point in time. Snapshots are listed in the metadata file, but the files in a snapshot are stored in separate manifest files.

The atomic transitions from one table metadata file to the next provide snapshot isolation. Readers use the snapshot that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.

Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics. A snapshot is the union of all files in its manifests. Manifest files can be shared between snapshots to avoid rewriting metadata that is slow-changing.
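The commit model described above can be sketched in plain Java. This is a hypothetical, simplified illustration (not Iceberg's actual API): table metadata is immutable, readers pin the metadata they loaded, and a commit is an atomic compare-and-swap of the metadata pointer that fails if another writer committed first.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of snapshot isolation via an atomic metadata pointer.
// All names here are illustrative, not Iceberg's real classes.
public class MetadataSwapSketch {

    // Immutable metadata: each version lists the table's complete set of data files.
    record TableMetadata(int version, List<String> dataFiles) {}

    static final AtomicReference<TableMetadata> current =
        new AtomicReference<>(new TableMetadata(1, List.of("a.parquet")));

    // A commit atomically replaces the metadata it was based on;
    // it fails (and the writer must retry) if another commit got there first.
    static boolean commit(TableMetadata base, TableMetadata updated) {
        return current.compareAndSet(base, updated);
    }

    public static void main(String[] args) {
        TableMetadata reader = current.get();   // a reader pins its snapshot

        TableMetadata base = current.get();
        TableMetadata next =
            new TableMetadata(2, List.of("a.parquet", "b.parquet"));
        boolean committed = commit(base, next);

        System.out.println(committed);                 // true
        System.out.println(reader.dataFiles().size()); // 1: reader is unaffected
        System.out.println(current.get().version());   // 2: new readers see v2
    }
}
```

The reader keeps seeing its pinned snapshot until it refreshes and picks up the new metadata pointer, which is the isolation guarantee the section above describes.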

Design benefits

This design addresses specific problems with the Hive layout: file listing is no longer used to plan jobs, and files are created in place without renaming.

This also provides improved guarantees and performance:

  • Snapshot isolation: Readers always use a consistent snapshot of the table, without needing to hold a lock. All table updates are atomic.
  • O(1) RPCs to plan: Instead of listing O(n) directories in a table to plan a job, reading a snapshot requires O(1) RPC calls.
  • Distributed planning: File pruning and predicate push-down are distributed to jobs, removing the metastore as a bottleneck.
  • Version history and rollback: Table snapshots are kept as history and tables can roll back if a job produces bad data.
  • Finer granularity partitioning: Distributed planning and O(1) RPC calls remove the current barriers to finer-grained partitioning.
  • Safe file-level operations: By supporting atomic changes, Iceberg enables new use cases, like safely compacting small files and safely appending late data to tables.

Why a new table format?

There are several problems with the current format:

  • There is no specification. Implementations don’t handle all cases consistently. For example, Hive and Spark implement bucketing with different hash functions, so their bucketed tables are not compatible. Hive uses a locking scheme to make cross-partition changes safe, but no other implementations use it.
  • The metastore only tracks partitions. Files within partitions are discovered by listing partition paths. Listing partitions to plan a read is expensive, especially when using S3. This also makes atomic changes to a table’s contents impossible. Netflix has developed custom Metastore extensions to swap partition locations, but these are slow because it is expensive to make thousands of updates in a database transaction.
  • Operations depend on file rename. Most output committers depend on rename operations to implement guarantees and reduce the amount of time tables only have partial data from a write. But rename is not a metadata-only operation in S3 and will copy data. The new S3 committers that use multipart upload make this better, but can’t entirely solve the problem and put a lot of load on the S3 index during job commit.

Table data is tracked in both a central metastore, for partitions, and the file system, for files. The central metastore can be a scale bottleneck and the file system doesn't (and shouldn't) provide transactions to isolate concurrent reads and writes. The current table layout cannot be patched to fix its major problems.

Other design goals

In addition to changes in how table contents are tracked, Iceberg's design improves a few other areas:

  • Schema evolution: Columns are tracked by ID to support add/drop/rename.
  • Reliable types: Iceberg uses a core set of types, tested to work consistently across all of the supported data formats.
  • Metrics: The format includes cost-based optimization metrics stored with data files for better job planning.
  • Invisible partitioning: Partitioning is built into Iceberg as table configuration; it can plan efficient queries without extra partition predicates.
  • Unmodified partition data: The Hive layout stores partition data escaped in strings. Iceberg stores partition data without modification.
  • Portable spec: Tables are not tied to Java. Iceberg has a clear specification for other implementations.
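The schema-evolution point above (columns tracked by ID) can be made concrete with a small sketch. This is a hypothetical illustration, not Iceberg's real schema classes: because data files reference columns by a stable field ID rather than by name, renaming a column does not break files written before the rename.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of id-based column tracking (names are hypothetical).
public class ColumnIdSketch {

    // The schema maps stable field ids to their current names; ids never change.
    static final Map<Integer, String> schema = new HashMap<>();

    static void rename(int fieldId, String newName) {
        schema.put(fieldId, newName); // only the name changes; the id is stable
    }

    public static void main(String[] args) {
        schema.put(1, "event_ts");
        // A data file written before the rename refers to field id 1, not the name.
        int fieldIdInDataFile = 1;

        rename(1, "event_time"); // a user renames the column

        // The reader still resolves the column, because resolution is by id.
        System.out.println(schema.get(fieldIdInDataFile)); // event_time
    }
}
```

Name-based resolution (as in the Hive layout) would either miss the renamed column or, after a drop-and-re-add, silently read the wrong data; id-based tracking avoids both failure modes.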