
delta-io / Delta

License: Apache-2.0
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

Programming Languages

scala
5932 projects
python
139335 projects - #7 most used programming language
java
68154 projects - #9 most used programming language

Projects that are alternatives to or similar to Delta

awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
Stars: ✭ 95 (-97.57%)
Mutual labels:  big-data, spark, analytics
Hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (-93.7%)
Mutual labels:  spark, analytics, big-data
Logisland
Scalable stream processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable, ready-to-use processors, data sources, and sinks are available.
Stars: ✭ 97 (-97.51%)
Mutual labels:  spark, analytics, big-data
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-96.16%)
Mutual labels:  spark, analytics, big-data
spark-acid
ACID Data Source for Apache Spark based on Hive ACID
Stars: ✭ 91 (-97.67%)
Mutual labels:  big-data, spark, acid
Geopyspark
GeoTrellis for PySpark
Stars: ✭ 167 (-95.72%)
Mutual labels:  spark, big-data
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (-25.72%)
Mutual labels:  spark, big-data
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-94.47%)
Mutual labels:  spark, big-data
Sparkling Graph
SparklingGraph provides an easy-to-use set of features that give you the ability to process large-scale graphs using Spark and GraphX.
Stars: ✭ 139 (-96.44%)
Mutual labels:  spark, big-data
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with creation, editing, and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (-93.67%)
Mutual labels:  spark, big-data
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (-22.01%)
Mutual labels:  spark, big-data
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-96.11%)
Mutual labels:  spark, big-data
Spark.jl
Julia binding for Apache Spark
Stars: ✭ 153 (-96.08%)
Mutual labels:  spark, big-data
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-94.49%)
Mutual labels:  spark, big-data
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-97.16%)
Mutual labels:  big-data, spark
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, after which leaflet.js loads an OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark; perhaps Spark is not well suited to this kind of computation, or the algorithm is poorly designed, because the parallel computation is slower than a single machine. The Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git.
Stars: ✭ 13 (-99.67%)
Mutual labels:  big-data, spark
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-99.64%)
Mutual labels:  big-data, spark
Opaque
An encrypted data analytics platform
Stars: ✭ 129 (-96.69%)
Mutual labels:  spark, analytics
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (-96.49%)
Mutual labels:  spark, big-data
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+17.37%)
Mutual labels:  analytics, big-data

Delta Lake Logo


Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

See the Delta Lake Documentation for details.

See the Quick Start Guide to get started with Scala, Java and Python.
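As a minimal sketch of what the quick start covers (run from a Spark shell with Delta Lake on the classpath; the table path below is a placeholder), you can write and read back a Delta table in Scala:

// Write a small table in Delta format (placeholder path).
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

// Read it back.
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()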

Latest Binaries

See the online documentation for the latest release.

API Documentation

Compatibility

Compatibility with Apache Spark Versions

See the online documentation for the releases and their compatibility with Apache Spark versions.

API Compatibility

There are two types of APIs provided by the Delta Lake project.

  • Spark-based APIs - You can read and write Delta tables through the DataFrameReader/Writer APIs (i.e., spark.read, df.write, spark.readStream, and df.writeStream); a streaming sketch follows this list. Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
  • Direct Java/Scala/Python APIs - The classes and methods documented in the API docs are considered stable public APIs. All other classes, interfaces, and methods that may be directly accessible in code are considered internal and are subject to change across releases.
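As a rough illustration of the streaming entry points named in the first bullet (table paths and the checkpoint location are placeholders), streaming reads and writes go through the same format("delta") option as batch reads and writes:

// Streaming sketch; assumes a Spark session with Delta Lake available.
val stream = spark.readStream.format("delta").load("/tmp/delta-source")

stream.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/delta-sink")
  .start("/tmp/delta-sink")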

Data Storage Compatibility

Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).

Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
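For example, the versions a table requires are recorded as a protocol action in its _delta_log directory. A minimal sketch of inspecting the first log entry (the path is a placeholder and the version numbers shown are illustrative):

import scala.io.Source

// Print the protocol action from the first transaction log entry.
val firstLogEntry = "/tmp/delta-table/_delta_log/00000000000000000000.json"
Source.fromFile(firstLogEntry).getLines()
  .filter(_.contains("\"protocol\""))
  .foreach(println)
// e.g. {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}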

Roadmap

For a detailed timeline, see the project roadmap.

Building

Delta Lake is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Transaction Protocol

The Delta Transaction Log Protocol document provides a specification of the transaction protocol.

Requirements for Underlying Storage Systems

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.

  1. Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  2. Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  3. Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.

See the online documentation on Storage Configuration for details.
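To make the mutual exclusion requirement concrete, here is a small sketch, not Delta Lake's actual implementation, of a put-if-absent write on a local file system: creating the file with CREATE_NEW fails atomically if another writer has already claimed the destination. Object stores that lack such a primitive need the extra coordination described in the Storage Configuration documentation.

import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Paths, StandardOpenOption}

// Sketch only: exactly one writer can create the file at the final destination.
def writeExclusively(path: String, content: String): Boolean =
  try {
    Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE_NEW)
    true
  } catch {
    case _: FileAlreadyExistsException => false // another writer won the race
  }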

Concurrency Control

Delta Lake ensures serializability for concurrent reads and writes. Please see Delta Lake Concurrency Control for more details.

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to Delta Lake. See our CONTRIBUTING.md for more details.

We also adhere to the Delta Lake Code of Conduct.

License

Apache License 2.0, see LICENSE.

Community

There are two mediums of communication within the Delta Lake community.
