
delta-io / Delta

License: Apache-2.0
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

Programming Languages

scala
5932 projects
python
139335 projects - #7 most used programming language
java
68154 projects - #9 most used programming language

Projects that are alternatives to or similar to Delta

awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
Stars: ✭ 95 (-97.57%)
Mutual labels:  big-data, spark, analytics
Hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (-93.7%)
Mutual labels:  spark, analytics, big-data
Logisland
Scalable stream processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable, ready-to-use processors, data sources, and sinks are available.
Stars: ✭ 97 (-97.51%)
Mutual labels:  spark, analytics, big-data
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-96.16%)
Mutual labels:  spark, analytics, big-data
spark-acid
ACID Data Source for Apache Spark based on Hive ACID
Stars: ✭ 91 (-97.67%)
Mutual labels:  big-data, spark, acid
Geopyspark
GeoTrellis for PySpark
Stars: ✭ 167 (-95.72%)
Mutual labels:  spark, big-data
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (-25.72%)
Mutual labels:  spark, big-data
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-94.47%)
Mutual labels:  spark, big-data
Sparkling Graph
SparklingGraph provides an easy-to-use set of features that give you the ability to process large-scale graphs using Spark and GraphX.
Stars: ✭ 139 (-96.44%)
Mutual labels:  spark, big-data
Data Accelerator
Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with creation, editing, and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (-93.67%)
Mutual labels:  spark, big-data
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (-22.01%)
Mutual labels:  spark, big-data
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-96.11%)
Mutual labels:  spark, big-data
Spark.jl
Julia binding for Apache Spark
Stars: ✭ 153 (-96.08%)
Mutual labels:  spark, big-data
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-94.49%)
Mutual labels:  spark, big-data
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-97.16%)
Mutual labels:  big-data, spark
leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large to render a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, after which leaflet.js loads an OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark; perhaps Spark is not well suited to this kind of computation, or the algorithm is poorly designed, because the parallel computation is slower than a single machine. The Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git.
Stars: ✭ 13 (-99.67%)
Mutual labels:  big-data, spark
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-99.64%)
Mutual labels:  big-data, spark
Opaque
An encrypted data analytics platform
Stars: ✭ 129 (-96.69%)
Mutual labels:  spark, analytics
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (-96.49%)
Mutual labels:  spark, big-data
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+17.37%)
Mutual labels:  analytics, big-data

Delta Lake Logo


Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

See the Delta Lake Documentation for details.

See the Quick Start Guide to get started with Scala, Java and Python.
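As a minimal sketch of what the quick start covers (run from a Spark shell with Delta Lake on the classpath; the table path below is a placeholder), you can write and read back a Delta table in Scala:

// Write a small table in Delta format (placeholder path).
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

// Read it back.
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()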

Latest Binaries

See the online documentation for the latest release.

API Documentation

Compatibility

Compatibility with Apache Spark Versions

See the online documentation for the releases and their compatibility with Apache Spark versions.

API Compatibility

There are two types of APIs provided by the Delta Lake project.

  • Spark-based APIs - You can read and write Delta tables through the DataFrameReader/Writer APIs (i.e., spark.read, df.write, spark.readStream, and df.writeStream); a streaming sketch follows this list. Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
  • Direct Java/Scala/Python APIs - The classes and methods documented in the API docs are considered stable public APIs. All other classes, interfaces, and methods that may be directly accessible in code are considered internal and are subject to change across releases.
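As a rough illustration of the streaming entry points named in the first bullet (table paths and the checkpoint location are placeholders), streaming reads and writes go through the same format("delta") option as batch reads and writes:

// Streaming sketch; assumes a Spark session with Delta Lake available.
val stream = spark.readStream.format("delta").load("/tmp/delta-source")

stream.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/delta-sink")
  .start("/tmp/delta-sink")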

Data Storage Compatibility

Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e., newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e., an older version of Delta Lake may not be able to read a table produced by a newer version).

Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
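For example, the versions a table requires are recorded as a protocol action in its _delta_log directory. A minimal sketch of inspecting the first log entry (the path is a placeholder and the version numbers shown are illustrative):

import scala.io.Source

// Print the protocol action from the first transaction log entry.
val firstLogEntry = "/tmp/delta-table/_delta_log/00000000000000000000.json"
Source.fromFile(firstLogEntry).getLines()
  .filter(_.contains("\"protocol\""))
  .foreach(println)
// e.g. {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}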

Roadmap

For a detailed timeline, see the project roadmap.

Building

Delta Lake is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Transaction Protocol

The Delta Transaction Log Protocol document provides a specification of the transaction protocol.

Requirements for Underlying Storage Systems

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.

  1. Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  2. Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  3. Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.

See the online documentation on Storage Configuration for details.
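To make the mutual exclusion requirement concrete, here is a small sketch, not Delta Lake's actual implementation, of a put-if-absent write on a local file system: creating the file with CREATE_NEW fails atomically if another writer has already claimed the destination. Object stores that lack such a primitive need the extra coordination described in the Storage Configuration documentation.

import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Paths, StandardOpenOption}

// Sketch only: exactly one writer can create the file at the final destination.
def writeExclusively(path: String, content: String): Boolean =
  try {
    Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE_NEW)
    true
  } catch {
    case _: FileAlreadyExistsException => false // another writer won the race
  }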

Concurrency Control

Delta Lake ensures serializability for concurrent reads and writes. Please see Delta Lake Concurrency Control for more details.

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to Delta Lake. See our CONTRIBUTING.md for more details.

We also adhere to the Delta Lake Code of Conduct.

License

Apache License 2.0, see LICENSE.

Community

There are two mediums of communication within the Delta Lake community.
