kamu-data / Kamu Cli
Welcome to kamu - a new-generation data management and transformation tool!
About
kamu is a reference implementation of Open Data Fabric - a Web 3.0 technology that powers a distributed structured data supply chain for providing timely, high-quality, and verifiable data for data science, smart contracts, web and applications.
Using kamu you can become a member of the world's first peer-to-peer data pipeline that:
- Connects publishers and consumers of data worldwide.
- Enables effective collaboration of people around data transformation and cleaning.
- Ensures data propagates with minimal latency.
- Provides the most complete, secure, and fully accurate lineage and provenance information on where every piece of data came from and how it was produced.
- Guarantees reproducibility of all data workflows.
Documentation
- Installation
- First Steps
- Examples
  - Currency Conversion [temporal-table joins]
  - Stock Market Trading Data Analysis [aggregations, temporal-table joins, watermarks, notebooks]
  - Overdue Order Shipments Detection [stream-to-stream joins, watermarks]
  - Housing Prices Analysis [GIS functions and joins, notebooks]
- Ingesting Data
  - Supported Formats
  - Merge Strategies
- Transforming Data
  - Streaming Aggregations
  - Temporal Table Joins
  - Stream-to-Stream Joins
  - Watermarks
  - Geo-Spatial Data
- Exploring Data
- Sharing data
- Troubleshooting
- Reference
  - Metadata Schemas
  - Supported Engines
  - Supported Remotes
- Contributing
  - Contribution Guidelines
  - Developer Guide
Learning Materials
- Kamu Blog: Introducing Open Data Fabric - a casual introduction.
- Kamu 101 - First Steps - a video overview of key features.
- Open Data Fabric protocol specification - technical overview and many gory details.
- Building a Distributed Collaborative Data Pipeline - Technical talk from Data+AI Summit 2020
Features
- For Data Publishers
- For Data Professionals
  - Collaborate on cleaning and improving the data of existing datasets
  - Create derivative datasets by transforming, enriching, and summarizing data others have published
  - Write a query once and run it forever with one of our state-of-the-art stream processing engines
  - Always stay up-to-date by pulling the latest updates from the data sources with just one command
  - Built-in support for GIS data
- For Data Consumers
  - Download a dataset from a shared repository
  - Easily verify that all data comes from trusted sources
  - Audit the chain of transformations this data went through
  - Validate that downloaded data was in fact produced by the declared transformations
- For Data Exploration
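The consumer-side checks listed above (confirming that downloaded data matches what was declared, and that it can be reproduced from its sources) can be illustrated with a small sketch. This is only a conceptual illustration, not how kamu or the Open Data Fabric protocol implement verification: the canonical-JSON hashing scheme and the names `content_hash`, `to_dollars`, and `verify` are all assumptions made up for this example.

```python
import hashlib
import json


def content_hash(rows):
    """Return a SHA-256 digest of the rows serialized in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_dollars(rows):
    """Hypothetical declared transformation: drop non-positive amounts, convert cents to dollars."""
    return [{"id": r["id"], "usd": r["cents"] / 100} for r in rows if r["cents"] > 0]


def verify(source_rows, downloaded_rows, declared_hash, declared_transform):
    """Check the download against the declared hash, then replay the transformation."""
    if content_hash(downloaded_rows) != declared_hash:
        return False  # the download differs from what the publisher declared
    replayed = declared_transform(source_rows)
    return content_hash(replayed) == declared_hash  # must be reproducible from the source


# Hypothetical source data and the derivative dataset a publisher would share.
source = [{"id": 1, "cents": 250}, {"id": 2, "cents": -40}, {"id": 3, "cents": 1999}]
published = to_dollars(source)
declared = content_hash(published)

print(verify(source, published, declared, to_dollars))  # prints True

# A download whose first record was altered no longer matches the declared hash.
tampered = [dict(published[0], usd=999.0)] + published[1:]
print(verify(source, tampered, declared, to_dollars))  # prints False
```

The sketch only works because the transformation is deterministic: replaying it over the same source must yield byte-identical output, which is the same property that makes a workflow reproducible.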
Project Status Disclaimer
kamu is alpha-quality software. Our main goal currently is to demonstrate the potential of the Open Data Fabric protocol and its transformative properties to the community and the industry, and to validate our ideas.
Naturally, we don't recommend using kamu for any critical tasks - it's definitely not prod-ready. We are, however, absolutely delighted to use kamu for our personal data analytics needs and small projects, and hope you will enjoy it too.
If you do - simply make sure to maintain your source data separately and don't rely on kamu for data storage. This way, any time a new version comes out that breaks compatibility, you can simply delete your kamu workspace and re-create it from scratch in a matter of seconds.
Also, please be patient with current performance and resource usage. We fully realize that waiting 15s to process a few KiB of CSV isn't great. Stream processing is a relatively new area, and the data processing engines kamu uses (e.g. Apache Spark and Flink) are tailored to run in large clusters, not on a laptop. They take a lot of resources just to boot up, so the start-stop-continue nature of kamu's transformations is at odds with their design. We hope the industry will recognize our use case and expect to see better support for it in the future. We are committed to improving performance significantly in the near future.
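To make the "start-stop-continue" pattern mentioned above more concrete, here is a minimal sketch of a job that starts, processes only the records it has not seen yet, saves its state, and exits. It is an illustration under stated assumptions rather than a description of kamu or its engines: the running-sum "query", the JSON checkpoint file, and the name `run_once` are all hypothetical.

```python
import json
import os
import tempfile


def run_once(records, checkpoint_path):
    """Process only records past the saved offset, then persist the new state and exit."""
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # continue: resume from where the previous run stopped

    new_records = records[state["offset"]:]
    state["total"] += sum(new_records)  # the "query" here is just a running sum
    state["offset"] = len(records)

    with open(checkpoint_path, "w") as f:
        json.dump(state, f)  # stop: everything needed to resume is now on disk
    return state["total"], len(new_records)


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

log = [5, 7]
first = run_once(log, checkpoint)   # the first run sees both records

log += [10]
second = run_once(log, checkpoint)  # the second run sees only the new record

print(first, second)  # prints (12, 2) (22, 1)
```

Each run in this toy version is cheap because it touches only new input. An engine designed for long-lived clusters pays its full startup cost on every such run, which is the mismatch the paragraph above describes.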