All Projects → Refefer → Dampr

Refefer / Dampr

Licence: other
Python Data Processing library

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dampr

Asakusafw
Asakusa Framework
Stars: ✭ 114 (+11.76%)
Mutual labels:  mapreduce, batch-processing
Repository
个人学习知识库涉及到数据仓库建模、实时计算、大数据、Java、算法等。
Stars: ✭ 92 (-9.8%)
Mutual labels:  mapreduce
Elixir Iteraptor
Handy enumerable operations implementation.
Stars: ✭ 55 (-46.08%)
Mutual labels:  mapreduce
Cekirdekler
Multi-device OpenCL kernel load balancer and pipeliner API for C#. Uses shared-distributed memory model to keep GPUs updated fast while using same kernel on all devices(for simplicity).
Stars: ✭ 76 (-25.49%)
Mutual labels:  batch-processing
Pyflowgraph
Python Module for displaying flowgraphs using Pyside or PyQt.
Stars: ✭ 61 (-40.2%)
Mutual labels:  dataflow
Intelmq Manager
IntelMQ Manager is a graphical interface to manage configurations for IntelMQ framework.
Stars: ✭ 81 (-20.59%)
Mutual labels:  dataflow
Batchman
This library for Android will take any set of events and batch them up before sending it to the server. It also supports persisting the events on disk so that no event gets lost because of an app crash. Typically used for developing any in-house analytics sdk where you have to make a single api call to push events to the server but you want to optimize the calls so that the api call happens only once per x events, or say once per x minutes. It also supports exponential backoff in case of network failures
Stars: ✭ 50 (-50.98%)
Mutual labels:  batch-processing
Gcp Variant Transforms
GCP Variant Transforms
Stars: ✭ 100 (-1.96%)
Mutual labels:  dataflow
Faast.js
Serverless batch computing made simple.
Stars: ✭ 1,323 (+1197.06%)
Mutual labels:  batch-processing
Google Cloud Eclipse
Google Cloud Platform plugin for Eclipse
Stars: ✭ 75 (-26.47%)
Mutual labels:  dataflow
Vue Dataflow Editor
Vue 2 dataflow graph editor
Stars: ✭ 73 (-28.43%)
Mutual labels:  dataflow
Src
A light-weight distributed stream computing framework for Golang
Stars: ✭ 67 (-34.31%)
Mutual labels:  mapreduce
Goflow
Flow-based and dataflow programming library for Go (golang)
Stars: ✭ 1,276 (+1150.98%)
Mutual labels:  dataflow
Batchimageprocessor
A Mass Image Processing tool for Windows
Stars: ✭ 55 (-46.08%)
Mutual labels:  batch-processing
Nextflow
A DSL for data-driven computational pipelines
Stars: ✭ 1,337 (+1210.78%)
Mutual labels:  dataflow
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-46.08%)
Mutual labels:  batch-processing
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-30.39%)
Mutual labels:  mapreduce
Optbinning
Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning
Stars: ✭ 79 (-22.55%)
Mutual labels:  batch-processing
Derivatives
🌱 Your companion to create derived values from a single source (atom)
Stars: ✭ 101 (-0.98%)
Mutual labels:  dataflow
Bigdata Notes
大数据入门指南 ⭐
Stars: ✭ 10,991 (+10675.49%)
Mutual labels:  mapreduce

Dampr - Pure Python Data Processing

Dampr is intended for use as single machine data processing: it's natively out of core, supports map and reduce side joins, associative reduce combiners, and provides a high level interface for constructing Dataflow DAGs.

It's reasonably fast, easy to get started, and scales linearly by core. It has no external dependencies, making it extremely lightweight and easy to install. It has reasonable REPL support for data analysis, though there are better tools for the job for it.

Features

  • Self-Contained: No external dependencies and simple to install
  • High-Level API: Easy computation
  • Out-Of-Core: Scales up to 100s of GB to TBs of data. No need to worry about Out of Memory errors!
  • Reasonably Fast: Linearly scales to number of cores on the machine
  • Powerful: Provides a number of advanced joins and other functions for complex workflows

Setup

pip install dampr

or

python setup.py install

API

docs/dampr/index.html

Examples

Look at the examples directory for more complete examples.

Similarly, the tests are intended to be fairly readable as well. You can view them in the tests directory.

Example - WC

import sys 

from dampr import Dampr

def main(fname):

    wc = Dampr.text(fname) \
            .map(lambda v: len(v.split())) \
            .a_group_by(lambda x: 1) \
            .sum()

    results = wc.run("word-count")
    for k, v in results:
        print("Word Count:", v)

    results.delete()

if __name__ == '__main__':
    main(sys.argv[1])

Why not Dask for data processing?

Dask is great! I'd highly recommend it for fast analytics and datasets which don't need complex joins!

However.

Dask is really intended for in-memory computation and more analytics processing via interfaces like DataFrames. While it does have a reasonable bag implementation for data processing, it's missing some important features such as joins across large datasets. I have routinely run into OOM errors when processing datasets larger than memory when trying more complicated processes.

In that sense, Dampr is attempting to bridge that gap of complex data processing on a single machine and heavy-weight systems, geared toward ease of use.

Why not PySpark for data processing?

PySpark is great! I'd highly recommend it for extremely large datasets and cluster computation!

However.

PySpark requires large amounts of setup to really get going. It's the antithesis of "light-weight" and really geared for large scale production deployments. I personally don't like it for proof of concepts or one-offs; it requires just a bit too much tuning to get what you need.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].