utdemir / Distributed Dataset

Licence: BSD-3-Clause
A distributed data processing framework in Haskell.

Programming Languages

haskell
3896 projects

Projects that are alternatives of or similar to Distributed Dataset

Js Spark
Realtime calculation distributed system. AKA distributed lodash
Stars: ✭ 187 (+73.15%)
Mutual labels:  spark, distributed
Ruby Spark
Ruby wrapper for Apache Spark
Stars: ✭ 221 (+104.63%)
Mutual labels:  spark, distributed
Xlearning Xdml
extremely distributed machine learning
Stars: ✭ 113 (+4.63%)
Mutual labels:  spark, distributed
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (+26.85%)
Mutual labels:  aws-lambda, spark
Ytk Learn
Ytk-learn is a distributed machine learning library which implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Stars: ✭ 337 (+212.04%)
Mutual labels:  spark, distributed
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+617.59%)
Mutual labels:  spark, distributed
Ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+2005.56%)
Mutual labels:  spark, distributed
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-53.7%)
Mutual labels:  spark, data-processing
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Stars: ✭ 54 (-50%)
Mutual labels:  spark, data-processing
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+5137.04%)
Mutual labels:  spark, distributed
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-49.07%)
Mutual labels:  spark, data-processing
Logigsk
A Linux-based software package to control LEDs on Logitech G910, G810, G610 and G410.
Stars: ✭ 107 (-0.93%)
Mutual labels:  spark
Smart Security Camera
A Pi Zero and Motion based webcamera that forwards images to Amazon Web Services for Image Processing
Stars: ✭ 103 (-4.63%)
Mutual labels:  aws-lambda
Hark Lang
Build stateful and portable serverless applications without thinking about infrastructure.
Stars: ✭ 103 (-4.63%)
Mutual labels:  aws-lambda
Cloud Game
Web-based Cloud Gaming service for Retro Game
Stars: ✭ 1,374 (+1172.22%)
Mutual labels:  distributed
Serverless
⚡ Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions & more! –
Stars: ✭ 41,584 (+38403.7%)
Mutual labels:  aws-lambda
Ipfs.ink
PROJECT HAS BEEN SHUTDOWN - Publish and render markdown essays to and from ipfs
Stars: ✭ 106 (-1.85%)
Mutual labels:  distributed
Serverless Sharp
Serverless image optimizer for S3, Lambda, and Cloudfront
Stars: ✭ 102 (-5.56%)
Mutual labels:  aws-lambda
Spark Terasort
Spark Terasort
Stars: ✭ 101 (-6.48%)
Mutual labels:  spark
Bojack
🐴 The unreliable key-value store
Stars: ✭ 101 (-6.48%)
Mutual labels:  distributed

distributed-dataset


A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
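To make the idea concrete, here is a small local analogue of a distributed multiset: a list of partitions, each of which could in principle live on a different executor. The names dMap, dFilter and dAggr echo the library's Spark-like vocabulary, but this is an illustrative toy, not the package's actual API (which distributes work via serialised closures and Backends):

```haskell
-- A Dataset modelled as a list of partitions. In the real library each
-- partition would be processed on a remote executor; here everything
-- runs locally.
newtype Dataset a = Dataset { partitions :: [[a]] }

-- Per-element transformations apply independently to every partition,
-- so they need no communication between executors.
dMap :: (a -> b) -> Dataset a -> Dataset b
dMap f (Dataset ps) = Dataset (map (map f) ps)

dFilter :: (a -> Bool) -> Dataset a -> Dataset a
dFilter p (Dataset ps) = Dataset (map (filter p) ps)

-- A shuffle-free aggregation: fold each partition locally, then merge
-- the per-partition results.
dAggr :: (a -> b -> b) -> b -> (b -> b -> b) -> Dataset a -> b
dAggr step z merge (Dataset ps) =
  foldr merge z (map (foldr step z) ps)

main :: IO ()
main = do
  let ds = Dataset [[1 .. 10], [11 .. 20]] :: Dataset Int
      total = dAggr (+) 0 (+) (dFilter even (dMap (* 2) ds))
  print total  -- 420
```

The point of the partitioned representation is that dMap and dFilter never cross partition boundaries, while aggregations only exchange the small per-partition results.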

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
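The fork/await pattern behind Control.Distributed.Fork can be imitated locally with plain threads. This sketch uses only base; in the real module the forked action is serialised and run on a remote executor via a Backend, whereas here it merely runs on another thread:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Start an action "somewhere else" and hand back a handle to its result.
fork :: IO a -> IO (MVar a)
fork act = do
  h <- newEmptyMVar
  _ <- forkIO (act >>= putMVar h)
  return h

-- Block until the forked action has finished.
await :: MVar a -> IO a
await = takeMVar

main :: IO ()
main = do
  -- Embarrassingly parallel: every task is independent of the others,
  -- so we can fork them all and await the results in any order.
  handles <- mapM (fork . pure . (^ 2)) [1 .. 10 :: Integer]
  results <- mapM await handles
  print (sum results)  -- 385
```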

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.

distributed-dataset-opendatasets

Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
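The archetypal query over GH Archive data is an aggregation over event records. The Event type below is a hypothetical stand-in (the real package derives its types from the archive's JSON), and the sample data is inlined so the sketch is self-contained:

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified shape for a GH Archive event.
data Event = Event { evType :: String, evRepo :: String }

-- Count events per type, the kind of aggregation one would run as a
-- distributed Dataset pipeline over the full archive.
countByType :: [Event] -> Map.Map String Int
countByType = Map.fromListWith (+) . map (\e -> (evType e, 1))

main :: IO ()
main = do
  let events =
        [ Event "PushEvent"  "utdemir/distributed-dataset"
        , Event "WatchEvent" "utdemir/distributed-dataset"
        , Event "PushEvent"  "apache/spark"
        ]
  print (countByType events)
  -- fromList [("PushEvent",2),("WatchEvent",1)]
```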

Running the example

  • Clone the repository.

    $ git clone https://github.com/utdemir/distributed-dataset
    $ cd distributed-dataset
    
  • Make sure that you have AWS credentials set up. The easiest way is to install the AWS command-line interface and run:

    $ aws configure
    
  • Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:

    $ aws s3api create-bucket --bucket my-s3-bucket
    
  • Build and run the example:

    • If you use Nix on Linux:

      • (Recommended) Use my binary cache on Cachix to reduce compilation times:

        $ nix-env -i cachix # or your preferred installation method
        $ cachix use utdemir
      
      • Then:

        $ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
        
    • If you use stack (requires Docker, works on Linux and macOS):

      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
      

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

  • In order to develop distributed-dataset, you can use:
    • On Linux: Nix, cabal-install or stack.
    • On macOS: stack with Docker.
  • Use ormolu to format source code.

Nix

  • You can use my binary cache on Cachix so that you don't recompile half of Hackage.
  • nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
  • The easiest way to get a development environment is to run sos at the top-level directory inside a nix-shell.

Stack

  • Make sure that you have Docker installed.
  • Use stack as usual; it will automatically use a Docker image.
  • Run ./make.sh stack-build before you send a PR to test different resolvers.

Related Work

Papers

Projects
