utdemir / Distributed Dataset

Licence: BSD-3-Clause
A distributed data processing framework in Haskell.

Programming Languages

haskell
3896 projects

Projects that are alternatives of or similar to Distributed Dataset

Js Spark
Realtime calculation distributed system. AKA distributed lodash
Stars: ✭ 187 (+73.15%)
Mutual labels:  spark, distributed
Ruby Spark
Ruby wrapper for Apache Spark
Stars: ✭ 221 (+104.63%)
Mutual labels:  spark, distributed
Xlearning Xdml
extremely distributed machine learning
Stars: ✭ 113 (+4.63%)
Mutual labels:  spark, distributed
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (+26.85%)
Mutual labels:  aws-lambda, spark
Ytk Learn
Ytk-learn is a distributed machine learning library which implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Stars: ✭ 337 (+212.04%)
Mutual labels:  spark, distributed
Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+617.59%)
Mutual labels:  spark, distributed
Ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+2005.56%)
Mutual labels:  spark, distributed
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-53.7%)
Mutual labels:  spark, data-processing
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Stars: ✭ 54 (-50%)
Mutual labels:  spark, data-processing
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+5137.04%)
Mutual labels:  spark, distributed
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-49.07%)
Mutual labels:  spark, data-processing
Logigsk
A Linux-based software package to control LEDs on Logitech G910, G810, G610 and G410.
Stars: ✭ 107 (-0.93%)
Mutual labels:  spark
Smart Security Camera
A Pi Zero and Motion based webcamera that forwards images to Amazon Web Services for Image Processing
Stars: ✭ 103 (-4.63%)
Mutual labels:  aws-lambda
Hark Lang
Build stateful and portable serverless applications without thinking about infrastructure.
Stars: ✭ 103 (-4.63%)
Mutual labels:  aws-lambda
Cloud Game
Web-based Cloud Gaming service for Retro Game
Stars: ✭ 1,374 (+1172.22%)
Mutual labels:  distributed
Serverless
⚡ Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions & more! –
Stars: ✭ 41,584 (+38403.7%)
Mutual labels:  aws-lambda
Ipfs.ink
PROJECT HAS BEEN SHUTDOWN - Publish and render markdown essays to and from ipfs
Stars: ✭ 106 (-1.85%)
Mutual labels:  distributed
Serverless Sharp
Serverless image optimizer for S3, Lambda, and Cloudfront
Stars: ✭ 102 (-5.56%)
Mutual labels:  aws-lambda
Spark Terasort
Spark Terasort
Stars: ✭ 101 (-6.48%)
Mutual labels:  spark
Bojack
🐴 The unreliable key-value store
Stars: ✭ 101 (-6.48%)
Mutual labels:  distributed

distributed-dataset


A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
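To make the idea concrete, here is a small local analogue of a distributed multiset: a list of partitions, each of which could in principle live on a different executor. The names dMap, dFilter and dAggr echo the library's Spark-like vocabulary, but this is an illustrative toy, not the package's actual API (which distributes work via serialised closures and Backends):

```haskell
-- A Dataset modelled as a list of partitions. In the real library each
-- partition would be processed on a remote executor; here everything
-- runs locally.
newtype Dataset a = Dataset { partitions :: [[a]] }

-- Per-element transformations apply independently to every partition,
-- so they need no communication between executors.
dMap :: (a -> b) -> Dataset a -> Dataset b
dMap f (Dataset ps) = Dataset (map (map f) ps)

dFilter :: (a -> Bool) -> Dataset a -> Dataset a
dFilter p (Dataset ps) = Dataset (map (filter p) ps)

-- A shuffle-free aggregation: fold each partition locally, then merge
-- the per-partition results.
dAggr :: (a -> b -> b) -> b -> (b -> b -> b) -> Dataset a -> b
dAggr step z merge (Dataset ps) =
  foldr merge z (map (foldr step z) ps)

main :: IO ()
main = do
  let ds = Dataset [[1 .. 10], [11 .. 20]] :: Dataset Int
      total = dAggr (+) 0 (+) (dFilter even (dMap (* 2) ds))
  print total  -- 420
```

The point of the partitioned representation is that dMap and dFilter never cross partition boundaries, while aggregations only exchange the small per-partition results.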

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
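The fork/await pattern behind Control.Distributed.Fork can be imitated locally with plain threads. This sketch uses only base; in the real module the forked action is serialised and run on a remote executor via a Backend, whereas here it merely runs on another thread:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Start an action "somewhere else" and hand back a handle to its result.
fork :: IO a -> IO (MVar a)
fork act = do
  h <- newEmptyMVar
  _ <- forkIO (act >>= putMVar h)
  return h

-- Block until the forked action has finished.
await :: MVar a -> IO a
await = takeMVar

main :: IO ()
main = do
  -- Embarrassingly parallel: every task is independent of the others,
  -- so we can fork them all and await the results in any order.
  handles <- mapM (fork . pure . (^ 2)) [1 .. 10 :: Integer]
  results <- mapM await handles
  print (sum results)  -- 385
```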

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.

distributed-dataset-opendatasets

Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
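The archetypal query over GH Archive data is an aggregation over event records. The Event type below is a hypothetical stand-in (the real package derives its types from the archive's JSON), and the sample data is inlined so the sketch is self-contained:

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified shape for a GH Archive event.
data Event = Event { evType :: String, evRepo :: String }

-- Count events per type, the kind of aggregation one would run as a
-- distributed Dataset pipeline over the full archive.
countByType :: [Event] -> Map.Map String Int
countByType = Map.fromListWith (+) . map (\e -> (evType e, 1))

main :: IO ()
main = do
  let events =
        [ Event "PushEvent"  "utdemir/distributed-dataset"
        , Event "WatchEvent" "utdemir/distributed-dataset"
        , Event "PushEvent"  "apache/spark"
        ]
  print (countByType events)
  -- fromList [("PushEvent",2),("WatchEvent",1)]
```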

Running the example

  • Clone the repository.

    $ git clone https://github.com/utdemir/distributed-dataset
    $ cd distributed-dataset
    
  • Make sure that you have AWS credentials set up. The easiest way is to install the AWS command-line interface and run:

    $ aws configure
    
  • Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:

    $ aws s3api create-bucket --bucket my-s3-bucket
    
  • Build and run the example:

    • If you use Nix on Linux:

      • (Recommended) Use my binary cache on Cachix to reduce compilation times:

        $ nix-env -i cachix # or your preferred installation method
        $ cachix use utdemir
      
      • Then:

        $ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
        
    • If you use stack (requires Docker, works on Linux and macOS):

      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
      

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

  • In order to develop distributed-dataset, you can use:
    • On Linux: Nix, cabal-install or stack.
    • On macOS: stack with Docker.
  • Use ormolu to format source code.

Nix

  • You can use my binary cache on Cachix so that you don't recompile half of Hackage.
  • nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
  • The easiest way to get a development environment is to run sos at the top-level directory inside a nix-shell.

Stack

  • Make sure that you have Docker installed.
  • Use stack as usual; it will automatically use a Docker image.
  • Run ./make.sh stack-build before you send a PR to test different resolvers.

Related Work

Papers

Projects
