
ondra-m / Ruby Spark

License: MIT
Ruby wrapper for Apache Spark

Programming Languages

ruby
36898 projects - #4 most used programming language

Projects that are alternatives of or similar to Ruby Spark

Sparklyr
R interface for Apache Spark
Stars: ✭ 775 (+250.68%)
Mutual labels:  spark, distributed
Ytk Learn
Ytk-learn is a distributed machine learning library which implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Stars: ✭ 337 (+52.49%)
Mutual labels:  spark, distributed
Js Spark
Realtime calculation distributed system. AKA distributed lodash
Stars: ✭ 187 (-15.38%)
Mutual labels:  spark, distributed
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+2459.28%)
Mutual labels:  spark, distributed
Distributed Dataset
A distributed data processing framework in Haskell.
Stars: ✭ 108 (-51.13%)
Mutual labels:  spark, distributed
Xlearning Xdml
extremely distributed machine learning
Stars: ✭ 113 (-48.87%)
Mutual labels:  spark, distributed
Ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+928.96%)
Mutual labels:  spark, distributed
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+1211.76%)
Mutual labels:  spark
Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+1076.92%)
Mutual labels:  distributed
Voik
♒︎ [WIP] An experimental ~distributed~ commit-log
Stars: ✭ 200 (-9.5%)
Mutual labels:  distributed
Atomnas
Code for ICLR 2020 paper 'AtomNAS: Fine-Grained End-to-End Neural Architecture Search'
Stars: ✭ 197 (-10.86%)
Mutual labels:  distributed
Improved Body Parts
Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation
Stars: ✭ 202 (-8.6%)
Mutual labels:  distributed
Pottery
Redis for humans. 🌎🌍🌏
Stars: ✭ 204 (-7.69%)
Mutual labels:  distributed
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (-9.5%)
Mutual labels:  spark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-2.26%)
Mutual labels:  spark
Cookim
Distributed web chat application based on WebSocket, built on Akka.
Stars: ✭ 198 (-10.41%)
Mutual labels:  distributed
Vernemq
A distributed MQTT message broker based on Erlang/OTP. Built for high quality & Industrial use cases.
Stars: ✭ 2,628 (+1089.14%)
Mutual labels:  distributed
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-2.71%)
Mutual labels:  spark
Example Spark
Spark, Spark Streaming and Spark SQL unit testing strategies
Stars: ✭ 205 (-7.24%)
Mutual labels:  spark
Spark Knn
k-Nearest Neighbors algorithm on Spark
Stars: ✭ 205 (-7.24%)
Mutual labels:  spark

Ruby-Spark

Apache Spark™ is a fast and general engine for large-scale data processing.

This gem allows you to use Spark functionality from Ruby.

Word count in Spark's Ruby API

file = spark.text_file("hdfs://...")

file.flat_map(:split)
    .map(lambda{|word| [word, 1]})
    .reduce_by_key(lambda{|a, b| a+b})

Installation

Requirements

  • Java 7+
  • Ruby 2+
  • wget or curl
  • MRI or JRuby

Add this line to your application's Gemfile:

gem 'ruby-spark'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ruby-spark

Run rake compile if you are using the gem from the local filesystem.

Build Apache Spark

This command will download Spark and build extensions for this gem (SBT is used for compiling). For more information, check the wiki. Jars will be stored in your HOME directory.

$ ruby-spark build

Usage

You can use Ruby Spark via the interactive shell (Pry is used):

$ ruby-spark shell

Or use it in an existing project.

If you want to configure Spark first, see configurations for more details.

require 'ruby-spark'

# Configuration
Spark.config do
   set_app_name "RubySpark"
   set 'spark.ruby.serializer', 'oj'
   set 'spark.ruby.serializer.batch_size', 100
end

# Start Apache Spark
Spark.start

# Context reference
Spark.sc

Finally, stop the cluster. In the shell, Spark is stopped automatically when the environment exits.

Spark.stop

After the first use, a global configuration file is created at ~/.ruby-spark.conf. Properties for Spark and RubySpark can be specified there.
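
A hypothetical sketch of such a file (the key-value layout is an assumption; only the spark.ruby.* property names appear in this README):

# ~/.ruby-spark.conf (illustrative; exact format assumed)
spark.app.name                    RubySpark
spark.ruby.serializer             oj
spark.ruby.serializer.batch_size  100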

Creating an RDD (a new collection)

Single text file:

rdd = sc.text_file(FILE, workers_num, serializer=nil)

All files in a directory:

rdd = sc.whole_text_files(DIRECTORY, workers_num, serializer=nil)

Directly uploading structures from Ruby:

rdd = sc.parallelize([1,2,3,4,5], workers_num, serializer=nil)
rdd = sc.parallelize(1..5, workers_num, serializer=nil)

There are two conditions:

  1. the chosen serializer must be able to serialize the data
  2. the data must be iterable

If you do not specify a serializer, the default is used (defined by the spark.ruby.serializer.* options). Check this if you want to create a custom serializer.
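
For example, reusing the serializer syntax from the Examples section below, a serializer can be built explicitly and passed in:

ser = Spark::Serializer.build('batched(marshal, 10)')
rdd = Spark.sc.parallelize(1..5, 2, ser)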

Operations

All operations can be divided into two groups (a short sketch follows this list):

  • Transformations: append a new operation to the current RDD and return a new RDD
  • Actions: add an operation and start the calculation
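
A minimal sketch of the difference, using methods documented below (the printed result is illustrative):

rdd = Spark.sc.parallelize(0..10, 2)

# Transformations only describe the computation; nothing runs yet
doubled = rdd.map(lambda{|x| x * 2})
selected = doubled.filter(lambda{|x| x % 4 == 0})

# An action starts the actual calculation
selected.collect
# => [0, 4, 8, 12, 16, 20]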

More information:

You can also check the official Spark documentation. First, make sure that the method is implemented here.

Transformations

rdd.map(function)
Return a new RDD by applying a function to all elements of this RDD.
rdd.flat_map(function)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
rdd.map_partitions(function)
Return a new RDD by applying a function to each partition of this RDD.
rdd.filter(function)
Return a new RDD containing only the elements that satisfy a predicate.
rdd.cartesian(other)
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements `(a, b)` where `a` is in `self` and `b` is in `other`.
rdd.intersection(other)
Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
rdd.sample(with_replacement, fraction, seed)
Return a sampled subset of this RDD. Sampling is based on Poisson and uniform distributions.
rdd.group_by_key(num_partitions)
Group the values for each key in the RDD into a single sequence.
...many more...
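
A hedged sketch of two transformations from the list above (outputs are illustrative; element order may differ):

rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize(['a', 'b'])
rdd1.cartesian(rdd2).collect
# => e.g. [[1, "a"], [1, "b"], [2, "a"], [2, "b"]]

pairs = sc.parallelize([['cat', 1], ['dog', 2], ['cat', 3]])
pairs.group_by_key(1).collect
# => e.g. [["cat", [1, 3]], ["dog", [2]]]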

Actions

rdd.take(count)
Take the first count elements of the RDD.
rdd.reduce(function)
Reduces the elements of this RDD using the specified lambda or method.
rdd.aggregate(zero_value, seq_op, comb_op)
Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral “zero value”.
rdd.histogram(buckets)
Compute a histogram using the provided buckets.
rdd.collect
Return an array that contains all of the elements in this RDD.
...many more...

Examples

Basic methods
# Every batch will be serialized by Marshal and will have size 10
ser = Spark::Serializer.build('batched(marshal, 10)')

# Range 0..100, 2 workers, custom serializer
rdd = Spark.sc.parallelize(0..100, 2, ser)


# Take first 5 items
rdd.take(5)
# => [0, 1, 2, 3, 4]


# Reducing numbers
rdd.reduce(lambda{|sum, x| sum+x})
rdd.reduce(:+)
rdd.sum
# => 5050


# Aggregating with a zero value
seq = lambda{|x,y| x+y}
com = lambda{|x,y| x*y}
rdd.aggregate(1, seq, com)
# 1. Each worker adds its numbers (plus the zero value)
#    => [1226, 3826]
# 2. Results are multiplied
#    => 4690676


# Statistics
rdd.stats
# => StatCounter: (count, mean, max, min, variance,
#                  sample_variance, stdev, sample_stdev)


# Compute a histogram using the provided buckets.
rdd.histogram(2)
# => [[0.0, 50.0, 100], [50, 51]]


# Mapping
rdd.map(lambda {|x| x*2}).collect
# => [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, ...]
rdd.map(:to_f).collect
# => [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...]


# Mapping over whole partitions
rdd.map_partitions(lambda{|part| part.reduce(:+)}).collect
# => [1225, 3825]


# Selecting
rdd.filter(lambda{|x| x.even?}).collect
# => [0, 2, 4, 6, 8, 10, 12, 14, 16, ...]


# Sampling
rdd.sample(true, 10).collect
# => [3, 36, 40, 54, 58, 82, 86, 95, 98]


# Sampling X items
rdd.take_sample(true, 10)
# => [53, 87, 71, 74, 18, 75, 55, 94, 46, 32]


# Using an external process
rdd.pipe('cat', "awk '{print $1*10}'")
# => ["0", "10", "20", "30", "40", "50", ...]
Word count using methods
# Content:
# "first line"
# "second line"
rdd = sc.text_file(PATH)

# ["first", "line", "second", "line"]
rdd = rdd.flat_map(lambda{|line| line.split})

# [["first", 1], ["line", 1], ["second", 1], ["line", 1]]
rdd = rdd.map(lambda{|word| [word, 1]})

# [["first", 1], ["line", 2], ["second", 1]]
rdd = rdd.reduce_by_key(lambda{|a, b| a+b})

# {"first"=>1, "line"=>2, "second"=>1}
rdd.collect_as_hash
Estimating PI with a custom serializer
slices = 3
n = 100000 * slices

def map(_)
  x = rand * 2 - 1
  y = rand * 2 - 1

  if x**2 + y**2 < 1
    return 1
  else
    return 0
  end
end

rdd = Spark.context.parallelize(1..n, slices, serializer: 'oj')
rdd = rdd.map(method(:map))

puts 'Pi is roughly %f' % (4.0 * rdd.sum / n)
Estimating PI
rdd = sc.parallelize([10_000], 1)
rdd = rdd.add_library('bigdecimal/math')
rdd = rdd.map(lambda{|x| BigMath.PI(x)})
rdd.collect # => #<BigDecimal, '0.31415926...'>

Mllib (Machine Learning Library)

Mllib functions use Spark's Machine Learning Library. Ruby objects are serialized and deserialized in Java, so you cannot use custom classes. Only primitive types such as strings or integers are supported.

All supported methods/models:

Linear regression
# Import Mllib classes into Object
# Otherwise they are accessible via Spark::Mllib::LinearRegressionWithSGD
Spark::Mllib.import(Object)

# Training data
data = [
  LabeledPoint.new(0.0, [0.0]),
  LabeledPoint.new(1.0, [1.0]),
  LabeledPoint.new(3.0, [2.0]),
  LabeledPoint.new(2.0, [3.0])
]

# Train a model
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initial_weights: [1.0])

lrm.predict([0.0])
K-Means
Spark::Mllib.import

# Dense vectors
data = [
  DenseVector.new([0.0,0.0]),
  DenseVector.new([1.0,1.0]),
  DenseVector.new([9.0,8.0]),
  DenseVector.new([8.0,9.0])
]

model = KMeans.train(sc.parallelize(data), 2)

model.predict([0.0, 0.0]) == model.predict([1.0, 1.0])
# => true
model.predict([8.0, 9.0]) == model.predict([9.0, 8.0])
# => true

Benchmarks
