
License: Apache-2.0
MaRe leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.

Programming language: Scala

Projects that are alternatives to or similar to MaRe

Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (+545.45%)
Mutual labels:  spark, mapreduce
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, together with my own answer summaries. Currently covers interview questions for the Hadoop, Hive, Spark, Flink, HBase, Kafka, and ZooKeeper frameworks.
Stars: ✭ 857 (+7690.91%)
Mutual labels:  spark, mapreduce
Repository
A personal learning knowledge base covering data-warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (+736.36%)
Mutual labels:  spark, mapreduce
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+8527.27%)
Mutual labels:  spark, mapreduce
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+200336.36%)
Mutual labels:  spark, mapreduce
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+99818.18%)
Mutual labels:  spark, mapreduce
Yandex Big Data Engineering
Stars: ✭ 17 (+54.55%)
Mutual labels:  spark, mapreduce
Nd4j
Fast, Scientific and Numerical Computing for the JVM (NDArrays)
Stars: ✭ 1,742 (+15736.36%)
Mutual labels:  spark, scientific-computing
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (+209.09%)
Mutual labels:  spark, mapreduce
Dpark
Python clone of Spark, a MapReduce alike framework in Python
Stars: ✭ 2,668 (+24154.55%)
Mutual labels:  spark, mapreduce
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (+4527.27%)
Mutual labels:  spark, mapreduce
Bdp Dataplatform
A big-data ecosystem solution data platform: a big-data solution built around big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, and container deployment platforms, covering data-platform ingestion, storage, computation, development, and application building.
Stars: ✭ 456 (+4045.45%)
Mutual labels:  spark, mapreduce
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+8345.45%)
Mutual labels:  spark, mapreduce
Spark Scala Tutorial
A free tutorial for Apache Spark.
Stars: ✭ 907 (+8145.45%)
Mutual labels:  spark
Coursera Uw Machine Learning Clustering Retrieval
Stars: ✭ 25 (+127.27%)
Mutual labels:  mapreduce
Mathext
mathext implements basic elementary functions not included in the Go standard library [DEPRECATED]
Stars: ✭ 18 (+63.64%)
Mutual labels:  scientific-computing
Edge
Extreme-scale Discontinuous Galerkin Environment (EDGE)
Stars: ✭ 18 (+63.64%)
Mutual labels:  scientific-computing
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+7600%)
Mutual labels:  spark
Spark Swagger
Spark (http://sparkjava.com/) support for Swagger (https://swagger.io/)
Stars: ✭ 25 (+127.27%)
Mutual labels:  spark
Gush
Fast and distributed workflow runner using ActiveJob and Redis
Stars: ✭ 894 (+8027.27%)
Mutual labels:  parallelization

MaRe 🐳

Italian, pronounced: /ˈmare/. Noun: Sea.


MaRe (formerly EasyMapReduce) leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.

A 20-minute introduction video is available on YouTube.


What is MaRe

MaRe has been developed with scientific applications in mind. High-throughput methods have produced massive datasets over the past decades, and frameworks like Spark and Hadoop are a natural choice for enabling high-throughput analysis. In scientific applications, many tools are highly optimized to model or detect phenomena that occur in a certain system. Hence, the effort of reimplementing scientific tools in Spark or Hadoop often cannot be sustained by research groups. MaRe aims to provide the means to run existing serial tools in MapReduce fashion. Since many of the available scientific tools are trivially parallelizable, MapReduce is an excellent paradigm for parallelizing the computation.

Scientific tools often have many dependencies and, generally speaking, it is difficult for a system administrator to maintain multiple versions of software installed on each node of a cluster. Therefore, instead of running commands directly on the compute nodes, MaRe starts a user-provided Docker image that wraps a specific tool and all of its dependencies, and it runs the command inside the Docker container. The data flows from Spark through the Docker container, and back to Spark after being processed, via Unix files. If the TMPDIR environment variable on the worker nodes points to a tmpfs, very little overhead should occur.
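Whether the scratch space is actually memory-backed can be verified on a worker node. A minimal sketch, assuming GNU df is available (busybox df lacks the --output flag):

```shell
# Report the filesystem type backing the scratch directory used for data exchange.
scratch="${TMPDIR:-/tmp}"                            # Spark honors TMPDIR on the workers
fstype=$(df --output=fstype "$scratch" | tail -n 1)  # e.g. "tmpfs" or "ext4"
echo "$scratch is backed by: $fstype"                # "tmpfs" means in-memory, minimal I/O overhead
```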

Example: DNA GC count

DNA can be represented as a string written in a language of 4 characters: a, t, g, c. Counting how many times g and c occur in a genome is a task that is often performed in genomics. In this example we use MaRe to perform this task in parallel with POSIX commands.

import se.uu.it.mare._ // brings MaRe and the mount-point types into scope (package name assumed from the Maven coordinates)

val rdd = sc.textFile("genome.txt")
val res = new MaRe(rdd)
  .map(
    inputMountPoint = TextFile("/dna"),
    outputMountPoint = TextFile("/count"),
    imageName = "busybox:1",
    command = "grep -o '[gc]' /dna | wc -l > /count")
  .reduce(
    inputMountPoint = TextFile("/counts"),
    outputMountPoint = TextFile("/sum"),
    imageName = "busybox:1",
    command = "awk '{s+=$1} END {print s}' /counts > /sum")
  .rdd.collect()(0)
println(s"The GC count is: $res")
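For reference, on a single machine the same computation collapses into one serial POSIX pipeline. The snippet below uses a hypothetical toy genome.txt to illustrate what the map (grep | wc) and reduce (awk sum) stages compute:

```shell
# Toy input (hypothetical file name): three short DNA fragments.
printf 'atgc\nggcc\nattt\n' > genome.txt
# grep -o emits one line per g/c match; wc -l counts them.
grep -o '[gc]' genome.txt | wc -l   # prints 6 for the toy input
```

MaRe distributes exactly this work: each partition runs the grep/wc stage in its own container, and the per-partition counts are summed by the awk stage.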

In the previous example we work with a single text file (genome.txt), which is split line by line and partitioned across the executors. MaRe also supports working with multiple text or binary files. The following example performs the GC count over a set of gzipped DNA strings.

val rdd = sc.binaryFiles("genome_*.gz")
  .map { case (path, data) => (path, data.toArray) }
val res = new MaRe(rdd)
  .map(
    inputMountPoint = BinaryFiles("/zipped"),
    outputMountPoint = WholeTextFiles("/counts"),
    imageName = "busybox:1",
    command =
      """
      for filename in /zipped/*; do
        out=$(basename "$filename" .gz)
        gunzip -c "$filename" | grep -o '[gc]' | wc -l > /counts/${out}.sum
      done
      """)
  .reduce(
    inputMountPoint = WholeTextFiles("/counts"),
    outputMountPoint = WholeTextFiles("/sum"),
    imageName = "busybox:1",
    command = "awk '{s+=$1} END {print s}' /counts/*.sum > /sum/${RANDOM}.sum")
  .rdd.collect()(0)
println(s"The GC count is: $res")
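To try the multi-file variant locally, a few gzipped inputs matching the genome_*.gz glob can be generated as follows (file names and contents are illustrative):

```shell
# Create three gzipped DNA fragments for sc.binaryFiles("genome_*.gz") to pick up.
for i in 1 2 3; do
  printf 'atgcggcc\n' | gzip > "genome_${i}.gz"
done
gunzip -c genome_1.gz   # prints atgcggcc
```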

Getting started

MaRe comes as a Scala library that you can use in your Spark applications. Please keep in mind that when submitting MaRe applications, Docker needs to be installed and properly configured on each worker node of your Spark cluster. Also, the user that runs the Spark job needs to be in the Docker group.
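A quick sanity check for a worker node might look like the following; this is a sketch only, since the exact setup depends on your cluster:

```shell
# Check that the docker binary is on the PATH.
if command -v docker >/dev/null 2>&1; then
  echo "docker binary found"
else
  echo "docker is not installed" >&2
fi
# Check that the current user belongs to the docker group.
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "current user is in the docker group"
else
  echo "current user is NOT in the docker group" >&2
fi
```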

Get MaRe

MaRe is packaged and distributed with Maven; all you have to do is add its dependency to your pom.xml file:

<dependencies>
  ...
  <dependency>
    <groupId>se.uu.it</groupId>
    <artifactId>mare</artifactId>
    <version>0.3.0</version>
  </dependency>
  ...
</dependencies>

Documentation

API documentation is available here: https://mcapuccini.github.io/MaRe/scaladocs/.
