
License: Apache-2.0
MaRe leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.

Programming language: Scala

Projects that are alternatives to or similar to MaRe

Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (+545.45%)
Mutual labels:  spark, mapreduce
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, together with my own answer summaries. Currently covers interview questions for the Hadoop, Hive, Spark, Flink, HBase, Kafka, and ZooKeeper frameworks.
Stars: ✭ 857 (+7690.91%)
Mutual labels:  spark, mapreduce
Repository
A personal learning knowledge base covering data-warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (+736.36%)
Mutual labels:  spark, mapreduce
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+8527.27%)
Mutual labels:  spark, mapreduce
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+200336.36%)
Mutual labels:  spark, mapreduce
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+99818.18%)
Mutual labels:  spark, mapreduce
Yandex Big Data Engineering
Stars: ✭ 17 (+54.55%)
Mutual labels:  spark, mapreduce
Nd4j
Fast, Scientific and Numerical Computing for the JVM (NDArrays)
Stars: ✭ 1,742 (+15736.36%)
Mutual labels:  spark, scientific-computing
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (+209.09%)
Mutual labels:  spark, mapreduce
Dpark
Python clone of Spark, a MapReduce alike framework in Python
Stars: ✭ 2,668 (+24154.55%)
Mutual labels:  spark, mapreduce
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (+4527.27%)
Mutual labels:  spark, mapreduce
Bdp Dataplatform
A big-data ecosystem solution data platform: a big-data solution built around big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, and container deployment platforms, covering data-platform ingestion, storage, computation, development, and application building.
Stars: ✭ 456 (+4045.45%)
Mutual labels:  spark, mapreduce
Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (+8345.45%)
Mutual labels:  spark, mapreduce
Spark Scala Tutorial
A free tutorial for Apache Spark.
Stars: ✭ 907 (+8145.45%)
Mutual labels:  spark
Coursera Uw Machine Learning Clustering Retrieval
Stars: ✭ 25 (+127.27%)
Mutual labels:  mapreduce
Mathext
mathext implements basic elementary functions not included in the Go standard library [DEPRECATED]
Stars: ✭ 18 (+63.64%)
Mutual labels:  scientific-computing
Edge
Extreme-scale Discontinuous Galerkin Environment (EDGE)
Stars: ✭ 18 (+63.64%)
Mutual labels:  scientific-computing
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+7600%)
Mutual labels:  spark
Spark Swagger
Spark (http://sparkjava.com/) support for Swagger (https://swagger.io/)
Stars: ✭ 25 (+127.27%)
Mutual labels:  spark
Gush
Fast and distributed workflow runner using ActiveJob and Redis
Stars: ✭ 894 (+8027.27%)
Mutual labels:  parallelization

MaRe 🐳

Italian, pronounced: /ˈmare/. Noun: Sea.


MaRe (formerly EasyMapReduce) leverages the power of Docker and Spark to run and scale your serial tools in MapReduce fashion.

A 20-minute introduction video is available on YouTube.


What is MaRe

MaRe has been developed with scientific applications in mind. High-throughput methods have produced massive datasets over the past decades, and frameworks like Spark and Hadoop are a natural choice for enabling high-throughput analysis. In scientific applications, many tools are highly optimized to model or detect phenomena that occur in a certain system. Hence, the effort of reimplementing scientific tools in Spark or Hadoop often cannot be sustained by research groups. MaRe aims to provide the means to run existing serial tools in MapReduce fashion. Since many of the available scientific tools are trivially parallelizable, MapReduce is an excellent paradigm for parallelizing the computation.

Scientific tools often have many dependencies and, generally speaking, it is difficult for a system administrator to maintain multiple versions of software installed on each node of a cluster. Therefore, instead of running commands directly on the compute nodes, MaRe starts a user-provided Docker image that wraps a specific tool and all of its dependencies, and it runs the command inside the Docker container. The data flows from Spark through the Docker container, and back to Spark after being processed, via Unix files. If the TMPDIR environment variable on the worker nodes points to a tmpfs, very little overhead should occur.
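Whether the scratch space is actually memory-backed can be verified on a worker node. A minimal sketch, assuming GNU df is available (busybox df lacks the --output flag):

```shell
# Report the filesystem type backing the scratch directory used for data exchange.
scratch="${TMPDIR:-/tmp}"                            # Spark honors TMPDIR on the workers
fstype=$(df --output=fstype "$scratch" | tail -n 1)  # e.g. "tmpfs" or "ext4"
echo "$scratch is backed by: $fstype"                # "tmpfs" means in-memory, minimal I/O overhead
```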

Example: DNA GC count

DNA can be represented as a string written in a language of 4 characters: a, t, g, c. Counting how many times g and c occur in a genome is a task that is often performed in genomics. In this example we use MaRe to perform this task in parallel with POSIX commands.

import se.uu.it.mare._ // brings MaRe and the mount-point types into scope (package name assumed from the Maven coordinates)

val rdd = sc.textFile("genome.txt")
val res = new MaRe(rdd)
  .map(
    inputMountPoint = TextFile("/dna"),
    outputMountPoint = TextFile("/count"),
    imageName = "busybox:1",
    command = "grep -o '[gc]' /dna | wc -l > /count")
  .reduce(
    inputMountPoint = TextFile("/counts"),
    outputMountPoint = TextFile("/sum"),
    imageName = "busybox:1",
    command = "awk '{s+=$1} END {print s}' /counts > /sum")
  .rdd.collect()(0)
println(s"The GC count is: $res")
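For reference, on a single machine the same computation collapses into one serial POSIX pipeline. The snippet below uses a hypothetical toy genome.txt to illustrate what the map (grep | wc) and reduce (awk sum) stages compute:

```shell
# Toy input (hypothetical file name): three short DNA fragments.
printf 'atgc\nggcc\nattt\n' > genome.txt
# grep -o emits one line per g/c match; wc -l counts them.
grep -o '[gc]' genome.txt | wc -l   # prints 6 for the toy input
```

MaRe distributes exactly this work: each partition runs the grep/wc stage in its own container, and the per-partition counts are summed by the awk stage.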

In the previous example we work with a single text file (genome.txt), which is split line by line and partitioned across the executors. MaRe also supports working with multiple text or binary files. The following example performs the GC count over a set of gzipped DNA strings.

val rdd = sc.binaryFiles("genome_*.gz")
  .map { case (path, data) => (path, data.toArray) }
val res = new MaRe(rdd)
  .map(
    inputMountPoint = BinaryFiles("/zipped"),
    outputMountPoint = WholeTextFiles("/counts"),
    imageName = "busybox:1",
    command =
      """
      for filename in /zipped/*; do
        out=$(basename "$filename" .gz)
        gunzip -c "$filename" | grep -o '[gc]' | wc -l > /counts/${out}.sum
      done
      """)
  .reduce(
    inputMountPoint = WholeTextFiles("/counts"),
    outputMountPoint = WholeTextFiles("/sum"),
    imageName = "busybox:1",
    command = "awk '{s+=$1} END {print s}' /counts/*.sum > /sum/${RANDOM}.sum")
  .rdd.collect()(0)
println(s"The GC count is: $res")
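To try the multi-file variant locally, a few gzipped inputs matching the genome_*.gz glob can be generated as follows (file names and contents are illustrative):

```shell
# Create three gzipped DNA fragments for sc.binaryFiles("genome_*.gz") to pick up.
for i in 1 2 3; do
  printf 'atgcggcc\n' | gzip > "genome_${i}.gz"
done
gunzip -c genome_1.gz   # prints atgcggcc
```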

Getting started

MaRe comes as a Scala library that you can use in your Spark applications. Please keep in mind that when submitting MaRe applications, Docker needs to be installed and properly configured on each worker node of your Spark cluster. Also, the user that runs the Spark job needs to be in the Docker group.
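A quick sanity check for a worker node might look like the following; this is a sketch only, since the exact setup depends on your cluster:

```shell
# Check that the docker binary is on the PATH.
if command -v docker >/dev/null 2>&1; then
  echo "docker binary found"
else
  echo "docker is not installed" >&2
fi
# Check that the current user belongs to the docker group.
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "current user is in the docker group"
else
  echo "current user is NOT in the docker group" >&2
fi
```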

Get MaRe

MaRe is packaged and distributed with Maven; all you have to do is add its dependency to your pom.xml file:

<dependencies>
  ...
  <dependency>
    <groupId>se.uu.it</groupId>
    <artifactId>mare</artifactId>
    <version>0.3.0</version>
  </dependency>
  ...
</dependencies>

Documentation

API documentation is available here: https://mcapuccini.github.io/MaRe/scaladocs/.
