snowplow / Spark Example Project

Licence: apache-2.0
A Spark WordCountJob example as a standalone SBT project with Specs2 tests, runnable on Amazon EMR

Spark Example Project

Introduction

This is a simple word count job written in Scala for the Spark cluster computing platform, with instructions for running it on Amazon Elastic MapReduce (EMR) in non-interactive mode. The code is ported directly from Twitter's WordCountJob for Scalding.

This was built by the Data Science team at Snowplow Analytics, who use Spark on their Data pipelines and algorithms projects.

See also: Spark Streaming Example Project | Scalding Example Project

Building

Assuming git, Vagrant and VirtualBox are installed:

 host> git clone https://github.com/snowplow/spark-example-project
 host> cd spark-example-project
 host> vagrant up && vagrant ssh
guest> cd /vagrant
guest> sbt assembly

The 'fat jar' is now available as:

target/spark-example-project-0.4.0.jar

Unit testing

The assembly command above also runs the test suite, but you can run the tests on their own with:

$ sbt test
<snip>
[info] + A WordCount job should
[info]   + count words correctly
[info] Passed: : Total 1, Failed 0, Errors 0, Passed 1, Skipped 0

Running on Amazon EMR

Prepare

Create:

  1. An AWS CLI profile, e.g. spark
  2. An Amazon S3 bucket, e.g. spark-example-project-your-name
  3. An EC2 keypair, e.g. spark-ec2-keypair
  4. A VPC public subnet, e.g. subnet-3dc2bd2a

Make sure you have assembled the jar file (see "Building" above).

Upload and run

guest> inv upload spark spark-example-project-your-name
guest> inv run_emr spark spark-example-project-your-name spark-ec2-keypair subnet-3dc2bd2a

You can now monitor the running EMR jobflow in the AWS Elastic MapReduce UI.

Inspect

Once the job has completed, you should see a folder structure like this in your output bucket:

 results
 |
 +- _SUCCESS
 +- part-00000
 +- part-00001

Download the files and check that one file contains:

(hello,1)
(world,2)

while another file contains:

(goodbye,1)
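
To make the expected results concrete, here is a minimal sketch of the word count logic using plain Scala collections rather than Spark, so it runs without any cluster or Spark dependency. The object name `WordCountSketch`, the helper `countWords` and the two input lines are assumptions chosen to reproduce the counts listed above; they are not taken from this project's source. A Spark job would typically express the same steps with RDD operations such as `flatMap` and `reduceByKey`, and the project's actual implementation may differ.

```scala
// Hypothetical sketch: plain Scala collections stand in for Spark RDDs.
object WordCountSketch {

  /** Split each line on whitespace and count the occurrences of each word. */
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // one element per word
      .filter(_.nonEmpty)         // drop empty tokens caused by leading whitespace
      .groupBy(identity)          // word -> every occurrence of that word
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Assumed input, chosen to match the results listed under "Inspect"
    val lines = Seq("hello world", "goodbye world")
    // Sort by word so the printed order is deterministic
    countWords(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a cluster the pairs are spread across several part files because each partition writes its own output, which is why the tuples above arrive in more than one file.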

Running on your own Spark cluster

If you have successfully run this on your own Spark cluster, we would welcome a pull request updating the instructions in this section.

Next steps

Fork this project and adapt it into your own custom Spark job.

To invoke/schedule your Spark job on EMR, check out:

Roadmap

  • Change output from tuples to TSV (#2)
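
As an illustration of what the TSV roadmap item could look like, the sketch below renders (word, count) pairs as tab-separated lines instead of tuple strings. The object `TsvSketch` and the helper `toTsv` are hypothetical names, and this is only one possible approach, not the change tracked in the issue.

```scala
// Hypothetical sketch of tuple-to-TSV formatting; not the project's implementation.
object TsvSketch {

  /** Render one (word, count) pair as a single tab-separated line. */
  def toTsv(pair: (String, Int)): String = s"${pair._1}\t${pair._2}"

  def main(args: Array[String]): Unit = {
    // Sample pairs taken from the results shown under "Inspect"
    val counts = Seq(("hello", 1), ("world", 2), ("goodbye", 1))
    counts.map(toTsv).foreach(println)
  }
}
```

In a Spark job, a function like this could be applied to each pair before the results are written out, so that each output line reads `word<TAB>count` rather than `(word,count)`.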

Copyright and license

Copyright 2013-2015 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.