deanwampler / Spark Scala Tutorial

Licence: other
A free tutorial for Apache Spark.

Apache Spark Scala Tutorial - README

Join the chat at https://gitter.im/deanwampler/spark-scala-tutorial

Dean Wampler
@deanwampler

This tutorial demonstrates how to write and run Apache Spark applications using Scala with some SQL. I also teach a little Scala as we go, but if you already know Spark and you are more interested in learning just enough Scala for Spark programming, see my other tutorial Just Enough Scala for Spark.

You can run the examples and exercises several ways:

  1. Notebooks, like Jupyter - The easiest way, especially for data scientists accustomed to notebooks.
  2. In an IDE, like IntelliJ - Familiar for developers.
  3. At the terminal prompt using the build tool SBT.

Notes:

  1. The current version of Spark used is 2.3.X, which is a bit old. (TODO!)
  2. While the notebook approach is the easiest way to use this tutorial to learn Spark, the IDE and SBT options show details for creating Spark applications, i.e., writing executable programs you build and run, as well as examples that use the interactive Spark Shell.

Acknowledgments

I'm grateful that several people have provided feedback, issue reports, and pull requests.

Getting Help

Before I describe the different ways to work with the tutorial, a note on getting help. If you're having problems, use the Gitter chat room to ask for help. You can also use the new Discussions feature for the GitHub repo. If you're reasonably certain you've found a bug, post an issue to the GitHub repo. Pull requests are welcome, too!

Setup Instructions

Let's get started...

Download the Tutorial

Begin by cloning or downloading the tutorial GitHub project github.com/deanwampler/spark-scala-tutorial.
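
If you choose to clone, the step can be sketched as follows. The function name clone_tutorial is hypothetical, and the URL is the standard GitHub clone form of the project address above.

```shell
# Hypothetical helper: clone the tutorial into the directory given as $1
# (default: spark-scala-tutorial, the name git would pick anyway).
clone_tutorial() {
  git clone https://github.com/deanwampler/spark-scala-tutorial.git "${1:-spark-scala-tutorial}"
}
```

Calling clone_tutorial with no argument and then running cd spark-scala-tutorial puts you in the project root, which is where the later steps assume you are.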

Now pick the way you want to work through the tutorial:

  1. Notebooks - See "Using Notebooks"
  2. In an IDE, like IntelliJ - See "Use an IDE"
  3. At the terminal prompt using SBT - See "Use SBT in a Terminal"

Using Notebooks

The easiest way to work with this tutorial is to use a Docker image that combines the popular Jupyter notebook environment with all the tools you need to run Spark, including the Scala language. It's called the all-spark-notebook. It bundles Apache Toree to provide Spark and Scala access. The webpage for this Docker image discusses useful information like using Python as well as Scala, user authentication topics, running your Spark jobs on clusters, rather than local mode, etc.

There are other notebook options you might investigate for your needs:

Open source:

  • Polynote - A cross-language notebook environment with built-in Scala support. Developed by Netflix.
  • Jupyter + BeakerX - a powerful set of extensions for Jupyter.
  • Zeppelin - a popular tool in big data environments

Commercial:

  • Databricks - a feature-rich, commercial, cloud-based service

Installing Docker and the Jupyter Image

If you need to install Docker, follow the installation instructions at docker.com (the community edition is sufficient).

Now we'll run the Docker image. It's important to follow the next steps carefully. We're going to mount two local directories inside the running container, one for the data we want to use and one for the notebooks.

  • Open a terminal or command window
  • Change to the directory where you expanded the tutorial project or cloned the repo
  • To download and run the Docker image, run the following command: run.sh (MacOS and Linux) or run.bat (Windows)
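
The run scripts mount the data and notebooks directories from your current working directory, so they only work when started from the tutorial's root. A minimal sketch of a check you could run first follows; the function name check_tutorial_dir is hypothetical and not part of the tutorial's scripts.

```shell
# Hypothetical helper (not part of the tutorial): confirm that the directory
# given as $1 (default: current directory) contains the two folders that
# get mounted into the container.
check_tutorial_dir() {
  dir="${1:-.}"
  for sub in data notebooks; do
    if [ ! -d "$dir/$sub" ]; then
      echo "Missing '$sub' directory in $dir; change to the tutorial root first." >&2
      return 1
    fi
  done
  echo "OK: found data and notebooks in $dir"
}
```

Run check_tutorial_dir with no argument from the directory where you plan to start the container. A non-zero exit status means one of the two folders is missing.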

The run.sh script for MacOS and Linux executes this command:

docker run -it --rm \
  -p 8888:8888 -p 4040:4040 \
  --cpus=2.0 --memory=2000M \
  -v "$PWD/data":/home/jovyan/data \
  -v "$PWD/notebooks":/home/jovyan/notebooks \
  "$@" \
  jupyter/all-spark-notebook

The Windows run.bat command is similar, but uses Windows conventions.

The --cpus=... --memory=... arguments were added because the notebook "kernel" is prone to crashing with the default values. Edit to taste. Also, it will help to keep only one notebook (other than the Introduction) open at a time.
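
One way to make "edit to taste" repeatable is to read the limits from environment variables instead of editing the script each time. The sketch below only prints the two flags. The variable names TUTORIAL_CPUS and TUTORIAL_MEMORY and the function name are assumptions for illustration; the tutorial's scripts do not read them.

```shell
# Hypothetical helper: print the resource flags for docker run, taking the
# limits from TUTORIAL_CPUS / TUTORIAL_MEMORY (assumed names) and falling
# back to the values the tutorial uses.
resource_flags() {
  echo "--cpus=${TUTORIAL_CPUS:-2.0} --memory=${TUTORIAL_MEMORY:-2000M}"
}
```

For example, TUTORIAL_CPUS=4 TUTORIAL_MEMORY=4g resource_flags prints --cpus=4 --memory=4g, which you could substitute for the corresponding line of the docker run command.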

Each -v PATH:/home/jovyan/dir argument tells Docker to mount the local directory PATH (here, a dir directory under your current working directory) so that it's available as /home/jovyan/dir inside the container. This is essential to provide access to the tutorial data and notebooks. When you open the notebook UI (discussed shortly), you'll see these folders listed.

Note: On Windows, you may get the following error: "C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: D: drive is not shared. Please share it in Docker for Windows Settings." If so, do the following. In your system tray, next to the clock, right-click on Docker, then click on Settings. You'll see the Shared Drives. Mark your drive and click Apply. See this Docker forum thread for more tips.

The -p 8888:8888 -p 4040:4040 arguments tell Docker to "tunnel" ports 8888 and 4040 out of the container to your local environment, so you can get to the Jupyter UI at port 8888 and the Spark driver UI at 4040.
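
Each -p argument has the form HOST_PORT:CONTAINER_PORT, so if something else on your machine already uses 8888 or 4040, you can change the left-hand number and leave the right-hand one alone. A small sketch, with a hypothetical function name, that builds the two flags:

```shell
# Hypothetical helper: print -p flags that publish the container's Jupyter
# port (8888) and Spark driver UI port (4040) on the given host ports.
# $1 = host port for Jupyter (default 8888)
# $2 = host port for the Spark driver UI (default 4040)
port_flags() {
  jupyter_host_port="${1:-8888}"
  spark_ui_host_port="${2:-4040}"
  echo "-p $jupyter_host_port:8888 -p $spark_ui_host_port:4040"
}
```

For example, port_flags 8889 prints -p 8889:8888 -p 4040:4040. With that mapping you would browse to port 8889 on localhost. The URL that Jupyter prints will likely still mention 8888, because the server inside the container does not know about the remapping.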

You should see output similar to the following:

Unable to find image 'jupyter/all-spark-notebook:latest' locally
latest: Pulling from jupyter/all-spark-notebook
e0a742c2abfd: Pull complete
...
ed25ef62a9dd: Pull complete
Digest: sha256:...
Status: Downloaded newer image for jupyter/all-spark-notebook:latest
Execute the command: jupyter notebook
...
[I 19:08:15.017 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:08:15.019 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=...

Now copy the URL shown and paste it into a browser window. (On MacOS, you can command+click the URL in your terminal window instead.)
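
If the URL has scrolled out of view, you can pull it out of the log text instead of hunting for it. The sketch below assumes the log has the format shown above and that you have saved or piped that text somewhere; the function name is hypothetical.

```shell
# Hypothetical helper: read Jupyter's startup log on stdin and print the
# first token URL found in it (nothing is printed if there is none).
extract_jupyter_url() {
  grep -o 'http://localhost:8888/?token=[^[:space:]]*' | head -n 1
}
```

For example, if you pasted the terminal output into a file named jupyter.log (an assumed name), extract_jupyter_url < jupyter.log prints just the URL line.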

Warning: When you quit the Docker container at the end of the tutorial, all your changes will be lost, unless they are in the data and notebooks directories that we mounted! To save notebooks you defined in other locations, export them using the File > Download as > Notebook menu item in the toolbar.

Running the Tutorial

In the Jupyter UI, you should see three folders, data, notebooks, and work. The first two are the folders we mounted. The data we'll use is in the data folder. The notebooks we'll use are... you get the idea.

Open the notebooks folder and click the link for 00_Intro.ipynb.

It opens in a new browser tab. It may take several seconds to load.

Tip: If the new tab fails to open or the notebook fails to load, check the terminal window where you started Jupyter. Are there any error messages?

If you're new to Jupyter, try Help > User Interface Tour to learn how to use Jupyter. At a minimum, you need to know that the content is organized into cells. You can navigate with the up and down arrows or clicks. When you come to a cell with code, either click the run button in the toolbar or use shift+return to execute the code.

Read through the Introduction notebook, then navigate to the examples using the table near the bottom. I've set up the table so that clicking each link opens a new browser tab.

Use an IDE

The tutorial is also set up as a project using the build tool SBT. The popular IDEs, like IntelliJ with the Scala plugin (required) and Eclipse with Scala, can import an SBT project and automatically create an IDE project from it.

Once imported, you can run the Spark job examples as regular applications. There are some examples implemented as scripts that need to be run using the Spark Shell or the SBT console. The tutorial goes into the details.

You are now ready to go through the tutorial.

Use SBT in a Terminal

Using SBT in a terminal is a good approach if you prefer to use a code editor like Emacs, Vim, or SublimeText. You'll need to install SBT, but not Scala or Spark. Those dependencies will be resolved when you build the software.

Start the sbt console, then build the code. In the session below, sbt:spark-scala-tutorial> is the prompt I've configured for the project. Running test compiles the code and runs the tests, while package creates a jar file of the compiled code and configuration files:

$ sbt
...
sbt:spark-scala-tutorial> test
...
sbt:spark-scala-tutorial> package
...
sbt:spark-scala-tutorial>
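
If you'd rather not stay in the interactive console, sbt also accepts commands as arguments and runs them in order before exiting. A minimal sketch, with a hypothetical function name, that first checks that sbt is installed:

```shell
# Hypothetical helper: run the same two build steps non-interactively.
build_tutorial() {
  if ! command -v sbt >/dev/null 2>&1; then
    echo "sbt not found on PATH; install SBT first." >&2
    return 1
  fi
  # Batch mode: sbt runs each argument as a command, in order, then exits.
  sbt test package
}
```

Run build_tutorial from the tutorial's root directory. It is equivalent to typing test and then package at the sbt prompt.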

You are now ready to go through the tutorial.

Going Forward from Here

To learn more, see the following resources:

Final Thoughts

Thank you for working through this tutorial. Feedback and pull requests are welcome.

Dean Wampler
