
deanwampler / Justenoughscalaforspark

License: apache-2.0
A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

Programming Languages: scala (5932 projects)

Projects that are alternatives of or similar to Justenoughscalaforspark

Spark Scala Tutorial
A free tutorial for Apache Spark.
Stars: ✭ 907 (+68.59%)
Mutual labels:  jupyter-notebook, spark, jupyter, tutorial
Ncar Python Tutorial
Numerical & Scientific Computing with Python Tutorial
Stars: ✭ 50 (-90.71%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Hands On Nltk Tutorial
The hands-on NLTK tutorial for NLP in Python
Stars: ✭ 419 (-22.12%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Almond
A Scala kernel for Jupyter
Stars: ✭ 1,354 (+151.67%)
Mutual labels:  jupyter-notebook, spark, jupyter
Elasticsearch Spark Recommender
Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch
Stars: ✭ 707 (+31.41%)
Mutual labels:  jupyter-notebook, spark, jupyter
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+77.32%)
Mutual labels:  jupyter-notebook, spark, jupyter
Scipy2017 Jupyter Widgets Tutorial
Notebooks for the SciPy 2017 tutorial "The Jupyter Interactive Widget Ecosystem"
Stars: ✭ 102 (-81.04%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (-69.7%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Data Science Your Way
Ways of doing Data Science Engineering and Machine Learning in R and Python
Stars: ✭ 530 (-1.49%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Learn jupyter
This is a jupyter practical tutorial. Welcome to edit together!
Stars: ✭ 123 (-77.14%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Intro To Python
An intro to Python & programming for wanna-be data scientists
Stars: ✭ 536 (-0.37%)
Mutual labels:  jupyter-notebook, jupyter, tutorial
Spark Jupyter Aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Stars: ✭ 259 (-51.86%)
Mutual labels:  jupyter-notebook, spark, jupyter
Enterprise gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
Stars: ✭ 412 (-23.42%)
Mutual labels:  jupyter-notebook, spark, jupyter
Py d3
D3 block magic for Jupyter notebook.
Stars: ✭ 428 (-20.45%)
Mutual labels:  jupyter-notebook, jupyter
Ifsharp
F# for Jupyter Notebooks
Stars: ✭ 424 (-21.19%)
Mutual labels:  jupyter-notebook, jupyter
Tensorflow Lstm Regression
Sequence prediction using recurrent neural networks(LSTM) with TensorFlow
Stars: ✭ 433 (-19.52%)
Mutual labels:  jupyter-notebook, jupyter
Dsp Theory
Theory of digital signal processing (DSP): signals, filtration (IIR, FIR, CIC, MAF), transforms (FFT, DFT, Hilbert, Z-transform) etc.
Stars: ✭ 437 (-18.77%)
Mutual labels:  jupyter-notebook, tutorial
Code search
Code For Medium Article: "How To Create Natural Language Semantic Search for Arbitrary Objects With Deep Learning"
Stars: ✭ 436 (-18.96%)
Mutual labels:  jupyter-notebook, tutorial
Deeplearningzerotoall
TensorFlow Basic Tutorial Labs
Stars: ✭ 4,239 (+687.92%)
Mutual labels:  jupyter-notebook, tutorial
Jupyter tensorboard
Start Tensorboard in Jupyter Notebook
Stars: ✭ 446 (-17.1%)
Mutual labels:  jupyter-notebook, jupyter

Just Enough Scala for Spark

Join the chat at https://gitter.im/deanwampler/JustEnoughScalaForSpark

  • Spark Summit San Francisco, June 5, 2017
  • Strata London, May 23, 2017
  • Strata San Jose, March 14, 2017
  • Strata Singapore, December 6, 2016
  • Strata NYC, September 27, 2016

Dean Wampler, Ph.D.
Chaoran Yu taught this tutorial at a few conferences, too.

NEW: François Sarradin (@fsarradin) and colleagues translated this tutorial to French. You can find it here.

This tutorial now uses a Docker image with Jupyter and Spark, for a much more robust, easy to use, and "industry standard" experience.

This tutorial covers the most important features and idioms of Scala you need to use Apache Spark's Scala APIs. Because Spark is written in Scala, it has driven interest in the language, especially among data engineers. Data scientists sometimes use Scala, but most use Python or R.
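To give a flavor of "just enough Scala", here is a sketch (not taken from the notebook) that exercises the same idioms the tutorial teaches for Spark's APIs: anonymous functions, tuples, pattern matching, and method chaining. It uses only plain Scala collections, so no Spark installation is needed.

```scala
// A word count over plain Scala collections. The same chain of
// flatMap/filter/map calls is how you drive Spark's RDD and
// Dataset APIs; only the collection type changes.
val lines = Seq("to be or not to be", "that is the question")

val wordCounts: Map[String, Int] = lines
  .flatMap(line => line.split("""\W+"""))    // split each line into words
  .filter(_.nonEmpty)                        // drop empty tokens
  .map(word => (word.toLowerCase, 1))        // tuple: (word, count)
  .groupBy { case (word, _) => word }        // pattern match on the tuple
  .map { case (word, pairs) => (word, pairs.size) }

println(wordCounts("be"))                    // 2
```

In the Spark versions of this pipeline, lines would be an RDD or Dataset read from files, such as the Shakespeare texts bundled with the tutorial.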

Tips:

  1. If you're taking this tutorial at a conference, it's essential to set it up ahead of time, as there won't be time during the session to resolve setup problems.
  2. Use the Gitter chat room to ask for help or post issues to the GitHub repo if you have trouble installing or running the tutorial.
  3. If all else fails, there is a PDF of the tutorial in the notebooks directory.

Prerequisites

I'll assume you have prior programming experience, in any language. Some familiarity with Java is assumed, but if you don't know Java, you should be able to search for explanations for anything unfamiliar.

This isn't an introduction to Spark itself. Some prior exposure to Spark is helpful, but I'll briefly explain most Spark concepts we'll encounter, too.

Throughout, you'll find links to more information on important topics.

Download the Tutorial

Begin by cloning or downloading the tutorial GitHub project github.com/deanwampler/JustEnoughScalaForSpark.

About Jupyter with Spark

This tutorial uses the All Spark Notebook, a Docker image that combines the popular Jupyter notebook environment with all the tools you need to run Spark, including the Scala language. It bundles Apache Toree to provide Spark and Scala access. The web page for this Docker image covers useful topics such as using Python as well as Scala, user authentication, and running your Spark jobs on clusters rather than in local mode.
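Once the container is running (see "Running the Tutorial" below) and a notebook is open, a one-line cell is enough to confirm that the Toree kernel is alive and which Scala version it runs. Toree predefines spark and sc for you, but this check needs neither:

```scala
// Sanity check for a fresh notebook: print the kernel's Scala version.
// (In a Toree notebook, spark and sc are also predefined, but this
// line does not need them, so it runs in any Scala environment.)
println(scala.util.Properties.versionString)
```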

There are other notebook options you might investigate for your needs:

Open source:

  • Polynote - a cross-language notebook environment with built-in Scala support, developed by Netflix.
  • Jupyter + BeakerX - a powerful set of extensions for Jupyter.
  • Zeppelin - a popular notebook tool in big data environments.

Commercial:

  • Databricks - a feature-rich, commercial, cloud-based service from the creators of Spark.

Running the Tutorial

If you need to install Docker, follow the installation instructions at docker.com (the community edition is sufficient).

Now we'll run the Docker image. It's important to follow the next steps carefully. We'll mount the current working directory into the running container, so that our notebook, data, etc. are accessible inside it.

  • Open a terminal or command window
  • Change to the directory where you expanded the tutorial project or cloned the repo
  • To download and run the Docker image, run the following command: run.sh (MacOS and Linux) or run.bat (Windows)

The MacOS and Linux run.sh command executes this command:

docker run -it --rm \
  -p 8888:8888 -p 4040:4040 \
  --cpus=2.0 --memory=2000M \
  -v "$PWD":/home/jovyan/work \
  "$@" \
  jupyter/all-spark-notebook

The Windows run.bat command is similar, but uses Windows conventions.

The --cpus=... --memory=... arguments were added because the notebook "kernel" is prone to crashing with the default values. Edit to taste. Also, it will help to keep only one notebook (other than the Introduction) open at a time.

The -v "$PWD":/home/jovyan/work argument tells Docker to mount the current working directory inside the container as /home/jovyan/work. This is essential to provide access to the tutorial data and notebooks. When you open the notebook UI (discussed shortly), you'll see this folder listed.

Notes:

  1. On Windows, you may get the following error: "C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: D: drive is not shared. Please share it in Docker for Windows Settings." If so, do the following: in your tray, next to the clock, right-click on Docker, then click Settings. You'll see the Shared Drives settings. Check your drive and click Apply. See this Docker forum thread for more tips.
  2. The command defaults to the latest docker image tag. If you suspect there's a breaking change in a Docker image more recent than the last updates to this tutorial, try using jupyter/all-spark-notebook:619e9cc2fc07 instead.

The -p 8888:8888 -p 4040:4040 arguments tell Docker to "tunnel" ports 8888 and 4040 out of the container to your local environment, so you can reach the Jupyter UI at port 8888 and the Spark driver UI at port 4040.

Note: Here we use just one notebook, but if we used several notebooks concurrently, the second notebook's Spark instance would use port 4041, the third would use 4042, etc. Keep this in mind if you adapt this project for your own needs.

You should see output similar to the following:

Unable to find image 'jupyter/all-spark-notebook:latest' locally
latest: Pulling from jupyter/all-spark-notebook
e0a742c2abfd: Pull complete
...
ed25ef62a9dd: Pull complete
Digest: sha256:...
Status: Downloaded newer image for jupyter/all-spark-notebook:latest
Execute the command: jupyter notebook
...
[I 19:08:15.017 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:08:15.019 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=...

Now copy and paste the URL shown in a browser window.

Tip: If you're using iTerm on a Mac, just click the URL while holding the command key.

Warning: When you quit the Docker container at the end of the tutorial, all your changes will be lost, unless they are in or under the current working directory that we mounted! To save notebooks you created in other locations, export them using the File > Download as > Notebook menu item in the toolbar.

Opening the Tutorial Notebook

Warning: It appears that the Jupyter magics in the notebook no longer work. I have added comments and workarounds.

Now we can load the tutorial. Once you open the Jupyter UI, you'll see the work folder listed. Click once to open it, then open notebooks, then click on the tutorial notebook, JustEnoughScalaForSpark.ipynb. It will open in a new tab. (The PDF in the same directory is a printout of the notebook, in case you have trouble running the notebook itself.)

You'll notice there is a box around the first "cell". This cell has one line of source code println("Hello World!"). Above this cell is a toolbar with a button that has a right-pointing arrow and the word run. Click that button to run this code cell. Or, use the menu item Cell > Run Cells.

After many seconds, once initialization has completed, it will print the output, Hello World!, just below the cell.

Do the same thing for the next box. It should print [merrywivesofwindsor, twelfthnight, midsummersnightsdream, loveslabourslost, asyoulikeit, comedyoferrors, muchadoaboutnothing, tamingoftheshrew], the contents of the /home/jovyan/work/data/shakespeare folder, the texts for several of Shakespeare's plays. We'll use these files as data.

Warning: If you see [] or null printed instead, the mounting of the current working directory did not work correctly when the container was started. In the terminal window, use control-c to exit from the Docker container, make sure you are in the root directory of the project (data and notebooks should be subdirectories), restart the docker image, and make sure you enter the command exactly as shown.
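The null mentioned in the warning comes from the underlying Java file API: File.listFiles returns null for a path that does not exist, which is exactly what happens when the volume mount failed. The sketch below (illustrative, not the notebook's actual cell) shows a null-safe way to do the same directory listing; inside the container, the path is the mounted Shakespeare data folder.

```scala
import java.io.File

// List a directory's entries, returning an empty Seq instead of null
// when the path does not exist, e.g. when the Docker mount failed.
def listDir(path: String): Seq[String] =
  Option(new File(path).listFiles)   // listFiles returns null for a bad path
    .map(_.map(_.getName).toSeq.sorted)
    .getOrElse(Seq.empty)

// Inside the container, this should list the Shakespeare texts;
// an empty result means the mount did not work.
println(listDir("/home/jovyan/work/data/shakespeare"))
```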

If these steps worked, you're done setting up the tutorial!

Getting Help

If you're having problems, use the Gitter chat room to ask for help. If you're reasonably certain you've found a bug, post an issue to the GitHub repo. Recall that the notebooks directory also has a PDF of the notebook that you can read when the notebook won't work.

What's Next?

You are now ready to go through the tutorial.

Don't want to run the notebook to learn the material? A PDF printout of the notebook can also be found in the notebooks directory.

Please post any feedback, bugs, or even pull requests to the project's GitHub page. Thanks.

Dean Wampler
