
KristianHolsheimer / Pyspark Setup Guide

A guide for setting up Spark + PySpark under Ubuntu Linux

Projects that are alternatives of or similar to Pyspark Setup Guide

Gan
Python notebooks accompanying the book Make Your Own GAN
Stars: ✭ 50 (-5.66%)
Mutual labels:  jupyter-notebook
Homeless Arrests Analysis
A Los Angeles Times analysis of arrests of the homeless by the LAPD
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Info490 Sp17
Advanced Data Science, University of Illinois Spring 2017
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Policy Gradient Methods
Implementation of Algorithms from the Policy Gradient Family. Currently includes: A2C, A3C, DDPG, TD3, SAC
Stars: ✭ 54 (+1.89%)
Mutual labels:  jupyter-notebook
Data Privacy For Data Scientists
A workshop on data privacy methods for data scientists.
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Stock Market Prediction Using Natural Language Processing
We used machine learning techniques to evaluate past data pertaining to the stock market and world affairs of the corresponding time period, in order to predict stock trends. We built a model that can buy and sell stock based on profitable predictions, without any human interaction. The model uses Natural Language Processing (NLP) to make smart “decisions” based on current affairs, articles, etc. With NLP and the basic rules of probability, our goal is to increase the accuracy of the stock predictions.
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Aistudio Searching Data Dumps With Use
searching large heterogenous data dumps with Universal Sentence Encoder
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Blog of baojie
Some articles written by Bao Jie
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Notebooks
Some notebooks
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Mypresentations
This is my presentation area: a personal showcase of everyday speeches, notes, and reflections
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
25daysinmachinelearning
I will update this repository with content and materials for learning machine learning with Python and statistics
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Keras2kubernetes
Open source project to deploy Keras Deep Learning models packaged as Docker containers on Kubernetes.
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Commitgen
Code and data for the paper "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes"
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Brihaspati
Collection of various implementations and Codes in Machine Learning, Deep Learning and Computer Vision ✨💥
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Tutoriais De Am
Machine learning algorithms implemented by hand for a better understanding of how they work
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Transformer Tts
Implementation of "FastSpeech: Fast, Robust and Controllable Text to Speech"
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Handwritten Character Recognition
This is a deep learning system that recognizes handwritten characters; it uses the Chars74K dataset to train the model
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Mastering Python Data Analysis
Mastering-Python-Data-Analysis
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Info490 Fa16
INFO 490: Foundations of Data Science, offered in the Fall 2016 Semester at the University of Illinois
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook
Wiki generator live
live code
Stars: ✭ 53 (+0%)
Mutual labels:  jupyter-notebook

Spark + PySpark setup guide

This is a guide for installing and configuring an instance of Apache Spark and its Python API, PySpark, on a single machine running Ubuntu 15.04.

-- Kristian Holsheimer, July 2015


Table of Contents

  1. Install Requirements

    1.1 Install Java

    1.2 Install Scala

    1.3 Install git

    1.4 Install py4j

  2. Install Apache Spark

    2.1 Download and extract source tarball

    2.2 Compile source

    2.3 Install files

  3. Examples

    3.1 Hello World: Word Count


In order to run Spark, we need Scala, which in turn requires Java. So, let's install these requirements first.

1 | Install Requirements

1.1 | Install Java

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Check whether the installation was successful by running:

$ java -version

The output should be something like:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

1.2 | Install Scala

Download and install deb package from scala-lang.org:

$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb

Note: You may want to check if there's a more recent version. At the time of this writing, 2.11.7 was the most recent stable release. Visit the Scala download page to check for updates.

Again, let's check whether the installation was successful by running:

$ scala -version

which should return something like:

Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL

1.3 | Install git

We shall install Apache Spark by building it from source. This procedure depends implicitly on git, so be sure to install git if you haven't already:

$ sudo apt-get -y install git

1.4 | Install py4j

PySpark requires the py4j Python package. If you're running a virtual environment, run:

$ pip install py4j

otherwise, run:

$ sudo pip install py4j
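
Either way, you can verify that the package is importable by opening a Python interpreter and running this quick sanity check (not part of the original guide):

>>> import py4j

If this returns without an ImportError, py4j is installed correctly.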

2 | Install Apache Spark

2.1 | Download and extract source tarball

$ cd ~/Downloads
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz
$ tar xvf spark-1.6.0.tgz

Note: Here too, you may want to check whether there's a more recent version; visit the Spark download page.

2.2 | Compile source

$ cd ~/Downloads/spark-1.6.0
$ sbt/sbt assembly

This will take a while (approximately 20 to 30 minutes).

After the dust settles, you can check whether Spark installed correctly by running the following example, which should return an approximation of the number π ≈ 3.14159...

$ ./bin/run-example SparkPi 10

This should return a line like:

Pi is roughly 3.14042

Note: You may want to lower the verbosity level of the log4j logger. You can do so by editing the log4j properties file (assuming we're still inside the ~/Downloads/spark-1.6.0 folder):

$ cp conf/log4j.properties.template conf/log4j.properties
$ nano conf/log4j.properties

and replace the line:

log4j.rootCategory=INFO, console

with:

log4j.rootCategory=ERROR, console

2.3 | Install files

$ sudo mv ~/Downloads/spark-1.6.0 /opt/
$ sudo ln -s /opt/spark-1.6.0 /opt/spark

Set the required environment variables by editing your bashrc file:

$ nano ~/.bashrc

Add the following lines at the bottom of this file:

# needed for Apache Spark
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python

Reload your bashrc to apply these changes by running:

$ . ~/.bashrc

If your ipython instance doesn't pick up these environment variables for whatever reason, you can also make sure they are set when ipython spins up. Let's add this to our ipython settings by creating a new Python script named load_spark_environment_variables.py in the default profile startup folder:

$ nano ~/.ipython/profile_default/startup/load_spark_environment_variables.py

and paste the following lines in this file:

import os
import sys

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/opt/spark'

if '/opt/spark/python' not in sys.path:
    sys.path.insert(0, '/opt/spark/python')
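
Once you restart ipython, you can confirm that the script ran (a quick check, assuming the /opt/spark install location used above):

>>> import os, sys
>>> os.environ['SPARK_HOME']
'/opt/spark'
>>> '/opt/spark/python' in sys.path
True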

3 | Examples

Now we're finally ready to run our first PySpark application. Load the Spark context by opening up a Python interpreter (or ipython / ipython notebook) and running:

>>> from pyspark import SparkContext
>>> sc = SparkContext()

The Spark context variable sc is your gateway to everything sparkly.
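
As a quick smoke test (a minimal example, not from the original guide), you can parallelize a small collection and sum it:

>>> rdd = sc.parallelize(range(1000))  # distribute the numbers 0..999
>>> rdd.sum()
499500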

3.1 | Hello World: Word Count

Check out the notebook spark_word_count.ipynb.
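
The notebook walks through the full example. A minimal sketch of the same idea looks like this (the input path some_text_file.txt below is a placeholder for any plain-text file):

>>> lines = sc.textFile('some_text_file.txt')            # placeholder path
>>> counts = (lines.flatMap(lambda line: line.split())   # split lines into words
...                .map(lambda word: (word, 1))          # pair each word with a count of 1
...                .reduceByKey(lambda a, b: a + b))     # sum the counts per word
>>> counts.takeOrdered(10, key=lambda x: -x[1])          # ten most frequent words

Here flatMap splits each line into words, map pairs each word with a 1, and reduceByKey adds up the counts for each distinct word.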
