jupyter-incubator / Sparkmagic

Licence: other

Jupyter magics and kernels for working with remote Spark clusters

Programming Languages

python

Labels

jupyter-notebook spark jupyter kernel cluster notebook magic pyspark pandas-dataframe sql-query

Projects that are alternatives of or similar to Sparkmagic

Spark Py Notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 1,338 (+40.25%)

Mutual labels: jupyter-notebook, spark, notebook, pyspark

Enterprise gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.

Stars: ✭ 412 (-56.81%)

Mutual labels: jupyter-notebook, spark, jupyter, kernel

Pyspark Setup Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

Stars: ✭ 24 (-97.48%)

Mutual labels: jupyter-notebook, jupyter, pyspark

Jupyter C Kernel

Minimal Jupyter C kernel

Stars: ✭ 463 (-51.47%)

Mutual labels: jupyter, notebook, kernel

Dyalog Jupyter Kernel

A Jupyter kernel for Dyalog APL

Stars: ✭ 26 (-97.27%)

Mutual labels: jupyter, notebook, kernel

Spark Scala Tutorial

A free tutorial for Apache Spark.

Stars: ✭ 907 (-4.93%)

Mutual labels: jupyter-notebook, spark, jupyter

Hands On Nltk Tutorial

The hands-on NLTK tutorial for NLP in Python

Stars: ✭ 419 (-56.08%)

Mutual labels: jupyter-notebook, jupyter, notebook

Spark Tdd Example

A simple Spark TDD example

Stars: ✭ 23 (-97.59%)

Mutual labels: jupyter-notebook, spark, pyspark

Beakerx

Beaker Extensions for Jupyter Notebook

Stars: ✭ 2,594 (+171.91%)

Mutual labels: jupyter-notebook, jupyter, notebook

Pandas Profiling

Create HTML profiling reports from pandas DataFrame objects

Stars: ✭ 8,329 (+773.06%)

Mutual labels: jupyter-notebook, jupyter, pandas-dataframe

Digital Signal Processing Lecture

Digital Signal Processing - Theory and Computational Examples

Stars: ✭ 532 (-44.23%)

Mutual labels: jupyter-notebook, jupyter, notebook

Justenoughscalaforspark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

Stars: ✭ 538 (-43.61%)

Mutual labels: jupyter-notebook, spark, jupyter

Quantitative Notebooks

Educational notebooks on quantitative finance, algorithmic trading, financial modelling and investment strategy

Stars: ✭ 356 (-62.68%)

Mutual labels: jupyter-notebook, jupyter, notebook

Gophernotes

The Go kernel for Jupyter notebooks and nteract.

Stars: ✭ 3,100 (+224.95%)

Mutual labels: jupyter-notebook, jupyter, kernel

Spark Jupyter Aws

A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support

Stars: ✭ 259 (-72.85%)

Mutual labels: jupyter-notebook, spark, jupyter

Sklearn Classification

Data Science Notebook on a Classification Task, using sklearn and Tensorflow.

Stars: ✭ 518 (-45.7%)

Mutual labels: jupyter-notebook, jupyter, notebook

Elasticsearch Spark Recommender

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

Stars: ✭ 707 (-25.89%)

Mutual labels: jupyter-notebook, spark, jupyter

Paperboy

A web frontend for scheduling Jupyter notebook reports

Stars: ✭ 221 (-76.83%)

Mutual labels: jupyter-notebook, jupyter, notebook

Applied Reinforcement Learning

Reinforcement Learning and Decision Making tutorials explained at an intuitive level and with Jupyter Notebooks

Stars: ✭ 229 (-76%)

Mutual labels: jupyter-notebook, jupyter, notebook

Data Science Your Way

Ways of doing Data Science Engineering and Machine Learning in R and Python

Stars: ✭ 530 (-44.44%)

Mutual labels: jupyter-notebook, jupyter, notebook

sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Features

Run Spark code in multiple languages against any remote Spark cluster through Livy
Automatic SparkContext (sc) and HiveContext (sqlContext) creation
Easily execute SparkSQL queries with the %%sql magic
Automatic visualization of SQL queries in the PySpark, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
Easy access to Spark application information and logs (%%info magic)
Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
Send local files or dataframes to a remote cluster (e.g. sending pretrained local ML model straight to the Spark cluster)
Authenticate to Livy via Basic Access authentication or via Kerberos

Examples

There are two main ways to use sparkmagic, plus a way to send local data to the cluster. Head over to the examples section for a demonstration of each.

1. Via the IPython kernel

The sparkmagic library provides a %%spark magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook. See the Spark Magics on IPython sample notebook.

2. Via the PySpark and Spark kernels

The sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. See the PySpark and Spark sample notebooks.

3. Sending local data to Spark Kernel

See the Sending Local Data to Spark notebook.

Installation

Install the library
```
 pip install sparkmagic
```

Make sure that ipywidgets is properly installed by running

 jupyter nbextension enable --py --sys-prefix widgetsnbextension

If you're using JupyterLab, you'll need to run another command:

 jupyter labextension install "@jupyter-widgets/jupyterlab-manager"

(Optional) Install the wrapper kernels. Run pip show sparkmagic to find the path where sparkmagic is installed, cd to that location, and run:

 jupyter-kernelspec install sparkmagic/kernels/sparkkernel
 jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
 jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

(Optional) Modify the configuration file at ~/.sparkmagic/config.json. Look at the example_config.json.
(Optional) Enable the server extension so that clusters can be programmatically changed:
```
 jupyter serverextension enable --py sparkmagic
```

Authentication Methods

Sparkmagic supports:

No auth
Basic authentication
Kerberos

The Authenticator is the mechanism for authenticating to Livy. The base Authenticator used by itself supports no auth, but it can be subclassed to enable authentication via other methods. Two such examples are the Basic and Kerberos Authenticators.

Kerberos Authenticator

Kerberos support is implemented via the requests-kerberos package. Sparkmagic expects a Kerberos ticket to be available in the system. Requests-kerberos will pick up the Kerberos ticket from a cache file. For the ticket to be available, the user needs to have run kinit to create the Kerberos ticket.

Kerberos Configuration

By default the HTTPKerberosAuth constructor provided by the requests-kerberos package will use the following configuration

HTTPKerberosAuth(mutual_authentication=REQUIRED)

This will not be the right configuration for every context, so you can pass custom arguments to this constructor with the following configuration in ~/.sparkmagic/config.json:

{
    "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": false,
        "force_preemptive": false,
        "principal": "principal",
        "hostname_override": "hostname_override",
        "sanitize_mutual_error_response": true,
        "send_cbt": true
    }
}
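
Hand-editing config.json is easy to get wrong when the file already holds other settings. The following is a minimal sketch, not part of sparkmagic, of merging a settings block into an existing JSON config while keeping the other top-level keys. The helper name `merge_config` is hypothetical, and the demonstration writes to a temporary directory instead of the real ~/.sparkmagic/config.json.

```python
import json
import tempfile
from pathlib import Path


def merge_config(config_path, updates):
    """Merge `updates` into the JSON file at `config_path` (hypothetical helper)."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    current = json.loads(path.read_text()) if path.exists() else {}
    # Top-level keys in `updates` replace existing ones; other keys are kept.
    current.update(updates)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(current, indent=4))
    return current


# A subset of the Kerberos options shown above.
kerberos_settings = {
    "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": False,
        "force_preemptive": False,
    }
}

# Demonstrate against a throwaway directory rather than the real ~/.sparkmagic.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / ".sparkmagic" / "config.json"
    merge_config(config_path, {"all_errors_are_fatal": True})
    merged = merge_config(config_path, kerberos_settings)
    on_disk = json.loads(config_path.read_text())
```

To apply it for real, the same call could be pointed at `Path.home() / ".sparkmagic" / "config.json"`.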

Custom Authenticators

You can write custom Authenticator subclasses to enable authentication via other mechanisms. All Authenticator subclasses should override the Authenticator.__call__(request) method that attaches HTTP Authentication to the given Request object.

Authenticator subclasses that add additional class attributes to be used for the authentication, such as the [Basic](sparkmagic/sparkmagic/auth/basic.py) authenticator which adds username and password attributes, should override the __hash__, __eq__, update_with_widget_values, and get_widgets methods to work with these new attributes. This is necessary in order for the Authenticator to use these attributes in the authentication process.
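
As a rough illustration of the shape such a subclass takes, here is a self-contained sketch. `StandInAuthenticator` is a simplified stand-in written for this example, not sparkmagic's real base class, and `TokenAuthenticator` with its `token` attribute is hypothetical. The widget methods (update_with_widget_values and get_widgets) are left out because they depend on sparkmagic and ipywidgets.

```python
from types import SimpleNamespace


class StandInAuthenticator:
    """Simplified stand-in for the base Authenticator (assumption, not the real class)."""

    def __init__(self, url=""):
        self.url = url

    def __call__(self, request):
        # No auth: the request passes through unchanged.
        return request

    def __eq__(self, other):
        return isinstance(other, type(self)) and self.url == other.url

    def __hash__(self):
        return hash((self.url, self.__class__.__name__))


class TokenAuthenticator(StandInAuthenticator):
    """Hypothetical authenticator that attaches a bearer token header."""

    def __init__(self, url="", token=""):
        super().__init__(url)
        self.token = token

    def __call__(self, request):
        # Attach the HTTP authentication to the given request object.
        request.headers["Authorization"] = "Bearer " + self.token
        return request

    def __eq__(self, other):
        # The new attribute takes part in equality ...
        return (
            isinstance(other, TokenAuthenticator)
            and self.url == other.url
            and self.token == other.token
        )

    def __hash__(self):
        # ... and in hashing, so equal authenticators hash alike.
        return hash((self.url, self.token, self.__class__.__name__))


# A bare object with a `headers` dict stands in for an HTTP request.
request = SimpleNamespace(headers={})
auth = TokenAuthenticator("http://livy.example:8998", token="secret")
auth(request)
```

A real subclass would inherit from sparkmagic's Authenticator and implement the two widget methods as well.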

Using a Custom Authenticator with Sparkmagic

If your repository layout is:

    .
    ├── LICENSE
    ├── README.md
    ├── customauthenticator
    │   ├── __init__.py 
    │   ├── customauthenticator.py 
    └── setup.py

Then to pip install from this repository, run: pip install git+https://git_repo_url/#egg=customauthenticator

After installing, you need to register the custom authenticator with Sparkmagic so it can be dynamically imported. This can be done in two different ways:

Edit the configuration file at ~/.sparkmagic/config.json with the following settings:
```
{
    "authenticators": {
        "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
        "None": "sparkmagic.auth.customauth.Authenticator",
        "Basic_Access": "sparkmagic.auth.basic.Basic",
        "Custom_Auth": "customauthenticator.customauthenticator.CustomAuthenticator"
  }
}
```
This adds your CustomAuthenticator class in customauthenticator.py to Sparkmagic. Custom_Auth is the authentication type that will be displayed in the %manage_spark widget's Auth type dropdown as well as the Auth type passed as an argument to the -t flag in the %spark add session magic.

Modify the authenticators method in sparkmagic/utils/configuration.py to return your custom authenticator:

def authenticators():
    return {
        u"Kerberos": u"sparkmagic.auth.kerberos.Kerberos",
        u"None": u"sparkmagic.auth.customauth.Authenticator",
        u"Basic_Access": u"sparkmagic.auth.basic.Basic",
        u"Custom_Auth": u"customauthenticator.customauthenticator.CustomAuthenticator"
    }
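
Both registration options map an auth type name to a dotted class path that is imported dynamically. The sketch below shows the general technique of resolving such a path with importlib. It is an illustration only and may differ from how sparkmagic loads authenticators internally. The `load_class` helper and the `Demo_Auth` entry are hypothetical, and a standard-library class stands in for a custom authenticator so the block runs without sparkmagic installed.

```python
import importlib


def load_class(dotted_path):
    """Import `package.module.ClassName` and return the class object."""
    # Split "a.b.ClassName" into the module path "a.b" and the name "ClassName".
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Illustrative registry: a standard-library class stands in for an authenticator.
authenticators = {"Demo_Auth": "collections.OrderedDict"}
cls = load_class(authenticators["Demo_Auth"])
```

This is why the package holding the custom authenticator has to be installed in the same environment as sparkmagic: the module part of the path must be importable.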

Papermill

If you want Papermill rendering to stop on a Spark error, edit the ~/.sparkmagic/config.json with the following settings:

{
    "shutdown_session_on_spark_statement_errors": true,
    "all_errors_are_fatal": true
}

If you want any registered Livy sessions to be cleaned up on exit regardless of whether the process exits gracefully or not, you can set:

{
    "cleanup_all_sessions_on_exit": true,
    "all_errors_are_fatal": true
}

Conf overrides in code

In addition to the conf at ~/.sparkmagic/config.json, sparkmagic conf can be overridden programmatically in a notebook.

For example:

import sparkmagic.utils.configuration as conf
conf.override('cleanup_all_sessions_on_exit', True)

Same thing, but referencing the conf member:

conf.override(conf.cleanup_all_sessions_on_exit.__name__, True)

NOTE: The override for cleanup_all_sessions_on_exit must be set before initializing sparkmagic, i.e. before this:

%load_ext sparkmagic.magics

Docker

The included docker-compose.yml file will let you spin up a full sparkmagic stack that includes a Jupyter notebook with the appropriate extensions installed, and a Livy server backed by a local-mode Spark instance. (This is just for testing and developing sparkmagic itself; in reality, sparkmagic is not very useful if your Spark instance is on the same machine!)

In order to use it, make sure you have Docker and Docker Compose both installed, and then simply run:

docker-compose build
docker-compose up

You will then be able to access the Jupyter notebook in your browser at http://localhost:8888. Inside this notebook, you can configure a sparkmagic endpoint at http://spark:8998. This endpoint is able to launch both Scala and Python sessions. You can also choose to start a wrapper kernel for Scala, Python, or R from the list of kernels.

To shut down the containers, you can interrupt docker-compose with Ctrl-C, and optionally remove the containers with docker-compose down.

If you are developing sparkmagic and want to test out your changes in the Docker container without needing to push a version to PyPI, you can set the dev_mode build arg in docker-compose.yml to true, and then re-build the container. This will cause the container to install your local version of autovizwidget, hdijupyterutils, and sparkmagic. Make sure to re-run docker-compose build before each test run.

Server extension API

`/reconnectsparkmagic`:

POST: Specifies the Spark cluster connection information for a notebook, given the notebook path and the cluster information. The kernel will be started or restarted and connected to the specified cluster.

Request Body example: { "path": "path.ipynb", "username": "username", "password": "password", "endpoint": "url", "auth": "Kerberos", "kernelname": "pysparkkernel" }

Note that auth can be None, Basic_Access or Kerberos, depending on the authentication enabled in Livy. The kernelname parameter is optional and defaults to the one specified in the config file, or pysparkkernel if none is specified there. Returns 200 if successful; 400 if the body is not a JSON string or a key is not found; 500 if an error is encountered while changing clusters.

Reply Body example: { "success": true, "error": null }
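
A minimal sketch of building such a request from Python with the standard library follows. The helper name `build_reconnect_request` is hypothetical, the base URL and Livy endpoint are borrowed from the Docker setup above as assumptions, and a real Jupyter server may also require an authentication token or XSRF header that this sketch does not add.

```python
import json
import urllib.request


def build_reconnect_request(base_url, path, endpoint, auth="None",
                            username="", password="", kernelname=None):
    """Build (but do not send) a POST request for /reconnectsparkmagic."""
    body = {
        "path": path,
        "username": username,
        "password": password,
        "endpoint": endpoint,
        "auth": auth,
    }
    # kernelname is optional, so only include it when one is given.
    if kernelname is not None:
        body["kernelname"] = kernelname
    return urllib.request.Request(
        base_url.rstrip("/") + "/reconnectsparkmagic",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reconnect_request(
    "http://localhost:8888", "path.ipynb", "http://spark:8998",
    kernelname="pysparkkernel",
)

# Sending is left commented out because it needs a running Jupyter server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Serializing the body with `json.dumps` guarantees a valid JSON string, which avoids the 400 response described above.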

Architecture

Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe as appropriate.
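
To make the remote-execution flow concrete, here is a rough sketch of the two kinds of Livy REST calls involved: creating a session and submitting a statement to it. This is not sparkmagic's own client code, which also handles authentication, polling for results and error handling. The `livy_request` helper is hypothetical, the endpoint is the one from the Docker setup, and the session id is a placeholder for the id Livy would return when the session is created.

```python
import json
import urllib.request


def livy_request(endpoint, route, payload):
    """Build (but do not send) a JSON POST request to a Livy route."""
    return urllib.request.Request(
        endpoint.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


endpoint = "http://spark:8998"  # assumption: Livy endpoint from the Docker setup

# 1. Ask Livy to start an interactive PySpark session.
create_session = livy_request(endpoint, "/sessions", {"kind": "pyspark"})

# 2. Submit user code as a statement to that session.
session_id = 0  # placeholder: the real id comes from the create-session response
run_statement = livy_request(
    endpoint,
    "/sessions/%d/statements" % session_id,
    {"code": "sc.parallelize(range(10)).sum()"},
)
```

The statement's output comes back from Livy as plain text or JSON, which is what the library then formats for display.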

This architecture offers us some important advantages:

Run Spark code completely remotely; no Spark components need to be installed on the Jupyter server
Multi-language support; the Python, Python3, Scala and R kernels are equally feature-rich, and adding support for more languages will be easy
Support for multiple endpoints; you can use a single notebook to start multiple Spark jobs in different languages and against different remote clusters
Easy integration with any Python library for data science or visualization, like Pandas or Plotly

However, there are some important limitations to note:

Some overhead added by sending all code and output through Livy
Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode.

Contributing

We welcome contributions from everyone. If you've made an improvement to our code, please send us a pull request.

To dev install, execute the following:

    git clone https://github.com/jupyter-incubator/sparkmagic
    pip install -e hdijupyterutils 
    pip install -e autovizwidget
    pip install -e sparkmagic

and optionally follow the wrapper-kernel and configuration steps (steps 3 and 4) from the Installation section above.

To run unit tests, run:

    nosetests hdijupyterutils autovizwidget sparkmagic

If you want to see an enhancement made but don't have time to work on it yourself, feel free to submit an issue for us to deal with.

Stars: ✭ 954
