jadianes / Spark R Notebooks

Licence: other
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

SparkR Notebooks

Join the chat at https://gitter.im/jadianes/spark-r-notebooks

This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the R language.

If you are interested in an introduction to some basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.

Instructions

For this series of notebooks, we have used Jupyter with the IRkernel R kernel. You can find installation instructions for your specific setup here. Also have a look at Andrie de Vries' post Using R with Jupyter Notebooks, which includes instructions for installing Jupyter and IRkernel together.

A good way of using these notebooks is to first clone the repo and then start Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on localhost, with a maximum of 6 GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Note that the path to the pyspark command depends on your specific installation. As a requirement, Spark must be installed on the same machine where you start the IPython notebook server.

For more Spark options see here. As a general rule, an option documented in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
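The naming rule above can be sketched as a small shell helper. This is only an illustration: spark_env_name is a hypothetical function, not part of Spark, and it covers the name conversion only (dots become underscores, letters become upper case). Whether a given variable is actually honoured depends on your Spark version, and the text above does not say how camelCase property names should be mapped.

```shell
# Hypothetical helper, not part of Spark: converts a property name written
# in the spark.executor.memory form into the SPARK_EXECUTOR_MEMORY form.
spark_env_name() {
    # Replace dots with underscores, then upper-case every letter.
    printf '%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

# Example: print the variable name for the executor memory option.
spark_env_name spark.executor.memory
```

The printed name can then be set on the command line in the same way as SPARK_EXECUTOR_MEMORY="6G" in the launch command shown earlier.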

Datasets

2013 American Community Survey dataset.

Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency. You can go directly to the source to learn more about the data and to get files for different years, longer periods, individual states, etc.

In any case, the start-up notebook downloads the 2013 data locally for later use with the rest of the notebooks.

The idea of using this dataset came from its recent announcement on Kaggle as part of their Kaggle scripts datasets. There you can analyse the dataset on site while sharing your results with other Kaggle users. Highly recommended!

Notebooks

Downloading data and starting with SparkR

Where we download our data locally and start up a SparkR cluster.

SparkSQL basics with SparkR

About loading our data into SparkSQL data frames using SparkR.

Data frame operations with SparkSQL and SparkR

Different operations we can use with SparkR and DataFrame objects, such as data selection and filtering, aggregations, and sorting. The basis for exploratory data analysis and machine learning.

Exploratory Data Analysis with SparkR and ggplot2

How to explore different types of variables using SparkR and ggplot2 charts.

Linear Models with SparkR

About linear models in SparkR, their uses, and current limitations in v1.5.

Applications

Exploring geographical data with SparkR and ggplot2

An Exploratory Data Analysis of the 2013 American Community Survey dataset, focusing on its geographical features.

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

License

This repository contains a variety of content: some developed by Jose A. Dianes, and some from third parties. The third-party content is distributed under the license provided by those parties.

The content developed by Jose A. Dianes is distributed under the following license:

Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.