jadianes / Spark Py Notebooks

Licence: other

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Programming Languages

python

139335 projects - #7 most used programming language

Labels

jupyter-notebook machine-learning data-science spark data-analysis big-data notebook bigdata ipython pyspark ipython-notebook

Projects that are alternatives of or similar to Spark Py Notebooks

Spark R Notebooks

R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 109 (-91.85%)

Mutual labels: jupyter-notebook, data-science, data-analysis, big-data, notebook, bigdata

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

Stars: ✭ 986 (-26.31%)

Mutual labels: jupyter-notebook, data-science, spark, data-analysis, bigdata, pyspark

Sci Pype

A Machine Learning API with native redis caching and export + import using S3. Analyze entire datasets using an API for building, training, testing, analyzing, extracting, importing, and archiving. This repository can run from a docker container or from the repository.

Stars: ✭ 90 (-93.27%)

Mutual labels: ipython, jupyter-notebook, data-science, ipython-notebook

Cookbook 2nd Code

Code of the IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018 [read-only repository]

Stars: ✭ 541 (-59.57%)

Mutual labels: ipython, jupyter-notebook, data-science, data-analysis

My Journey In The Data Science World

📢 Ready to learn or review your knowledge!

Stars: ✭ 1,175 (-12.18%)

Mutual labels: jupyter-notebook, data-science, data-analysis, big-data

leaflet heatmap

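A simple visualization of call data from Huzhou. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap drawing step is moved offline for computation and analysis. The data is first processed in parallel with Apache Spark, the heatmap is then drawn with Apache Spark, and leafletjs loads the OpenStreetMap layer and the heatmap layer to give good interactivity. The drawing is currently implemented with Apache Spark; either Apache Spark is not well suited to this kind of computation or I did not design the algorithm well, because the parallel computation is slower than the single-machine one. The Apache Spark heatmap drawing and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .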

Stars: ✭ 13 (-99.03%)

Mutual labels: big-data, spark, bigdata, data-analysis

Quantitative Notebooks

Educational notebooks on quantitative finance, algorithmic trading, financial modelling and investment strategy

Stars: ✭ 356 (-73.39%)

Mutual labels: jupyter-notebook, data-science, data-analysis, notebook

Big Data Engineering Coursera Yandex

Big Data for Data Engineers Coursera Specialization from Yandex

Stars: ✭ 71 (-94.69%)

Mutual labels: jupyter-notebook, spark, big-data, bigdata

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (-88.79%)

Mutual labels: jupyter-notebook, spark, big-data, pyspark

Cookbook 2nd

IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018

Stars: ✭ 704 (-47.38%)

Mutual labels: ipython, jupyter-notebook, data-science, data-analysis

Nteract

📘 The interactive computing suite for you! ✨

Stars: ✭ 5,713 (+326.98%)

Mutual labels: ipython, jupyter-notebook, data-science, notebook

Spark Movie Lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

Stars: ✭ 745 (-44.32%)

Mutual labels: jupyter-notebook, spark, big-data, bigdata

W2v

Word2Vec models with Twitter data using Spark. Blog:

Stars: ✭ 64 (-95.22%)

Mutual labels: jupyter-notebook, data-science, spark, pyspark

Dtale

Visualizer for pandas data structures

Stars: ✭ 2,864 (+114.05%)

Mutual labels: ipython, jupyter-notebook, data-science, data-analysis

Courses

Quiz & Assignment of Coursera

Stars: ✭ 454 (-66.07%)

Mutual labels: jupyter-notebook, data-science, data-analysis, big-data

Datasciencevm

Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

Stars: ✭ 153 (-88.57%)

Mutual labels: jupyter-notebook, data-science, data-analysis, big-data

Data Analysis And Machine Learning Projects

Repository of teaching materials, code, and data for my data analysis and machine learning projects.

Stars: ✭ 5,166 (+286.1%)

Mutual labels: jupyter-notebook, data-science, data-analysis, ipython-notebook

Pythondata

repo for code published on pythondata.com

Stars: ✭ 113 (-91.55%)

Mutual labels: jupyter-notebook, data-science, data-analysis, big-data

H2o 3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Stars: ✭ 5,656 (+322.72%)

Mutual labels: jupyter-notebook, data-science, spark, big-data

Sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Stars: ✭ 954 (-28.7%)

Mutual labels: jupyter-notebook, spark, notebook, pyspark

Spark Python Notebooks

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language.

If Python is not your language and R is, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.

Instructions

A good way of using these notebooks is to first clone the repo and then start your own IPython notebook/Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6 GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. As a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.

For more Spark options see here. As a general rule, an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
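The naming rule can be sketched in a few lines of Python. The helper name option_to_env_var is hypothetical and only illustrates the convention described here; it does not check whether a given Spark version actually honours the resulting environment variable.

```python
def option_to_env_var(option: str) -> str:
    """Turn a dotted Spark option name into its upper-case, underscore-separated form."""
    # Assumption: the rule is purely textual (dots become underscores, then upper-case).
    return option.replace(".", "_").upper()


print(option_to_env_var("spark.executor.memory"))  # SPARK_EXECUTOR_MEMORY
```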

Datasets

We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.

References

The reference book for these and other Spark related topics is:

Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

Notebooks

The following notebooks can be examined individually, although there is a more or less linear 'story' when they are followed in sequence. They all use the same dataset and try to solve a related set of tasks with it.

RDD creation

About reading files and parallelize.

RDDs basics

A look at map, filter, and collect.

Sampling RDDs

RDD sampling methods explained.

RDD set operations

Brief introduction to some of the RDD pseudo-set operations.

Data aggregations on RDDs

RDD actions reduce, fold, and aggregate.

Working with key/value pair RDDs

How to deal with key/value pairs in order to aggregate and explore data.

MLlib: Basic Statistics and Exploratory Data Analysis

A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.

MLlib: Logistic Regression

Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using correlation matrix and Hypothesis Testing.

MLlib: Decision Trees

Use of tree-based methods and how they help with explaining models and selecting features.

Spark SQL: structured processing for Data Analysis

In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark's SQL DataFrame abstraction to perform a more structured exploratory data analysis.
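As a rough illustration of the aggregation notebook's topic, the following plain-Python sketch mimics what an aggregate action computes: each partition is folded separately from a zero value, and the partial results are then merged. It does not use Spark; the aggregate helper, the partition layout, and the duration values are all hypothetical and chosen only to show the idea.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Plain-Python stand-in for the semantics of an RDD aggregate action."""
    # Fold each partition on its own, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition results, again starting from the zero value.
    return reduce(comb_op, partials, zero)


# Hypothetical connection durations, split into two partitions.
partitions = [[0, 2, 4], [10, 4]]

# Accumulate a (sum, count) pair so the mean can be derived in one pass.
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean_duration = total / count
print(total, count, mean_duration)
```

With a real RDD the same pair of functions would be handed to the aggregate action, and Spark would decide how the data is partitioned.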

Applications

Beyond the basics. Close to real-world applications using Spark and other technologies.

Olssen: On-line Spectral Search ENgine for proteomics

Same tech stack this time with an AngularJS client app.

An on-line movie recommendation web service

This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit.

To that I've added minor modifications to use a larger dataset, as well as code showing how to store and reload the model for later use. On top of that we build a Flask web service so the recommender can be used to provide movie recommendations on-line.

KDD Cup 1999

My attempt at using Spark with this classic dataset and Knowledge Discovery competition.

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

Twitter: @ja_dianes
GitHub: jadianes
LinkedIn: jadianes
Website: jadianes.me

License

This repository contains a variety of content; some developed by Jose A. Dianes, and some from third parties. The third-party content is distributed under the license provided by those parties.

The content developed by Jose A. Dianes is distributed under the following license:

Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Stars: ✭ 1,338