
PacktPublishing / Learning Pyspark

License: MIT
Code repository for Learning PySpark by Packt


Learning PySpark

This is the code repository for Learning PySpark, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

About the book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.
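Cloud or cluster deployment via spark-submit, mentioned above, is a single command. The script name and master setting below are illustrative placeholders, not files from the book:

```shell
# Hypothetical example: my_app.py stands in for your own PySpark script.
# local[4] runs Spark locally with 4 worker threads; on a cluster you
# would pass a cluster manager URL (e.g. yarn) instead.
spark-submit --master local[4] my_app.py
```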

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Instructions and Navigation

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter 03.

The code will look like the following:

    data_key = sc.parallelize(
        [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1),
         ('d', 3)], 4)
    data_key.reduceByKey(lambda x, y: x + y).collect()
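To see what this snippet computes without a running Spark instance, here is a plain-Python model of reduceByKey semantics, using the same data and reducer. This is an illustration, not the book's code:

```python
from functools import reduce

def reduce_by_key(pairs, fn):
    """Group (key, value) pairs by key, then fold each group's values
    with fn, mimicking the semantics of RDD.reduceByKey."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}

data = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]
print(reduce_by_key(data, lambda x, y: x + y))
# {'a': 12, 'b': 4, 'c': 2, 'd': 5}
```

Unlike this dictionary sketch, the real reduceByKey runs the fold in parallel across the RDD's partitions (four of them in the snippet above) and collect() returns a list of key/value tuples in no guaranteed order.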

Software requirements:

For this book you need a personal computer running Windows, macOS, or Linux. To run Apache Spark, you will need Java 7+ and an installed and configured Python 2.6+ or 3.4+ environment; we use the Anaconda distribution of Python 3.5, which can be downloaded from https://www.continuum.io/downloads.
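A quick way to confirm your interpreter meets the Python requirement above (a convenience sketch, not from the book):

```python
import sys

# The book targets Python 2.6+ or 3.4+; Anaconda with Python 3.5
# satisfies this.
meets_requirement = sys.version_info >= (3, 4) or (
    sys.version_info[0] == 2 and sys.version_info >= (2, 6))
print(meets_requirement)
```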

The Python modules we use throughout the book come preinstalled with Anaconda. We also use GraphFrames and TensorFrames, which can be loaded dynamically when starting a Spark instance; to load them you only need an Internet connection. It is fine if some of these modules are not currently installed on your machine — we will guide you through the installation process.
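Dynamic loading of such packages at startup goes through Spark's --packages flag; the exact coordinates below are illustrative and must match your Spark and Scala versions:

```shell
# Illustrative Maven coordinates for GraphFrames; pick the artifact
# matching your Spark/Scala build. Spark downloads it on first use,
# which is why an Internet connection is needed.
pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
```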

Note:

Chapter 11 and Bonus Chapter 02 do not contain code files.


Suggestions and Feedback

Click here if you have any feedback or suggestions.
