
svenkreiss / Pysparkling

Licence: other
A pure Python implementation of Apache Spark's RDD and DStream interfaces.


Projects that are alternatives of or similar to Pysparkling

Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-76.19%)
Mutual labels:  data-science, apache-spark, data-processing
Hub
Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai
Stars: ✭ 4,003 (+1632.9%)
Mutual labels:  data-science, data-processing
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Stars: ✭ 3,081 (+1233.77%)
Mutual labels:  data-science, apache-spark
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+78.79%)
Mutual labels:  data-science, apache-spark
Dist Keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (+165.37%)
Mutual labels:  data-science, apache-spark
Scalable Data Science
Scalable Data Science: course sets in big data using Apache Spark over Databricks and their mathematical, statistical and computational foundations using SageMath.
Stars: ✭ 142 (-38.53%)
Mutual labels:  data-science, apache-spark
Awesome Kafka
A list about Apache Kafka
Stars: ✭ 397 (+71.86%)
Mutual labels:  apache-spark, data-processing
Data Science On Gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Stars: ✭ 864 (+274.03%)
Mutual labels:  data-science, data-processing
Dataflowjavasdk
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Stars: ✭ 854 (+269.7%)
Mutual labels:  data-science, data-processing
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-44.59%)
Mutual labels:  data-science, apache-spark
Collapse
Advanced and Fast Data Transformation in R
Stars: ✭ 184 (-20.35%)
Mutual labels:  data-science, data-processing
Dash
Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.
Stars: ✭ 15,592 (+6649.78%)
Mutual labels:  data-science
Awesome Ai Infrastructures
Infrastructures™ for Machine Learning Training/Inference in Production.
Stars: ✭ 223 (-3.46%)
Mutual labels:  apache-spark
Spark Workshop
Apache Spark™ and Scala Workshops
Stars: ✭ 224 (-3.03%)
Mutual labels:  apache-spark
Statistical Learning
Lecture Slides and R Sessions for Trevor Hastie and Rob Tibshirani's "Statistical Learning" Stanford course
Stars: ✭ 223 (-3.46%)
Mutual labels:  data-science
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (-1.73%)
Mutual labels:  data-science
Streamlit
Streamlit — The fastest way to build data apps in Python
Stars: ✭ 16,906 (+7218.61%)
Mutual labels:  data-science
Machine Learning Notebooks
Machine Learning notebooks for refreshing concepts.
Stars: ✭ 222 (-3.9%)
Mutual labels:  data-processing
Jupyterlab templates
Support for jupyter notebook templates in jupyterlab
Stars: ✭ 223 (-3.46%)
Mutual labels:  data-science
Ml Workspace
Machine Learning (Beginners Hub): information (courses, books, cheat sheets, live sessions) related to machine learning, data science and Python
Stars: ✭ 221 (-4.33%)
Mutual labels:  data-science

.. image:: https://raw.githubusercontent.com/svenkreiss/pysparkling/master/logo/logo-w100.png
   :target: https://github.com/svenkreiss/pysparkling

pysparkling

Pysparkling provides a faster, more responsive way to develop programs for PySpark. It enables code intended for Spark applications to execute entirely in Python, without incurring the overhead of initializing and passing data through the JVM and Hadoop. The focus is on a lightweight, fast implementation for small datasets, at the expense of some of Spark's data resilience and parallel processing features.

How does it work? To switch execution of a script from PySpark to pysparkling, have the code initialize a pysparkling Context instead of a SparkContext, and use the pysparkling Context to set up your RDDs. The beauty is you don't have to change a single line of code after the Context initialization, because pysparkling's API is (almost) exactly the same as PySpark's. Since it's so easy to switch between PySpark and pysparkling, you can choose the right tool for your use case.
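
For instance, the swap can look like this minimal sketch; everything after the Context line is ordinary RDD code that would run unchanged under PySpark:

.. code-block:: python

    from pysparkling import Context

    # Drop-in replacement for:
    #   from pyspark import SparkContext
    #   sc = SparkContext()
    sc = Context()

    # Ordinary RDD code from here on.
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * x).sum())  # 30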

When would I use it? Say you are writing a Spark application because you need robust computation on huge datasets, but you also want the same application to provide fast answers on a small dataset. You're finding Spark is not responsive enough for your needs, but you don't want to rewrite an entire separate application for the small-answers-fast problem. You'd rather reuse your Spark code but somehow get it to run fast. Pysparkling bypasses the JVM and Hadoop initialization that causes Spark's long startup times and less responsive feel.

Here are a few areas where pysparkling excels:

  • Small to medium-scale exploratory data analysis
  • Application prototyping
  • Low-latency web deployments
  • Unit tests (a minimal sketch follows below)
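
Because a Context starts instantly, RDD logic can be exercised in plain unit tests. A minimal pytest-style sketch (the function under test and the test name are illustrative):

.. code-block:: python

    from pysparkling import Context

    def word_lengths(rdd):
        # the RDD transformation under test
        return rdd.map(len)

    def test_word_lengths():
        # no JVM or cluster startup, so a fresh Context per test is cheap
        rdd = Context().parallelize(['spark', 'py'])
        assert word_lengths(rdd).collect() == [5, 2]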

Install

.. code-block:: bash

    pip install pysparkling[s3,hdfs,streaming]

`Documentation <https://pysparkling.trivial.io>`_:

.. image:: https://raw.githubusercontent.com/svenkreiss/pysparkling/master/docs/readthedocs.png
   :target: https://pysparkling.trivial.io

Other links: `Github <https://github.com/svenkreiss/pysparkling>`_, |pypi-badge|, |test-badge|, |docs-badge|

.. |pypi-badge| image:: https://badge.fury.io/py/pysparkling.svg
   :target: https://pypi.python.org/pypi/pysparkling/
.. |test-badge| image:: https://github.com/svenkreiss/pysparkling/workflows/Tests/badge.svg
   :target: https://github.com/svenkreiss/pysparkling/actions?query=workflow%3ATests
.. |docs-badge| image:: https://readthedocs.org/projects/pysparkling/badge/?version=latest
   :target: https://pysparkling.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

Features

  • Supports the URI schemes s3://, hdfs://, gs://, http:// and file:// for Amazon S3, HDFS, Google Storage, web and local file access. Specify multiple files separated by commas. Resolves * and ? wildcards.
  • Handles .gz, .zip, .lzma, .xz, .bz2, .tar, .tar.gz and .tar.bz2 compressed files. Supports reading of .7z files.
  • Parallelization via multiprocessing.Pool, concurrent.futures.ThreadPoolExecutor or any other Pool-like object that has a map(func, iterable) method (a sketch follows below).
  • Plain pysparkling does not have any dependencies (use pip install pysparkling). Some file access methods have optional dependencies: boto for AWS S3, requests for HTTP, hdfs for HDFS.
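
As a rough sketch of how the parallelization and URI features combine (the pool keyword argument is an assumption to verify against the pysparkling documentation):

.. code-block:: python

    import multiprocessing

    from pysparkling import Context

    if __name__ == '__main__':
        # Assumption: Context accepts any Pool-like object with a
        # map(func, iterable) method, per the feature list above.
        sc = Context(pool=multiprocessing.Pool(4))

        # Comma-separated paths with * and ? wildcards expand to multiple
        # partitions; s3:// or hdfs:// URIs work the same way once the
        # optional dependencies are installed.
        lines = sc.textFile('logs/*.txt.gz,logs/day-?.txt')
        print(lines.count())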

Examples

Some demos are in the notebooks `docs/demo.ipynb <https://github.com/svenkreiss/pysparkling/blob/master/docs/demo.ipynb>`_ and `docs/iris.ipynb <https://github.com/svenkreiss/pysparkling/blob/master/docs/iris.ipynb>`_.

Word Count

.. code-block:: python

    from pysparkling import Context

    counts = (
        Context()
        .textFile('README.rst')
        .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
        .flatMap(lambda line: line.split(' '))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    print(counts.collect())

which prints a long list of pairs of words and their counts.
