Top 95 pyspark open source projects

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

✭ 253

python machine-learning kubernetes artificial-intelligence pipeline cassandra front-end-development pyspark user-experience

Quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

✭ 217

python apache-spark pyspark

Gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

✭ 216

python scala elasticsearch kafka spark big-data cassandra jdbc hbase pyspark paypal spark-streaming

Mmlspark

Simple and Distributed Machine Learning

Spark Practice

Apache Spark (PySpark) Practice on Real Data

✭ 200

jupyter-notebook spark pyspark

Spark Nlp

State of the Art Natural Language Processing

Spark Iforest

Isolation Forest on Spark

✭ 166

scala spark anomaly-detection pyspark

Azure Cosmosdb Spark

Apache Spark Connector for Azure Cosmos DB

✭ 165

scala jupyter-notebook spark apache-spark pyspark connector

Linkis

Linkis makes it easy to connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.

Handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes

✭ 158

python jupyter-notebook visualization spark pandas pyspark exploratory-data-analysis

Learningapachespark

LearningApacheSpark

✭ 155

python html tutorial spark latex pyspark

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

✭ 150

python jupyter-notebook machine-learning database sql spark analytics big-data hadoop apache parallel-computing distributed-computing apache-spark dataframe pyspark hdfs

Cc Pyspark

Process Common Crawl data with Python and Spark

✭ 147

python spark pyspark

Pyspark Learning

Updated repository

✭ 147

jupyter-notebook spark pyspark spark-streaming

Repo 2019

BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics

✭ 133

jupyter-notebook tensorflow anomaly-detection sql-server keras-tensorflow raspberry-pi-3 pyspark

Butterfree

A tool for building feature stores.

✭ 126

python data-science package etl data-engineering pyspark etl-framework

Eat pyspark in 10 days

pyspark🍒🥭 is delicious, just eat it!😋😋

✭ 116

python spark pyspark

Pyspark Cheatsheet

🐍 Quick reference guide to common patterns & functions in PySpark.

✭ 108

data-science documentation data spark cheatsheet guide docs reference pyspark cheat cheatsheets quickstart guides

Hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

✭ 108

java scala algorithm spark pyspark

Pyspark Stubs

Apache (Py)Spark type annotations (stub files).

✭ 98

python apache-spark pyspark

Relation extraction

Relation Extraction using Deep Learning (CNN)

✭ 96

python tensorflow nlp spark relation-extraction pyspark

Spark Py Notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

✭ 1,338

python jupyter-notebook machine-learning data-science spark data-analysis big-data notebook bigdata ipython pyspark ipython-notebook

Pyspark Tutorial

PySpark Code for Hands-on Learners

✭ 91

jupyter-notebook pyspark

Bitcoin Value Predictor

[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin

✭ 91

jupyter-notebook machine-learning bitcoin big-data prediction pyspark stock-price-prediction

Spark python ml examples

Spark 2.0 Python Machine Learning examples

✭ 87

python machine-learning aws spark kaggle pyspark

W2v

Word2Vec models with Twitter data using Spark. Blog:

✭ 64

jupyter-notebook machine-learning data-science spark twitter pyspark

Pysparkgeoanalysis

🌐 Interactive Workshop on GeoAnalysis using PySpark

✭ 63

jupyter-notebook docker spark pyspark

Petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

✭ 1,108

python deep-learning machine-learning pytorch tensorflow pyspark parquet

Awesome Spark

A curated list of awesome Apache Spark packages and resources.

✭ 1,061

awesome apache-spark pyspark

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

✭ 986

jupyter-notebook machine-learning data-science spark data-analysis bigdata pyspark data-cleaning data-wrangling

Sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

✭ 954

python jupyter-notebook spark jupyter kernel cluster notebook magic pyspark pandas-dataframe sql-query

Live log analyzer spark

Spark application for analyzing Apache access logs and detecting anomalies, with an accompanying Medium article.

✭ 14

python spark analytics apache-spark pyspark

Sparkling Titanic

Training models with Apache Spark, PySpark for Titanic Kaggle competition

✭ 12

python spark pyspark

Pyspark Setup Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

✭ 24

python jupyter-notebook docker jupyter big-data pyspark

Spark Tdd Example

A simple Spark TDD example

✭ 23

python jupyter-notebook spark tdd pyspark

Cluster Pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster

✭ 23

python s3 pyspark hdfs

Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

✭ 696

scala vue sql spark ide hive pyspark hue

Pyspark Example Project

Example project implementing best practices for PySpark ETL jobs and applications.

✭ 633

python data-science spark etl data-engineering pyspark

Spark Syntax

This is a repo documenting the best practices in PySpark.

✭ 412

jupyter-notebook best-practices pyspark

Devops Python Tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

✭ 406

python docker linux json aws devops elasticsearch spark travis-ci hadoop gcp cloudformation solr hbase avro pyspark hdfs parquet

Pyspark Boilerplate

A boilerplate for writing PySpark Jobs

✭ 318

python boilerplate apache-spark pyspark

Spark Gotchas

Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks

✭ 308

book guide apache-spark pyspark

Tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

✭ 274

python distributed-computing pyspark mapreduce

basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

✭ 25