Top 95 pyspark open source projects

Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Spark Iforest
Isolation Forest on Spark
Azure Cosmosdb Spark
Apache Spark Connector for Azure Cosmos DB
Linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
Cc Pyspark
Process Common Crawl data with Python and Spark
Repo 2019
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
Eat pyspark in 10 days
pyspark🍒🥭 is delicious, just eat it!😋😋
Hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Pyspark Stubs
Apache (Py)Spark type annotations (stub files).
Relation extraction
Relation Extraction using Deep Learning (CNN)
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Pyspark Tutorial
PySpark Code for Hands-on Learners
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin
Spark python ml examples
Spark 2.0 Python Machine Learning examples
W2v
Word2Vec models with Twitter data using Spark. Blog:
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Petastorm
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Live log analyzer spark
Spark application that analyzes Apache access logs and detects anomalies, with an accompanying Medium article.
Sparkling Titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
Pyspark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Cluster Pack
A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Spark Syntax
This is a repo documenting the best practices in PySpark.
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Pyspark Boilerplate
A boilerplate for writing PySpark Jobs
Spark Gotchas
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
incubator-linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Spark-and-Kafka IoT-Data-Processing-and-Analytics
Final Project for IoT: Big Data Processing and Analytics class. Analyzing U.S. nationwide temperature from IoT sensors in real time
Azure-Databricks-NYC-Taxi-Workshop
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark