Morphl Community Edition: MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Quinn: pyspark methods to enhance developer productivity 📣 👯 🎉
Gimel: Big Data Processing Framework - Unified Data API or SQL on Any Storage
Mmlspark: Simple and Distributed Machine Learning
Spark Nlp: State of the Art Natural Language Processing
Linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Handyspark: HandySpark - bringing pandas-like capabilities to Spark dataframes
Cc Pyspark: Process Common Crawl data with Python and Spark
Repo 2019: BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
Hnswlib: Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Spark Py Notebooks: Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Bitcoin Value Predictor: [NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin
W2v: Word2Vec models with Twitter data using Spark. Blog:
Petastorm: Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Awesome Spark: A curated list of awesome Apache Spark packages and resources.
Optimus: 🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters
Live log analyzer spark: Spark application that analyzes Apache access logs and detects anomalies, with an accompanying Medium article.
Sparkling Titanic: Training models with Apache Spark and PySpark for the Titanic Kaggle competition
Cluster Pack: A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Scriptis: Scriptis is for interactive data analysis with script development (SQL, Pyspark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Spark Syntax: This is a repo documenting the best practices in PySpark.
Devops Python Tools: 80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Spark Gotchas: Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Tdigest: t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
basin: Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
spark-extension: A library that provides useful extensions to Apache Spark and PySpark.
aut: The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
incubator-linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
ODSC India 2018: My presentation at ODSC India 2018 about Deep Learning with Apache Spark
pyspark-cheatsheet: PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
big data: A collection of tutorials on Hadoop, MapReduce, Spark, Docker
dlsa: Distributed least squares approximation (dlsa) implemented with Apache Spark