datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-82.03%)
isarn-sketches-spark
Routines and data structures for using isarn-sketches idiomatically in Apache Spark
Stars: ✭ 28 (-87.1%)
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (+388.94%)
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (-47%)
Pyspark Stubs
Apache (Py)Spark type annotations (stub files).
Stars: ✭ 98 (-54.84%)
spark3D
Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
Stars: ✭ 23 (-89.4%)
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-76.96%)
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-48.85%)
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (-64.06%)
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+1235.94%)
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-30.88%)
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+1446.08%)
Sparkora
Powerful, rapid, automatic EDA and feature engineering library with a very easy-to-use API 🌟
Stars: ✭ 51 (-76.5%)
Spark Gotchas
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
Stars: ✭ 308 (+41.94%)
Live log analyzer spark
Spark application for analyzing Apache access logs and detecting anomalies. Accompanied by a Medium article.
Stars: ✭ 14 (-93.55%)
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-50.23%)
Spark On K8s Operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Stars: ✭ 1,780 (+720.28%)
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+1060.37%)
Albedo
A recommender system for discovering GitHub repos, built with Apache Spark
Stars: ✭ 149 (-31.34%)
Docker Spark
Apache Spark Docker image
Stars: ✭ 1,396 (+543.32%)
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-41.94%)
Handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
Stars: ✭ 158 (-27.19%)
Bigdata Playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (-18.43%)
Hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Stars: ✭ 108 (-50.23%)
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-51.61%)
Cc Pyspark
Process Common Crawl data with Python and Spark
Stars: ✭ 147 (-32.26%)
Spark Py Notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+516.59%)
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets on Bitcoin
Stars: ✭ 91 (-58.06%)
Parquetviewer
Simple Windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (-33.18%)
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-0.92%)
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (-7.83%)
Cuesheet
A framework for writing Spark 2.x applications in a pretty way
Stars: ✭ 86 (-60.37%)
Oryx
Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
Stars: ✭ 1,785 (+722.58%)
Spark States
Custom state store providers for Apache Spark
Stars: ✭ 83 (-61.75%)
Mlflow
Open source platform for the machine learning lifecycle
Stars: ✭ 10,898 (+4922.12%)
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (-33.64%)
W2v
Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-70.51%)
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-70.97%)
Whylogs Java
Profile and monitor your ML data pipeline end-to-end
Stars: ✭ 164 (-24.42%)
Scalable Data Science
Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical, and computational foundations using SageMath.
Stars: ✭ 142 (-34.56%)
Petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Stars: ✭ 1,108 (+410.6%)
Awesome Pulsar
A curated list of Pulsar tools, integrations and resources.
Stars: ✭ 57 (-73.73%)
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-35.48%)
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-74.65%)
Analytics Zoo
Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Stars: ✭ 2,448 (+1028.11%)
Spark Atlas Connector
A Spark Atlas connector to track data lineage in Apache Atlas
Stars: ✭ 160 (-26.27%)