Spark With Python - Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-1.32%)
Koalas - Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+1902.63%)
Accelerator - The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (-9.87%)
Setl - A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-48.03%)
Feast - Feature Store for Machine Learning
Stars: ✭ 2,576 (+1594.74%)
Spark Py Notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+780.26%)
aut - The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-26.97%)
pyspark-algorithms - PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-52.63%)
Rsparkling - RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Stars: ✭ 65 (-57.24%)
Just Dashboard - 📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+894.08%)
Spark Alchemy - Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-19.74%)
Metorikku - A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+137.5%)
H2o 3 - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+3621.05%)
Data Science Ipython Notebooks - Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+14405.26%)
Pyspark Example Project - Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+316.45%)
Verticapy - VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
Stars: ✭ 59 (-61.18%)
Spark.jl - Julia binding for Apache Spark
Stars: ✭ 153 (+0.66%)
Datasciencevm - Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Stars: ✭ 153 (+0.66%)
Sayn - Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (-48.03%)
Applied Ml - 📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Stars: ✭ 17,824 (+11626.32%)
Logisland - Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of ready-to-use processors, data sources, and sinks is available.
Stars: ✭ 97 (-36.18%)
Rumble - ⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-61.84%)
Waimak - Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-60.53%)
Pwrake - Parallel Workflow extension for Rake, runs on multicores, clusters, clouds.
Stars: ✭ 57 (-62.5%)
Labs - Research on distributed systems
Stars: ✭ 73 (-51.97%)
Danfojs - danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.
Stars: ✭ 1,304 (+757.89%)
Parapet - A purely functional library to build distributed and event-driven systems
Stars: ✭ 106 (-30.26%)
W2v - Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-57.89%)
Drake - An R-focused pipeline toolkit for reproducibility and high-performance computing
Stars: ✭ 1,301 (+755.92%)
Cookbook - The Data Engineering Cookbook
Stars: ✭ 9,829 (+6366.45%)
Boinc - Open-source software for volunteer computing and grid computing.
Stars: ✭ 1,320 (+768.42%)
Vizuka - Explore high-dimensional datasets and how your algorithm handles specific regions.
Stars: ✭ 100 (-34.21%)
Drake Examples - Example workflows for the drake R package
Stars: ✭ 57 (-62.5%)
Bigdataclass - Two-day workshop that covers how to use R to interact with databases and Spark
Stars: ✭ 110 (-27.63%)
Spark R Notebooks - R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 109 (-28.29%)
Elephas - Distributed Deep learning with Keras & Spark
Stars: ✭ 1,521 (+900.66%)
Datacompy - Pandas and Spark DataFrame comparison for humans
Stars: ✭ 147 (-3.29%)
Pyspark Cheatsheet - 🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-28.95%)
Python Bigdata - Data science and Big Data with Python
Stars: ✭ 112 (-26.32%)
Pythondata - Repo for code published on pythondata.com
Stars: ✭ 113 (-25.66%)
Pyhpc Benchmarks - A suite of benchmarks to test the sequential CPU and GPU performance of the most popular high-performance libraries for Python.
Stars: ✭ 119 (-21.71%)
D6t Python - Accelerate data science
Stars: ✭ 118 (-22.37%)
Opencoarrays - A parallel application binary interface for Fortran 2018 compilers.
Stars: ✭ 151 (-0.66%)
Superset - Apache Superset is a Data Visualization and Data Exploration Platform
Stars: ✭ 42,634 (+27948.68%)
Aws Data Wrangler - Pandas on AWS: easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+1469.08%)
Cape Python - Collaborate on privacy-preserving policy for data science projects in Pandas and Apache Spark
Stars: ✭ 125 (-17.76%)
Batchtools - Tools for computation on batch systems
Stars: ✭ 127 (-16.45%)
Griffon Vm - Griffon Data Science Virtual Machine
Stars: ✭ 128 (-15.79%)
Pipelinex - PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
Stars: ✭ 127 (-16.45%)
Gaffer - A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+980.26%)