1517 open source projects that are alternatives to, or similar to, Pyspark Example Project

Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-80.09%)
Mutual labels:  data-science, etl, data-engineering, pyspark
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-87.52%)
Mutual labels:  data-science, spark, etl, data-engineering
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+55.77%)
Mutual labels:  data-science, spark, pyspark
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+677.09%)
Mutual labels:  data-science, etl, data-engineering
Sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (-87.52%)
Mutual labels:  data-science, etl, data-engineering
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+111.37%)
Mutual labels:  data-science, spark, pyspark
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-82.94%)
Mutual labels:  data-science, spark, pyspark
W2v
Word2Vec models with Twitter data using Spark.
Stars: ✭ 64 (-89.89%)
Mutual labels:  data-science, spark, pyspark
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-75.99%)
Mutual labels:  data-science, spark, data-engineering
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.
Stars: ✭ 25 (-96.05%)
Mutual labels:  spark, etl, pyspark
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-80.73%)
Mutual labels:  data-science, spark, data-engineering
Aws Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+276.78%)
Mutual labels:  data-science, etl, data-engineering
Cql
Categorical Query Language IDE
Stars: ✭ 196 (-69.04%)
Mutual labels:  data-science, etl
Gspread Pandas
A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.
Stars: ✭ 226 (-64.3%)
Mutual labels:  data-science, data-engineering
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (-64.14%)
Mutual labels:  data-science, spark
hive-metastore-client
A client for connecting and running DDLs on hive metastore.
Stars: ✭ 37 (-94.15%)
Mutual labels:  etl, data-engineering
Ploomber
A convention over configuration workflow orchestrator. Develop locally (Jupyter or your favorite editor), deploy to Airflow or Kubernetes.
Stars: ✭ 221 (-65.09%)
Mutual labels:  data-science, data-engineering
etl
[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library
Stars: ✭ 279 (-55.92%)
Mutual labels:  etl, data-engineering
morph-kgc
Powerful RDF Knowledge Graph Generation with [R2]RML Mappings
Stars: ✭ 77 (-87.84%)
Mutual labels:  etl, data-engineering
python mozetl
ETL jobs for Firefox Telemetry
Stars: ✭ 25 (-96.05%)
Mutual labels:  etl, pyspark
hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (-3.32%)
Mutual labels:  etl, data-engineering
gallia-core
A schema-aware Scala library for data transformation
Stars: ✭ 44 (-93.05%)
Mutual labels:  etl, data-engineering
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-97.47%)
Mutual labels:  etl, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-93.84%)
Mutual labels:  etl, pyspark
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-96.05%)
Mutual labels:  pyspark, data-engineering
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (-95.89%)
Mutual labels:  spark, pyspark
pangeo-forge-recipes
Python library for building Pangeo Forge recipes.
Stars: ✭ 64 (-89.89%)
Mutual labels:  etl, data-engineering
Soda Sql
Metric collection, data testing and monitoring for SQL accessible data
Stars: ✭ 173 (-72.67%)
Mutual labels:  data-science, data-engineering
Elastic
R client for the Elasticsearch HTTP API
Stars: ✭ 227 (-64.14%)
Mutual labels:  data-science, etl
Auptimizer
An automatic ML model optimization tool.
Stars: ✭ 166 (-73.78%)
Mutual labels:  data-science, data-engineering
AirflowETL
Blog post on ETL pipelines with Airflow
Stars: ✭ 20 (-96.84%)
Mutual labels:  etl, data-engineering
Koalas
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+380.88%)
Mutual labels:  data-science, spark
soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
Stars: ✭ 58 (-90.84%)
Mutual labels:  pyspark, data-engineering
Scalable Data Science Platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
Stars: ✭ 158 (-75.04%)
Mutual labels:  data-science, spark
uptasticsearch
An Elasticsearch client tailored to data science workflows.
Stars: ✭ 47 (-92.58%)
Mutual labels:  etl, data-engineering
blockchain-etl-streaming
Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (-91%)
Mutual labels:  etl, data-engineering
versatile-data-kit
Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (-77.25%)
Mutual labels:  etl, data-engineering
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (-91.63%)
Mutual labels:  etl, data-engineering
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (-92.58%)
Mutual labels:  pyspark, data-engineering
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-97.31%)
Mutual labels:  etl, pyspark
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-92.1%)
Mutual labels:  spark, pyspark
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-96.05%)
Mutual labels:  spark, pyspark
beneath
Beneath is a serverless real-time data platform ⚡️
Stars: ✭ 65 (-89.73%)
Mutual labels:  etl, data-engineering
arthur-redshift-etl
ELT Code for your Data Warehouse
Stars: ✭ 22 (-96.52%)
Mutual labels:  etl, data-engineering
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-82.46%)
Mutual labels:  spark, pyspark
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (-94.63%)
Mutual labels:  spark, pyspark
AirflowDataPipeline
Example of an ETL Pipeline using Airflow
Stars: ✭ 24 (-96.21%)
Mutual labels:  etl, data-engineering
Around Dataengineering
A Data Engineering & Machine Learning Knowledge Hub
Stars: ✭ 257 (-59.4%)
Mutual labels:  spark, data-engineering
Sk Dist
Distributed scikit-learn meta-estimators in PySpark
Stars: ✭ 260 (-58.93%)
Mutual labels:  data-science, spark
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (-57.03%)
Mutual labels:  spark, etl
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Stars: ✭ 3,081 (+386.73%)
Mutual labels:  data-science, spark
incubator-linkis
Linkis makes it easy to connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+288.47%)
Mutual labels:  spark, pyspark
etl manager
A Python package to create a database on the platform using our moj data warehousing framework
Stars: ✭ 14 (-97.79%)
Mutual labels:  etl, data-engineering
Benthos
Fancy stream processing made operationally mundane
Stars: ✭ 3,705 (+485.31%)
Mutual labels:  etl, data-engineering
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+793.52%)
Mutual labels:  data-science, spark
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (-42.97%)
Mutual labels:  spark, etl
Learn Something Every Day
📝 A compilation of everything that I learn: Computer Science, Software Development, Engineering, Math, and Coding in General.
Stars: ✭ 362 (-42.81%)
Mutual labels:  data-science, data-engineering
Wedatasphere
WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (-41.23%)
Mutual labels:  spark, etl
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (-35.86%)
Mutual labels:  spark, pyspark
Dataform
Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Stars: ✭ 342 (-45.97%)
Mutual labels:  etl, data-engineering