1517 open source projects that are alternatives to, or similar to, Pyspark Example Project

Goodreads etl pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Stars: ✭ 793 (+25.28%)

Mutual labels: spark, data-engineering

Spark python ml examples

Spark 2.0 Python Machine Learning examples

Stars: ✭ 87 (-86.26%)

Mutual labels: spark, pyspark

Dataspherestudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

Stars: ✭ 1,195 (+88.78%)

Mutual labels: spark, etl

Hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

Stars: ✭ 108 (-82.94%)

Mutual labels: spark, pyspark

Scriptis

Scriptis is for interactive data analysis with script development (SQL, Pyspark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

Stars: ✭ 696 (+9.95%)

Mutual labels: spark, pyspark

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (-76.3%)

Mutual labels: spark, pyspark

Cc Pyspark

Process Common Crawl data with Python and Spark

Stars: ✭ 147 (-76.78%)

Mutual labels: spark, pyspark

Linkis

Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.

Stars: ✭ 2,323 (+266.98%)

Mutual labels: spark, pyspark

Pyspark Learning

Updated repository

Stars: ✭ 147 (-76.78%)

Mutual labels: spark, pyspark

Spark Practice

Apache Spark (PySpark) Practice on Real Data

Stars: ✭ 200 (-68.4%)

Mutual labels: spark, pyspark

Spark Nlp

State of the Art Natural Language Processing

Stars: ✭ 2,518 (+297.79%)

Mutual labels: spark, pyspark

Every Single Day I Tldr

A daily digest of the articles or videos I've found interesting, that I want to share with you.

Stars: ✭ 249 (-60.66%)

Mutual labels: spark, data-engineering

Aws Serverless Data Lake Framework

Enterprise-grade, production-hardened, serverless data lake on AWS

Stars: ✭ 179 (-71.72%)

Mutual labels: etl, data-engineering

Pixiedust

Python Helper library for Jupyter Notebooks

Stars: ✭ 998 (+57.66%)

Mutual labels: data-science, spark

data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

Stars: ✭ 34 (-94.63%)

Mutual labels: spark, pyspark

Data Science On Gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

Stars: ✭ 864 (+36.49%)

Mutual labels: data-science, data-engineering

AirflowDataPipeline

Example of an ETL Pipeline using Airflow

Stars: ✭ 24 (-96.21%)

Mutual labels: etl, data-engineering

etl manager

A Python package to create a database on the platform using our moj data warehousing framework

Stars: ✭ 14 (-97.79%)

Mutual labels: etl, data-engineering

Etl with python

ETL with Python - Taught at DWH course 2017 (TAU)

Stars: ✭ 68 (-89.26%)

Mutual labels: data-science, etl

Benthos

Fancy stream processing made operationally mundane

Stars: ✭ 3,705 (+485.31%)

Mutual labels: etl, data-engineering

Sk Dist

Distributed scikit-learn meta-estimators in PySpark

Stars: ✭ 260 (-58.93%)

Mutual labels: data-science, spark

Spark Notebook

Interactive and Reactive Data Science using Scala and Spark.

Stars: ✭ 3,081 (+386.73%)

Mutual labels: data-science, spark

Dataform

Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift

Stars: ✭ 342 (-45.97%)

Mutual labels: etl, data-engineering

Just Dashboard

📊 📋 Dashboards using YAML or JSON files

Stars: ✭ 1,511 (+138.7%)

Mutual labels: data-science, data-engineering

Python Bigdata

Data science and Big Data with Python

Stars: ✭ 112 (-82.31%)

Mutual labels: data-science, spark

Accelerator

The Accelerator is a tool for fast and reproducible processing of large amounts of data.

Stars: ✭ 137 (-78.36%)

Mutual labels: data-science, data-engineering

Pipelinex

PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more

Stars: ✭ 127 (-79.94%)

Mutual labels: data-science, data-engineering

Wedatasphere

WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!

Stars: ✭ 372 (-41.23%)

Mutual labels: spark, etl

Metorikku

A simplified, lightweight ETL Framework based on Apache Spark

Stars: ✭ 361 (-42.97%)

Mutual labels: spark, etl

Datacleaner

The premier open source Data Quality solution

Stars: ✭ 391 (-38.23%)

Mutual labels: data-science, etl

Elastic

R client for the Elasticsearch HTTP API

Stars: ✭ 227 (-64.14%)

Mutual labels: data-science, etl

Gspread Pandas

A package to easily open an instance of a Google spreadsheet and interact with worksheets through Pandas DataFrames.

Stars: ✭ 226 (-64.3%)

Mutual labels: data-science, data-engineering

Koalas

Koalas: pandas API on Apache Spark

Stars: ✭ 3,044 (+380.88%)

Mutual labels: data-science, spark

Cql

Categorical Query Language IDE

Stars: ✭ 196 (-69.04%)

Mutual labels: data-science, etl

soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

Stars: ✭ 58 (-90.84%)

Mutual labels: pyspark, data-engineering

etl

[READ-ONLY] PHP - ETL (Extract Transform Load) data processing library

Stars: ✭ 279 (-55.92%)

Mutual labels: etl, data-engineering

polygon-etl

ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub

Stars: ✭ 53 (-91.63%)

Mutual labels: etl, data-engineering

Soda Sql

Metric collection, data testing and monitoring for SQL accessible data

Stars: ✭ 173 (-72.67%)

Mutual labels: data-science, data-engineering

versatile-data-kit

Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.

Stars: ✭ 144 (-77.25%)

Mutual labels: etl, data-engineering

hamilton

A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.

Stars: ✭ 612 (-3.32%)

Mutual labels: etl, data-engineering

sparklanes

A lightweight data processing framework for Apache Spark

Stars: ✭ 17 (-97.31%)

Mutual labels: etl, pyspark

python mozetl

ETL jobs for Firefox Telemetry

Stars: ✭ 25 (-96.05%)

Mutual labels: etl, pyspark

data processing course

Some class materials for a data processing course using PySpark

Stars: ✭ 50 (-92.1%)

Mutual labels: spark, pyspark

ODSC India 2018

My presentation at ODSC India 2018 about Deep Learning with Apache Spark

Stars: ✭ 26 (-95.89%)

Mutual labels: spark, pyspark

kafka-compose

🎼 Docker compose files for various kafka stacks

Stars: ✭ 32 (-94.94%)

Mutual labels: spark, pyspark

spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

Stars: ✭ 25 (-96.05%)

Mutual labels: spark, pyspark

beneath

Beneath is a serverless real-time data platform ⚡️

Stars: ✭ 65 (-89.73%)

Mutual labels: etl, data-engineering

Agile data code 2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Stars: ✭ 413 (-34.76%)

Mutual labels: data-science, spark

arthur-redshift-etl

ELT Code for your Data Warehouse

Stars: ✭ 22 (-96.52%)

Mutual labels: etl, data-engineering

Datavec

ETL Library for Machine Learning - data pipelines, data munging and wrangling

Stars: ✭ 272 (-57.03%)

Mutual labels: spark, etl

Around Dataengineering

A Data Engineering & Machine Learning Knowledge Hub

Stars: ✭ 257 (-59.4%)

Mutual labels: spark, data-engineering

Dagster

An orchestration platform for the development, production, and observation of data assets.

Stars: ✭ 4,099 (+547.55%)

Mutual labels: data-science, etl

Auptimizer

An automatic ML model optimization tool.

Stars: ✭ 166 (-73.78%)

Mutual labels: data-science, data-engineering

incubator-linkis

Stars: ✭ 2,459 (+288.47%)

Mutual labels: spark, pyspark

Learn Something Every Day

📝 A compilation of everything that I learn; Computer Science, Software Development, Engineering, Math, and Coding in General.

Stars: ✭ 362 (-42.81%)

Mutual labels: data-science, data-engineering

Devops Python Tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Stars: ✭ 406 (-35.86%)

Mutual labels: spark, pyspark

Great expectations

Always know what to expect from your data.

Stars: ✭ 5,808 (+817.54%)

Mutual labels: data-science, data-engineering

Heamy

A set of useful tools for competitive data science.

Stars: ✭ 511 (-19.27%)

Mutual labels: data-science

Pygam

[HELP REQUESTED] Generalized Additive Models in Python