Simplified ETL process in Hadoop using Apache Spark, with a complete ETL pipeline for a data lake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

Stars: ✭ 39 (-96.32%)

Mutual labels: apache-spark, pyspark

pyspark-asyncactions

Asynchronous actions for PySpark

Stars: ✭ 30 (-97.17%)

Mutual labels: apache-spark, pyspark

Pyspark Boilerplate

A boilerplate for writing PySpark Jobs

Stars: ✭ 318 (-70.03%)

Mutual labels: apache-spark, pyspark

Live log analyzer spark

Spark application for analyzing Apache access logs and detecting anomalies, with an accompanying Medium article.

Stars: ✭ 14 (-98.68%)

Mutual labels: apache-spark, pyspark

Pyspark Stubs

Apache (Py)Spark type annotations (stub files).

Stars: ✭ 98 (-90.76%)

Mutual labels: apache-spark, pyspark

isarn-sketches-spark

Routines and data structures for using isarn-sketches idiomatically in Apache Spark

Stars: ✭ 28 (-97.36%)

Mutual labels: apache-spark, pyspark

SynapseML

Simple and Distributed Machine Learning

Stars: ✭ 3,355 (+216.21%)

Mutual labels: apache-spark, pyspark

pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Stars: ✭ 115 (-89.16%)

Mutual labels: apache-spark, pyspark

Spark Syntax

This is a repo documenting the best practices in PySpark.

Stars: ✭ 412 (-61.17%)

Mutual labels: pyspark

Sparkling Titanic

Training models with Apache Spark and PySpark for the Titanic Kaggle competition

Stars: ✭ 12 (-98.87%)

Mutual labels: pyspark

Awesome Kafka

A list about Apache Kafka

Stars: ✭ 397 (-62.58%)

Mutual labels: apache-spark

Sparkmeasure

This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.

Stars: ✭ 368 (-65.32%)

Mutual labels: apache-spark

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

Stars: ✭ 986 (-7.07%)

Mutual labels: pyspark

Pyspark Setup Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

Stars: ✭ 24 (-97.74%)

Mutual labels: pyspark

Wirbelsturm

Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

Stars: ✭ 332 (-68.71%)

Mutual labels: apache-spark

Agile data code 2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Stars: ✭ 413 (-61.07%)

Mutual labels: apache-spark

Devops Python Tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Stars: ✭ 406 (-61.73%)

Mutual labels: pyspark

Dblink

Distributed Bayesian Entity Resolution in Apache Spark

Stars: ✭ 38 (-96.42%)

Mutual labels: apache-spark

Spark Structured Streaming Book

The Internals of Spark Structured Streaming

Stars: ✭ 371 (-65.03%)

Mutual labels: apache-spark

Mobius

C# and F# language binding and extensions to Apache Spark

Stars: ✭ 929 (-12.44%)

Mutual labels: apache-spark

Spark As Service Using Embedded Server

This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server

Stars: ✭ 46 (-95.66%)

Mutual labels: apache-spark

Spark Tdd Example

A simple Spark TDD example

Stars: ✭ 23 (-97.83%)

Mutual labels: pyspark

Coolplayspark

Cool Play Spark: Spark source code analysis, Spark libraries, and more

Stars: ✭ 3,318 (+212.72%)

Mutual labels: apache-spark

Learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Stars: ✭ 307 (-71.07%)

Mutual labels: apache-spark

Real Time Stream Processing Engine

This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.

Stars: ✭ 37 (-96.51%)

Mutual labels: apache-spark

Cluster Pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster

Stars: ✭ 23 (-97.83%)

Mutual labels: pyspark

Mist

Serverless proxy for Spark cluster

Stars: ✭ 309 (-70.88%)

Mutual labels: apache-spark

Morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

Stars: ✭ 303 (-71.44%)

Mutual labels: apache-spark

Goodreads etl pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Stars: ✭ 793 (-25.26%)

Mutual labels: apache-spark

Spark Notebook

Interactive and Reactive Data Science using Scala and Spark.

Stars: ✭ 3,081 (+190.39%)

Mutual labels: apache-spark

Sparkflow

Easy to use library to bring Tensorflow on Apache Spark

Stars: ✭ 282 (-73.42%)

Mutual labels: apache-spark

Spark Sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

Stars: ✭ 1,055 (-0.57%)

Mutual labels: apache-spark

Spark Scala Maven Example

Example Maven configuration for a Spark, Scala project

Stars: ✭ 45 (-95.76%)

Mutual labels: apache-spark

Cloud Based Sql Engine Using Spark

Cloud-based SQL engine using Spark, where data is accessible as a JDBC/ODBC data source via Spark ThriftServer.

Stars: ✭ 30 (-97.17%)

Mutual labels: apache-spark

Sparklyr

R interface for Apache Spark

Stars: ✭ 775 (-26.96%)

Mutual labels: apache-spark

Parquet Dotnet

🏐 Apache Parquet for modern .NET

Stars: ✭ 276 (-73.99%)

Mutual labels: apache-spark

Tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark

Stars: ✭ 274 (-74.18%)

Mutual labels: pyspark

Kafka Storm Starter

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

Stars: ✭ 728 (-31.39%)

Mutual labels: apache-spark

Spark Jupyter Aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

Stars: ✭ 259 (-75.59%)

Mutual labels: apache-spark

spark-structured-streaming-examples

Spark Structured Streaming examples using version 3.0.0

Stars: ✭ 23 (-97.83%)

Mutual labels: apache-spark

Spark Flamegraph

Easy CPU Profiling for Apache Spark applications

Stars: ✭ 30 (-97.17%)

Mutual labels: apache-spark

Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

Stars: ✭ 696 (-34.4%)

Mutual labels: pyspark

HAL-9000

Automatically set up a productive development environment with Ansible on macOS

Stars: ✭ 72 (-93.21%)

Mutual labels: apache-spark

Pyspark Example Project

Example project implementing best practices for PySpark ETL jobs and applications.

Stars: ✭ 633 (-40.34%)

Mutual labels: pyspark

basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

Stars: ✭ 25 (-97.64%)

Mutual labels: pyspark

data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

Stars: ✭ 34 (-96.8%)

Mutual labels: pyspark

Spark Tda

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.

Stars: ✭ 45 (-95.76%)

Mutual labels: apache-spark
