Spark: Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.
Stars: ✭ 55 (-95.04%)
Scriptis: Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF and function management, resource management, and intelligent diagnosis.
Stars: ✭ 696 (-37.18%)
DaFlow: An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-97.83%)
mmtf-workshop-2018: Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-95.49%)
jupyterlab-sparkmonitor: JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (-92.96%)
Pucket: Bucketing and partitioning system for Parquet
Stars: ✭ 29 (-97.38%)
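To illustrate the bucketing idea behind a tool like Pucket (this is a generic sketch of hash bucketing in plain Python, not Pucket's actual API), records with the same key are routed to the same bucket by a stable hash, so files for matching keys can be co-located:

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a record key to a bucket via a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Records with the same key always land in the same bucket, so an
# equi-join on the key can proceed bucket-by-bucket without a full shuffle.
records = ["user-1", "user-2", "user-3", "user-1"]
buckets = {}
for key in records:
    buckets.setdefault(bucket_for(key, num_buckets=4), []).append(key)
```

Note the hash must be stable across processes (hence `hashlib` rather than Python's salted built-in `hash`).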
spark-extension: A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-97.74%)
hadoop-etl-udfs: The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL.
Stars: ✭ 17 (-98.47%)
Spark Syntax: This is a repo documenting the best practices in PySpark.
Stars: ✭ 412 (-62.82%)
Node Parquet: Node.js module to access Apache Parquet format files
Stars: ✭ 46 (-95.85%)
jobAnalytics and search: JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-97.74%)
columnify: Converts record-oriented data to columnar format.
Stars: ✭ 28 (-97.47%)
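The core transformation behind a record-to-columnar converter like columnify can be sketched in plain Python (this is an illustration of the general idea, not columnify's implementation): rows of field/value pairs are transposed into one value list per column.

```python
# Row-oriented records, as they might arrive from a JSONL source.
records = [
    {"id": 1, "name": "ada"},
    {"id": 2, "name": "grace"},
]

def to_columnar(rows):
    """Transpose row-oriented dicts into one list per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

columnar = to_columnar(records)
# columnar == {"id": [1, 2], "name": ["ada", "grace"]}
```

Columnar layouts like this are what formats such as Parquet encode on disk, enabling per-column compression and reading only the columns a query needs.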
incubator-linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...) and exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+121.93%)
Iceberg: Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (-64.53%)
Sparkora: Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (-95.4%)
HybridBackend: Efficient training of deep recommenders in the cloud.
Stars: ✭ 30 (-97.29%)
datalake-etl-pipeline: Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Stars: ✭ 39 (-96.48%)
Sparkling Titanic: Training models with Apache Spark (PySpark) for the Titanic Kaggle competition
Stars: ✭ 12 (-98.92%)
OSCI: Open Source Contributor Index
Stars: ✭ 107 (-90.34%)
ChoETL: ETL framework for .NET / C# (parser/writer for CSV, flat, XML, JSON, key-value, Parquet, YAML, and Avro formatted files)
Stars: ✭ 372 (-66.43%)
pyspark-algorithms: PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-93.5%)
Gcs Tools: GCS support for avro-tools, parquet-tools, and protobuf
Stars: ✭ 57 (-94.86%)
centurion: Kotlin Big Data Toolkit
Stars: ✭ 320 (-71.12%)
qsv: CSVs sliced, diced & analyzed.
Stars: ✭ 438 (-60.47%)
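The kind of slicing a CSV toolkit like qsv performs can be approximated with Python's standard `csv` module (a minimal sketch, not qsv's CLI or behavior): keep the header and extract a window of data rows.

```python
import csv
import io

raw = "city,pop\nparis,2000000\nlyon,500000\nnice,340000\n"

def slice_csv(text: str, start: int, count: int):
    """Return the header plus `count` data rows beginning at data row `start`."""
    rows = list(csv.reader(io.StringIO(text)))
    return [rows[0]] + rows[1 + start : 1 + start + count]

# Header plus the 2nd and 3rd data rows.
sliced = slice_csv(raw, start=1, count=2)
```

A streaming tool would avoid materializing `rows` for large files; this version favors clarity over memory efficiency.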
jgit-spark-connector: jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Stars: ✭ 71 (-93.59%)
experiments: Code examples for my blog posts
Stars: ✭ 21 (-98.1%)
pyspark-cassandra: pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible with Spark 2.0, 2.1, 2.2, 2.3, and 2.4
Stars: ✭ 70 (-93.68%)
openmrs-fhir-analytics: A collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (-95.04%)
big data: A collection of tutorials on Hadoop, MapReduce, Spark, and Docker
Stars: ✭ 34 (-96.93%)
workshop-spark: Code for Spark workshops, with a Docker-based development environment
Stars: ✭ 27 (-97.56%)
Morphl Community Edition: MorphL Community Edition uses big data and machine learning to predict user behavior in digital products and services, with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (-77.17%)
Gimel: Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (-80.51%)
Optimus: 🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf, and pyspark
Stars: ✭ 986 (-11.01%)
Spark Practice: Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (-81.95%)
DataEngineering: This repo contains commands that data engineers use in day-to-day work.
Stars: ✭ 47 (-95.76%)
Elasticsearch loader: A tool for batch loading data files (JSON, Parquet, CSV, TSV) into Elasticsearch
Stars: ✭ 300 (-72.92%)
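Bulk loaders like this typically group documents into fixed-size batches before sending each batch in one request. A stdlib-only sketch of that batching step (illustrative only; not the tool's actual code or Elasticsearch's client API):

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items, e.g. for bulk indexing."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

docs = [{"id": i} for i in range(5)]
# Five documents in batches of two -> batch sizes 2, 2, 1.
batches = list(batched(docs, 2))
```

In a real loader, each `chunk` would be serialized and POSTed to the bulk endpoint; batching amortizes per-request overhead.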
Linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...) and exposes various interfaces (REST, JDBC, Java, ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+109.66%)
graphique: GraphQL service for Arrow tables and Parquet data sets.
Stars: ✭ 28 (-97.47%)
sparklanes: A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-98.47%)
Rumble: ⛈️ Rumble 1.11.0 "Banyan Tree" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-94.77%)
Awesome Spark: A curated list of awesome Apache Spark packages and resources.
Stars: ✭ 1,061 (-4.24%)
Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (-13.9%)
Tdigest: t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
Stars: ✭ 274 (-75.27%)
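For context on what a t-Digest approximates: exact quantiles require holding (or sorting) all values, as in this stdlib baseline, whereas a t-Digest estimates the same cut points from a bounded-memory summary of a stream. This is only the exact baseline, not the Tdigest library's API:

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Exact quartiles via the stdlib (default "exclusive" method).
# A t-Digest would approximate these from a stream without
# keeping all values in memory, and digests can be merged,
# which is what makes it useful in distributed settings.
q1, median, q3 = statistics.quantiles(values, n=4)
```

Mergeability is the key distributed property: each PySpark partition can build its own digest, and the driver combines them into one estimate.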