Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

Stars: ✭ 39 (+34.48%)

Mutual labels: big-data, pyspark

soda-spark

Soda Spark is a PySpark library that helps you test your data in Spark DataFrames

Stars: ✭ 58 (+100%)

Mutual labels: pyspark, data-quality

SynapseML

Simple and Distributed Machine Learning

Stars: ✭ 3,355 (+11468.97%)

Mutual labels: big-data, pyspark

siembol

An open-source, real-time Security Information & Event Management tool based on big data technologies, providing a scalable, advanced security analytics framework.

Stars: ✭ 153 (+427.59%)

Mutual labels: big-data

bigquery-kafka-connect

☁️ nodejs kafka connect connector for Google BigQuery

Stars: ✭ 17 (-41.38%)

Mutual labels: big-data

airavata-php-gateway

Mirror of Apache Airavata PHP Gateway

Stars: ✭ 15 (-48.28%)

Mutual labels: big-data

azure-big-data-starter

A boilerplate project for Azure Big Data PaaS services

Stars: ✭ 13 (-55.17%)

Mutual labels: big-data

MLBD

Materials for "Machine Learning on Big Data" course

Stars: ✭ 20 (-31.03%)

Mutual labels: big-data

phrase-at-scale

Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

Stars: ✭ 115 (+296.55%)

Mutual labels: pyspark

beam-site

Apache Beam Site

Stars: ✭ 28 (-3.45%)

Mutual labels: big-data

ceja

PySpark phonetic and string matching algorithms

Stars: ✭ 24 (-17.24%)

Mutual labels: pyspark

LoL-Match-Prediction

Win probability predictions for League of Legends matches using neural networks

Stars: ✭ 34 (+17.24%)

Mutual labels: big-data

pyspark-ML-in-Colab

PySpark in Google Colab: A simple machine learning (Linear Regression) model

Stars: ✭ 32 (+10.34%)

Mutual labels: pyspark

IoT-system-PLC-data-to-InfluxDB

This project aims to provide free software to fetch data from PLCs (Siemens S7-300/400/1200/1500) and store it. The stack used is completely open source. I used InfluxDB as the data store, so the application follows the Big Data paradigm.

Stars: ✭ 26 (-10.34%)

Mutual labels: big-data

versatile-data-kit

Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.

Stars: ✭ 144 (+396.55%)

Mutual labels: data-quality

Big-Data-Demo

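A data visualization showcase project based on Vue, three.js, and ECharts, with features including 3D model import and interaction and 3D model annotation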

Stars: ✭ 146 (+403.45%)

Mutual labels: big-data

leila

Library for data quality assessment and for interacting with the datos.gov.co portal

Stars: ✭ 56 (+93.1%)

Mutual labels: data-quality

pyspark-for-data-processing

Code for my presentation: Using PySpark to Process Boat Loads of Data

Stars: ✭ 20 (-31.03%)

Mutual labels: pyspark

databricks-notebooks

Collection of Databricks and Jupyter Notebooks

Stars: ✭ 19 (-34.48%)

Mutual labels: pyspark

xcast

A High-Performance Data Science Toolkit for the Earth Sciences

Stars: ✭ 28 (-3.45%)

Mutual labels: big-data

rastercube

rastercube is a python library for big data analysis of georeferenced time series data (e.g. MODIS NDVI)

Stars: ✭ 15 (-48.28%)

Mutual labels: big-data

OnlineStatsBase.jl

Base types for OnlineStats.

Stars: ✭ 26 (-10.34%)

Mutual labels: big-data

python mozetl

ETL jobs for Firefox Telemetry

Stars: ✭ 25 (-13.79%)

Mutual labels: pyspark

jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

Stars: ✭ 78 (+168.97%)

Mutual labels: pyspark

CS Book

🔥 Latest computer science e-books. Provides downloads of the latest technical e-books. "All I want is to out-compete every one of you, or be out-competed by you!"

Stars: ✭ 40 (+37.93%)

Mutual labels: big-data

osm-data-classification

Migrated to: https://gitlab.com/Oslandia/osm-data-classification

Stars: ✭ 23 (-20.69%)

Mutual labels: data-quality

spark-records

Bulletproof Apache Spark jobs with fast root cause analysis of failures.

Stars: ✭ 67 (+131.03%)

Mutual labels: big-data

arrow-datafusion

Apache Arrow DataFusion SQL Query Engine

Stars: ✭ 2,360 (+8037.93%)

Mutual labels: big-data

scarf

Toolkit for highly memory-efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas-scale datasets with millions of cells on a laptop.

Stars: ✭ 54 (+86.21%)

Mutual labels: big-data

ByteSlice

"Byteslice: Pushing the envelop of main memory data processing with a new storage layout" (SIGMOD'15)

Stars: ✭ 24 (-17.24%)

Mutual labels: big-data

RemoteShuffleService

Celeborn provides an elastic and high-performance service for shuffle and spilled data.

Stars: ✭ 262 (+803.45%)

Mutual labels: big-data

pyspark-k8s-boilerplate

Boilerplate for PySpark on Cloud Kubernetes

Stars: ✭ 24 (-17.24%)

Mutual labels: pyspark

terraform-aws-kinesis-firehose

This code creates a Kinesis Firehose in AWS to send CloudWatch log data to S3.

Stars: ✭ 25 (-13.79%)

Mutual labels: big-data

classifai

🔥 One of the most comprehensive open-source data annotation platforms.

Stars: ✭ 99 (+241.38%)

Mutual labels: big-data

Sparkora

Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟

Stars: ✭ 51 (+75.86%)

Mutual labels: pyspark

SGDLibrary

MATLAB/Octave library for stochastic optimization algorithms: Version 1.0.20

Stars: ✭ 165 (+468.97%)

Mutual labels: big-data

spark-root

Apache Spark Data Source for ROOT File Format

Stars: ✭ 28 (-3.45%)

Mutual labels: big-data

dxram

A distributed in-memory key-value storage for billions of small objects.

Stars: ✭ 25 (-13.79%)

Mutual labels: big-data

Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Stars: ✭ 47 (+62.07%)

Mutual labels: big-data

insightedge

InsightEdge Core

Stars: ✭ 22 (-24.14%)

Mutual labels: big-data

nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

Stars: ✭ 8,196 (+28162.07%)

Mutual labels: big-data

img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Stars: ✭ 1,173 (+3944.83%)

Mutual labels: big-data

cloudberry

Big Data Visualization

Stars: ✭ 89 (+206.9%)

Mutual labels: big-data

Real Time Social Media Mining

DevOps pipeline for Real Time Social/Web Mining

Stars: ✭ 22 (-24.14%)

Mutual labels: big-data

GDLibrary

Matlab library for gradient descent algorithms: Version 1.0.1

Stars: ✭ 50 (+72.41%)

Mutual labels: big-data

storm-ml

an online learning algorithm library for Storm

Stars: ✭ 18 (-37.93%)

Mutual labels: big-data

IATI.cloud

The open-source IATI datastore for IATI data, with a RESTful web API providing XML, JSON, and CSV output. It extracts and parses IATI XML files referenced in the IATI Registry and is powered by Apache Solr.

Stars: ✭ 35 (+20.69%)

Mutual labels: data-validation

incubator-liminal

Apache Liminal's goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

Stars: ✭ 117 (+303.45%)

Mutual labels: big-data

objectiv-analytics

Powerful product analytics for data teams, with full control over data & models.

Stars: ✭ 399 (+1275.86%)

Mutual labels: data-validation

1-60 of 497 similar projects
