Spark With Python: Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-58.45%)
Setl: A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-78.12%)
Spark: Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+8658.45%)
datalake-etl-pipeline: Simplified ETL process in Hadoop using Apache Spark. Provides a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-89.2%)
Hydrograph: A visual ETL development and debugging tool for big data
Stars: ✭ 144 (-60.11%)
Geni: A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-57.89%)
Maha: A framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid.
Stars: ✭ 101 (-72.02%)
Mara Example Project 2: An example mini data warehouse for Python project stats, template for new projects
Stars: ✭ 154 (-57.34%)
Sparkler: Spark-Crawler, an Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+0.28%)
Clickhouse: ClickHouse® is a free analytics DBMS for big data
Stars: ✭ 21,089 (+5741.83%)
Locopy: Loading/Unloading to Redshift and Snowflake using Python.
Stars: ✭ 73 (-79.78%)
Sayn: Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (-78.12%)
Quicksql: A Flexible, Fast, Federated (3F) SQL Analysis Middleware for Multiple Data Sources
Stars: ✭ 1,821 (+404.43%)
Parquet Index: Spark SQL index for Parquet tables
Stars: ✭ 109 (-69.81%)
Presto Go Client: A Presto client for the Go programming language.
Stars: ✭ 183 (-49.31%)
Bulk Writer: Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (-41.83%)
Linkis: Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...) and exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+543.49%)
link-move: A model-driven, dynamically configurable framework to acquire data from external sources and save it to your database.
Stars: ✭ 32 (-91.14%)
vixtract: www.vixtract.ru
Stars: ✭ 40 (-88.92%)
dislib: The Distributed Computing library for Python, implemented using the PyCOMPSs programming model for HPC.
Stars: ✭ 39 (-89.2%)
Sylph: Stream computing platform for big data
Stars: ✭ 362 (+0.28%)
hamilton: A scalable general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.
Stars: ✭ 612 (+69.53%)
DaFlow: An Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
Stars: ✭ 24 (-93.35%)
awesome-AI-kubernetes: ❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++, etc.
Stars: ✭ 95 (-73.68%)
Kamu Cli: Next-generation tool for decentralized exchange and transformation of semi-structured data
Stars: ✭ 69 (-80.89%)
Ether sql: A Python library to push Ethereum blockchain data into an SQL database.
Stars: ✭ 41 (-88.64%)
cubetl: CubETL - Framework and tool for data ETL (Extract, Transform and Load) in Python (PERSONAL PROJECT / SELDOM MAINTAINED)
Stars: ✭ 21 (-94.18%)
spark-acid: ACID Data Source for Apache Spark based on Hive ACID
Stars: ✭ 91 (-74.79%)
frovedis: Framework of vectorized and distributed data analytics
Stars: ✭ 59 (-83.66%)
Presto: The official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+3489.2%)
Ethereum Etl: Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery: https://goo.gl/oY5BCQ
Stars: ✭ 956 (+164.82%)
Xsql: Unified SQL Analytics Engine Based on SparkSQL
Stars: ✭ 176 (-51.25%)
Bitcoin Etl: ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery: https://goo.gl/oY5BCQ
Stars: ✭ 174 (-51.8%)
Calcite: Apache Calcite
Stars: ✭ 2,816 (+680.06%)
Linq2db: Linq to database provider.
Stars: ✭ 2,211 (+512.47%)
Datafuse: Datafuse is a free Cloud-Native Analytics DBMS (inspired by ClickHouse) implemented in Rust
Stars: ✭ 327 (-9.42%)
DIRECT: DIRECT, the Data Integration Run-time Execution Control Tool, is a data logistics framework that can be used to monitor, log, audit and control data integration / ETL processes.
Stars: ✭ 20 (-94.46%)
pyspark-algorithms: PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (-80.06%)
nebula: A distributed block-based data storage and compute engine
Stars: ✭ 127 (-64.82%)
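OpenKettleWebUI: A Kettle-based web scheduling and control platform for data processing. It supports file repositories and database repositories, controls Kettle data transformations through the web platform, and can be integrated into existing systems as middleware.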
Stars: ✭ 138 (-61.77%)
etlflow: EtlFlow is an ecosystem of functional libraries in Scala, based on ZIO, for writing various tasks and jobs on GCP and AWS.
Stars: ✭ 38 (-89.47%)
csvplus: csvplus extends the standard Go encoding/csv package with a fluent interface, lazy stream operations, indices and joins.
Stars: ✭ 67 (-81.44%)
DataBridge.NET: Configurable data bridge for permanent ETL jobs
Stars: ✭ 16 (-95.57%)
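leaflet heatmap: A simple visualization of Huzhou call data. Assuming the data volume is too large to draw the heatmap directly in the browser, the heatmap rendering step is moved to offline computation. Apache Spark first processes the data in parallel and then renders the heatmap; leafletjs then loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark, but the parallel computation is slower than single-machine computation, possibly because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git.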
Stars: ✭ 13 (-96.4%)
qwery: A SQL-like language for performing ETL transformations.
Stars: ✭ 28 (-92.24%)
bandar-log: Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Stars: ✭ 20 (-94.46%)
basin: Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-93.07%)
bigdata-fun: A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-96.12%)
Datavec: ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (-24.65%)
Succinct: Enabling queries on compressed data.
Stars: ✭ 257 (-28.81%)
Trino: Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+1168.98%)
Smooks: An extensible Java framework for building XML and non-XML streaming applications
Stars: ✭ 293 (-18.84%)
Phoenix: Mirror of Apache Phoenix
Stars: ✭ 867 (+140.17%)
BETL-old: BETL. Metadata-driven ETL generation using T-SQL
Stars: ✭ 17 (-95.29%)