Gcs ToolsGCS support for avro-tools, parquet-tools and protobuf
Stars: ✭ 57 (+137.5%)
RatatoolA tool for data sampling, data generation, and data diffing
Stars: ✭ 279 (+1062.5%)
Kafka Storm StarterCode examples that show to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+2933.33%)
EtlboxA lightweight ETL (extract, transform, load) library and data integration toolbox for .NET.
Stars: ✭ 203 (+745.83%)
Goodreads etl pipelineAn end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+3204.17%)
columnifyMake record oriented data to columnar format.
Stars: ✭ 28 (+16.67%)
GafferA large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+6741.67%)
ETL-Starter-Kit📁 Extract, Transform, Load (ETL) 👷 refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL related work.
Stars: ✭ 21 (-12.5%)
Parquet4sRead and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Stars: ✭ 125 (+420.83%)
Spark With PythonFundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+525%)
hadoopofficeHadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+133.33%)
BenderBender - Serverless ETL Framework
Stars: ✭ 171 (+612.5%)
leaflet heatmap简单的可视化湖州通话数据 假设数据量很大,没法用浏览器直接绘制热力图,把绘制热力图这一步骤放到线下计算分析。使用Apache Spark并行计算数据之后,再使用Apache Spark绘制热力图,然后用leafletjs加载OpenStreetMap图层和热力图图层,以达到良好的交互效果。现在使用Apache Spark实现绘制,可能是Apache Spark不擅长这方面的计算或者是我没有设计好算法,并行计算的速度比不上单机计算。Apache Spark绘制热力图和计算代码在这 https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-45.83%)
Griffon VmGriffon Data Science Virtual Machine
Stars: ✭ 128 (+433.33%)
dswarman open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)
Stars: ✭ 57 (+137.5%)
Elasticsearch loaderA tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
Stars: ✭ 300 (+1150%)
Bigdata💎🔥大数据学习笔记
Stars: ✭ 488 (+1933.33%)
God Of Bigdata专注大数据学习面试,大数据成神之路开启。Flink/Spark/Hadoop/Hbase/Hive...
Stars: ✭ 6,008 (+24933.33%)
Luigi WarehouseA luigi powered analytics / warehouse stack
Stars: ✭ 72 (+200%)
TILToday I Learned
Stars: ✭ 43 (+79.17%)
Wifi基于wifi抓取信息的大数据查询分析系统
Stars: ✭ 93 (+287.5%)
Hadoop cookbookCookbook to install Hadoop 2.0+ using Chef
Stars: ✭ 82 (+241.67%)
Ether sqlA python library to push ethereum blockchain data into an sql database.
Stars: ✭ 41 (+70.83%)
HadoopcryptoledgerHadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (+425%)
SparkrdmaRDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+795.83%)
Ethereum EtlPython scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (+3883.33%)
Etl with pythonETL with Python - Taught at DWH course 2017 (TAU)
Stars: ✭ 68 (+183.33%)
Etl.netMass processing data with a complete ETL for .net developers
Stars: ✭ 129 (+437.5%)
Metlmito ETL tool
Stars: ✭ 153 (+537.5%)
Dist KerasDistributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (+2454.17%)
waspWASP is a framework to build complex real time big data applications. It relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark. If you need to ingest huge amount of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-20.83%)
autThe Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+362.5%)
ParquetviewerSimple windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (+504.17%)
DataX-srcDataX 是异构数据广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HBase、OTS、ODPS 等各种异构数据源之间高效的数据同步功能。
Stars: ✭ 21 (-12.5%)
learning-hadoop-and-sparkCompanion to Learning Hadoop and Learning Spark courses on Linked In Learning
Stars: ✭ 146 (+508.33%)
link-moveA model-driven dynamically-configurable framework to acquire data from external sources and save it to your database.
Stars: ✭ 32 (+33.33%)
smart-data-lakeSmart Automation Tool for building modern Data Lakes and Data Pipelines
Stars: ✭ 79 (+229.17%)
BETL-oldBETL. Meta data driven ETL generation using T-SQL
Stars: ✭ 17 (-29.17%)
Csv2dbThe CSV to database command line loader
Stars: ✭ 102 (+325%)
AirflowETLBlog post on ETL pipelines with Airflow
Stars: ✭ 20 (-16.67%)
openmrs-fhir-analyticsA collection of tools for extracting FHIR resources and analytics services on top of that data.
Stars: ✭ 55 (+129.17%)
parquet-extraA collection of Apache Parquet add-on modules
Stars: ✭ 30 (+25%)
dpkb大数据相关内容汇总,包括分布式存储引擎、分布式计算引擎、数仓建设等。关键词:Hadoop、HBase、ES、Kudu、Hive、Presto、Spark、Flink、Kylin、ClickHouse
Stars: ✭ 123 (+412.5%)
dockerfilesMulti docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+20.83%)
TransformalizeConfigurable Extract, Transform, and Load
Stars: ✭ 125 (+420.83%)
ButterfreeA tool for building feature stores.
Stars: ✭ 126 (+425%)
Hive Jdbc Uber JarHive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Stars: ✭ 188 (+683.33%)
Omniparseromniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
Stars: ✭ 148 (+516.67%)
hive to es同步Hive数据仓库数据到Elasticsearch的小工具
Stars: ✭ 21 (-12.5%)
the-apache-ignite-bookAll code samples, scripts and more in-depth examples for The Apache Ignite Book. Include Apache Ignite 2.6 or above
Stars: ✭ 65 (+170.83%)
darwinAvro Schema Evolution made easy
Stars: ✭ 26 (+8.33%)