Reddit sse stream: A Server-Sent Events (SSE) stream to deliver Reddit comments and submissions in near real-time to a client.
Optimus: 🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Autocrawler: Google, Naver multiprocess image web crawler (Selenium)
Aws Auto Terminate Idle Emr: The AWS Auto Terminate Idle AWS EMR Clusters Framework is an AWS-based solution that uses AWS CloudWatch and AWS Lambda, with a Python script built on Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.
Panther: Detect threats with log data and improve cloud security posture
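Bigdata Interview: 🎯 🌟 [Big data interview questions] Big data interview questions I collected online, together with summaries of my own answers. Currently covers interview-question summaries for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.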
Mobius: C# and F# language binding and extensions to Apache Spark
10 Weeks: 10 weeks of technology exploration
Hadoop For Geoevent: ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Kube Batch: A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
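Coding Now: Study notes, plus e-books and video resources I have gone through, and blogs, websites, and tools I have bookmarked as worthwhile. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.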
Gearpump: Lightweight real-time big data streaming engine over Akka
Spark Movie Lens: An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀
Bigartm: Fast topic modeling platform
Cds: Data syncing in golang for ClickHouse.
Bigslice: A serverless cluster computing system for the Go programming language
Tensorbase: TensorBase BE is building a high performance, cloud neutral bigdata warehouse for SMEs fully in Rust.
Circosjs: d3 library to build circular graphs
Cortx: CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.
Sidekick: High Performance HTTP Sidecar Load Balancer
Jigsaw: Jigsaw (七巧板) provides a set of web components based on Angular 5/8/9+. Its main purpose is to help application developers build complex, interaction-intensive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data product.
Datawave: DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
Api.rss: RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.
Datafaker: Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts it into various data sources.
Uproot3: ROOT I/O in pure Python and NumPy.
Spline: Data Lineage Tracking And Visualization Solution
Arvados: An open source platform for managing and analyzing biomedical big data
Ldetool: Code generator for fast log file parsers
Big Data Rosetta Code: Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code
DetEdit: A graphical user interface for annotating and editing events detected in long-term acoustic monitoring data
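jigsaw-seed: The seed project for the Jigsaw (七巧板) component library (https://github.com/rdkmaster/jigsaw). All new apps are recommended to start from this project as their seed.
leaflet heatmap: A simple visualization of call data for Huzhou. Assuming the data volume is too large for the browser to draw the heatmap directly, the heatmap rendering step is moved offline. The data is processed in parallel with Apache Spark, the heatmap is then also drawn with Apache Spark, and leafletjs loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark; either Apache Spark is not well suited to this kind of computation or I have not designed the algorithm well, because the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .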
proteic: Streaming and static data visualization for the modern web.
big data: A collection of tutorials on Hadoop, MapReduce, Spark, Docker
ETL-Starter-Kit: 📁 Extract, Transform, Load (ETL) 👷 refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL-related work.
bqv: The simplest tool to manage views of BigQuery.
vulkn: Love your Data. Love the Environment. Love VULKИ.
flokkr: Documentation placeholder and utilities for all the other containers.