Griffon VmGriffon Data Science Virtual Machine
Stars: ✭ 128 (+5.79%)
SynapseMLSimple and Distributed Machine Learning
Stars: ✭ 3,355 (+2672.73%)
sparkucxA high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Stars: ✭ 32 (-73.55%)
autThe Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-8.26%)
datalake-etl-pipelineSimplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (-67.77%)
Spark With PythonFundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+23.97%)
Data AcceleratorData Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+104.13%)
Bigdata PlaygroundA complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+46.28%)
awesome-toolsCurated list of awesome tools and libraries for specific domains
Stars: ✭ 31 (-74.38%)
MorpheusMorpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
Stars: ✭ 303 (+150.41%)
mmtf-workshop-2018Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (-58.68%)
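leaflet heatmapA simple visualization of call data from Huzhou. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved offline for computation and analysis. Apache Spark computes the data in parallel and then renders the heatmap, and leafletjs loads the OpenStreetMap layer and the heatmap layer for good interactivity. The rendering is currently implemented with Apache Spark; possibly because Apache Spark is not well suited to this kind of computation, or because I did not design the algorithm well, the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .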
Stars: ✭ 13 (-89.26%)
pyspark-cheatsheetPySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (-4.96%)
MmlsparkSimple and Distributed Machine Learning
Stars: ✭ 2,899 (+2295.87%)
gan deeplearning4jAutomatic feature engineering with Generative Adversarial Networks, using Deeplearning4j and Apache Spark.
Stars: ✭ 19 (-84.3%)
SparkrdmaRDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+77.69%)
HydrographA visual ETL development and debugging tool for big data
Stars: ✭ 144 (+19.01%)
ParquetviewerSimple Windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (+19.83%)
mmtf-sparkMethods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Stars: ✭ 20 (-83.47%)
spark-recordsBulletproof Apache Spark jobs with fast root cause analysis of failures.
Stars: ✭ 67 (-44.63%)
MistServerless proxy for Spark cluster
Stars: ✭ 309 (+155.37%)
Bitcoin Value Predictor[NOT MAINTAINED] Predicting Bitcoin price using time series analysis and sentiment analysis of tweets about Bitcoin
Stars: ✭ 91 (-24.79%)
Spark On K8s OperatorKubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Stars: ✭ 1,780 (+1371.07%)
Pythondatarepo for code published on pythondata.com
Stars: ✭ 113 (-6.61%)
SplashSplash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-13.22%)
Docker SparkApache Spark docker image
Stars: ✭ 1,396 (+1053.72%)
CuesheetA framework for writing Spark 2.x applications in a pretty way
Stars: ✭ 86 (-28.93%)
Spark StatesCustom state store providers for Apache Spark
Stars: ✭ 83 (-31.4%)
CmakCMAK is a tool for managing Apache Kafka clusters
Stars: ✭ 10,544 (+8614.05%)
AmbariMirror of Apache Ambari
Stars: ✭ 1,576 (+1202.48%)
MahaA framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid.
Stars: ✭ 101 (-16.53%)
PanoptesA Global Scale Network Telemetry Ecosystem
Stars: ✭ 80 (-33.88%)
Uproot4ROOT I/O in pure Python and NumPy.
Stars: ✭ 80 (-33.88%)
VizukaExplore high-dimensional datasets and how your algo handles specific regions.
Stars: ✭ 100 (-17.36%)
IotdbApache IoTDB
Stars: ✭ 1,221 (+909.09%)
SetlA simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-34.71%)
GenieDistributed Big Data Orchestration Service
Stars: ✭ 1,544 (+1176.03%)
Graph samplingGraph Sampling is a Python package containing various approaches that sample the original graph according to different sample sizes.
Stars: ✭ 99 (-18.18%)
MlflowOpen source platform for the machine learning lifecycle
Stars: ✭ 10,898 (+8906.61%)
CookbookThe Data Engineering Cookbook
Stars: ✭ 9,829 (+8023.14%)
Hdfs ShellHDFS Shell is an HDFS manipulation tool for working with the functions integrated in Hadoop DFS
Stars: ✭ 117 (-3.31%)
Amazon S3 Find And ForgetAmazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Stars: ✭ 115 (-4.96%)
BigdataclassTwo-day workshop that covers how to use R to interact with databases and Spark
Stars: ✭ 110 (-9.09%)
LabsResearch on distributed systems
Stars: ✭ 73 (-39.67%)
BookkeeperApache Bookkeeper
Stars: ✭ 1,178 (+873.55%)
KuduMirror of Apache Kudu
Stars: ✭ 1,360 (+1023.97%)