Griffon VmGriffon Data Science Virtual Machine
Stars: ✭ 128 (+300%)
autThe Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+246.88%)
datalake-etl-pipelineSimplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+21.88%)
SparkrdmaRDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+571.88%)
leaflet heatmap简单的可视化湖州通话数据 假设数据量很大,没法用浏览器直接绘制热力图,把绘制热力图这一步骤放到线下计算分析。使用Apache Spark并行计算数据之后,再使用Apache Spark绘制热力图,然后用leafletjs加载OpenStreetMap图层和热力图图层,以达到良好的交互效果。现在使用Apache Spark实现绘制,可能是Apache Spark不擅长这方面的计算或者是我没有设计好算法,并行计算的速度比不上单机计算。Apache Spark绘制热力图和计算代码在这 https://github.com/yuanzhaokang/ParallelizeHeatmap.git .
Stars: ✭ 13 (-59.37%)
Spark With PythonFundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+368.75%)
Bigdata PlaygroundA complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL
Stars: ✭ 177 (+453.13%)
CloudbreakA tool for provisioning and managing Apache Hadoop clusters in the cloud. Cloudbreak, as part of the Hortonworks Data Platform, makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure. Cloudbreak can be used to provision Hadoop across cloud infrastructure providers including AWS, Azure, GCP and OpenStack.
Stars: ✭ 301 (+840.63%)
OrcApache ORC - the smallest, fastest columnar storage for Hadoop workloads
Stars: ✭ 389 (+1115.63%)
big dataA collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+6.25%)
pyspark-cheatsheetPySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+259.38%)
MmlsparkSimple and Distributed Machine Learning
Stars: ✭ 2,899 (+8959.38%)
TrinoOfficial repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+14215.63%)
spark-recordsBulletproof Apache Spark jobs with fast root cause analysis of failures.
Stars: ✭ 67 (+109.38%)
HiveApache Hive
Stars: ✭ 4,031 (+12496.88%)
MorpheusMorpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
Stars: ✭ 303 (+846.88%)
Hadoop For GeoeventArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-84.37%)
GafferA large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+5031.25%)
gan deeplearning4jAutomatic feature engineering using Generative Adversarial Networks using Deeplearning4j and Apache Spark.
Stars: ✭ 19 (-40.62%)
HydrographA visual ETL development and debugging tool for big data
Stars: ✭ 144 (+350%)
big-data-liteSamples to the Oracle Big Data Lite VM
Stars: ✭ 41 (+28.13%)
clusterdockclusterdock is a framework for creating Docker-based container clusters
Stars: ✭ 26 (-18.75%)
ParquetviewerSimple windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (+353.13%)
rastercuberastercube is a python library for big data analysis of georeferenced time series data (e.g. MODIS NDVI)
Stars: ✭ 15 (-53.12%)
mmtf-workshop-2018Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+56.25%)
bigdata-funA complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-56.25%)
SynapseMLSimple and Distributed Machine Learning
Stars: ✭ 3,355 (+10384.38%)
OzoneScalable, redundant, and distributed object store for Apache Hadoop
Stars: ✭ 330 (+931.25%)
TezApache Tez
Stars: ✭ 313 (+878.13%)
IgniteApache Ignite
Stars: ✭ 4,027 (+12484.38%)
MistServerless proxy for Spark cluster
Stars: ✭ 309 (+865.63%)
H2o 3H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+17575%)
Data Science Ipython NotebooksData science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+68800%)
MoosefsMooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Stars: ✭ 1,025 (+3103.13%)
Hdfs ShellHDFS Shell is a HDFS manipulation tool to work with functions integrated in Hadoop DFS
Stars: ✭ 117 (+265.63%)
Data AcceleratorData Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+671.88%)
DrillApache Drill is a distributed MPP query layer for self describing data
Stars: ✭ 1,619 (+4959.38%)
Eel SdkBig Data Toolkit for the JVM
Stars: ✭ 140 (+337.5%)
dislibThe Distributed Computing library for python implemented using PyCOMPSs programming model for HPC.
Stars: ✭ 39 (+21.88%)
AsakusafwAsakusa Framework
Stars: ✭ 114 (+256.25%)
CalciteApache Calcite
Stars: ✭ 2,816 (+8700%)
mmtf-sparkMethods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Stars: ✭ 20 (-37.5%)
learning-hadoop-and-sparkCompanion to Learning Hadoop and Learning Spark courses on Linked In Learning
Stars: ✭ 146 (+356.25%)
DaFlowApache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (-25%)
Dist KerasDistributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Stars: ✭ 613 (+1815.63%)
PrestoThe official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+40390.63%)
awesome-toolscurated list of awesome tools and libraries for specific domains
Stars: ✭ 31 (-3.12%)
iisInformation Inference Service of the OpenAIRE system
Stars: ✭ 16 (-50%)