
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.

Stars: ✭ 368 (+717.78%)

Mutual labels: spark, apache-spark

Sparklearning

Learning Apache Spark, including code and data. Most parts can run locally.

Stars: ✭ 558 (+1140%)

Mutual labels: spark, ml

Sk Dist

Distributed scikit-learn meta-estimators in PySpark

Stars: ✭ 260 (+477.78%)

Mutual labels: spark, ml

Sparklyr

R interface for Apache Spark

Stars: ✭ 775 (+1622.22%)

Mutual labels: spark, apache-spark

Goodreads etl pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Stars: ✭ 793 (+1662.22%)

Mutual labels: spark, apache-spark

Mobius

C# and F# language binding and extensions to Apache Spark

Stars: ✭ 929 (+1964.44%)

Mutual labels: spark, apache-spark

Coolplayspark

CoolplaySpark: Spark source code analysis, Spark libraries, and more

Stars: ✭ 3,318 (+7273.33%)

Mutual labels: spark, apache-spark

Learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Stars: ✭ 307 (+582.22%)

Mutual labels: spark, apache-spark

Spark Structured Streaming Book

The Internals of Spark Structured Streaming

Stars: ✭ 371 (+724.44%)

Mutual labels: spark, apache-spark

Spark Notebook

Interactive and Reactive Data Science using Scala and Spark.

Stars: ✭ 3,081 (+6746.67%)

Mutual labels: spark, apache-spark

Featran

A Scala feature transformation library for data science and machine learning

Stars: ✭ 420 (+833.33%)

Mutual labels: spark, ml

Spark Flamegraph

Easy CPU Profiling for Apache Spark applications

Stars: ✭ 30 (-33.33%)

Mutual labels: spark, apache-spark

spark-structured-streaming-examples

Spark Structured Streaming examples using version 3.0.0

Stars: ✭ 23 (-48.89%)

Mutual labels: spark, apache-spark

Spark Jupyter Aws

A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support

Stars: ✭ 259 (+475.56%)

Mutual labels: spark, apache-spark

Kafka Storm Starter

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

Stars: ✭ 728 (+1517.78%)

Mutual labels: spark, apache-spark


SparkTDA

The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

[x] Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
[x] Scalable Mapper Implementation
[ ] Scalable Multiscale Mapper Implementation
[ ] Scalable Tower Computation for Multiscale Mapper
[ ] Scalable Persistent Homology Computation on Top of Apache Spark

To learn how to use the features above, or to learn more about their implementation details, please follow the links.

Status

WIP and EXPERIMENTAL. This package is still a proof of concept of scalable topological data analysis support for Apache Spark, so it is not ready for production use.

Examples

Mapper

2-skeletons of the Reeb diagram of MNIST (40 intervals on the 1st principal component with 50% overlap): 60k images clustered in 784 dimensions without any projection loss
2-skeletons of the Reeb diagram of MNIST (20 intervals on the 1st principal component with 50% overlap): 60k images clustered in 784 dimensions without any projection loss
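
The captions above describe covers of 40 and 20 intervals on the first principal component with 50% overlap. As a rough illustration of what such a cover looks like, here is a minimal Python sketch. It is not SparkTDA's API: the function names `overlapping_cover` and `members` are hypothetical, and the convention that consecutive equal-length intervals share a fixed fraction of their length is an assumption (other Mapper implementations define overlap differently).

```python
from typing import List, Sequence, Tuple


def overlapping_cover(lo: float, hi: float, n: int, overlap: float) -> List[Tuple[float, float]]:
    """Split [lo, hi] into n equal-length intervals where consecutive
    intervals share `overlap` (a fraction in [0, 1)) of their length.

    Assumed convention, not taken from SparkTDA.
    """
    if n < 1 or not 0.0 <= overlap < 1.0 or hi <= lo:
        raise ValueError("need n >= 1, 0 <= overlap < 1 and lo < hi")
    # n intervals of length `length`, each shifted by `step`, must span hi - lo:
    #   length + (n - 1) * step = hi - lo, with step = length * (1 - overlap)
    length = (hi - lo) / (1 + (n - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n)]


def members(values: Sequence[float], interval: Tuple[float, float]) -> List[int]:
    """Indices of the filter values that fall inside one cover interval."""
    a, b = interval
    return [i for i, v in enumerate(values) if a <= v <= b]


# Three intervals with 50% overlap on [0, 1].
cover = overlapping_cover(0.0, 1.0, n=3, overlap=0.5)
```

In a Mapper pipeline, the points whose filter value lands in each interval would then be clustered separately, and clusters that share points across overlapping intervals would be connected. With `n=40` or `n=20` and `overlap=0.5`, the same function yields covers of the sizes named in the captions.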

Requirements

This library requires Spark 2.0+

Building and Running Unit Tests

To compile this project, run sbt package from the project home directory. This will also run the Scala unit tests. To run the unit tests, run sbt test from the project home directory.

This project uses the sbt-spark-package plugin, which provides the 'spPublish' and 'spPublishLocal' tasks. We recommend using this library with Apache Spark by supplying a comma-delimited list of Maven coordinates with --packages, so that the package and its dependencies are downloaded from the local repository or the official Spark Packages repository.

The package can be published locally with:

$ sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

$ sbt spPublish

Using with Spark Shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

Future Work

Mapper

[ ] Write Wiki
[ ] Implement Python APIs
[ ] Publish to Spark Packages
[ ] Benchmark
[ ] Consider using GraphFrames instead of plain GraphX
[ ] Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc., as transformers

Related Software & Projects

References

Mapper

KNN/ANN/SNN

LSH

M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms, 34th STOC, 2002.


Stars: ✭ 45
