Licence: apache-2.0
SparkTDA is a package for Apache Spark that provides topological data analysis functionality.


SparkTDA

Join the chat at https://gitter.im/ognis1205/spark-tda

The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

If you would like to know how to use the above-mentioned features or learn more about their implementation details, please follow the links.

Status

WIP and EXPERIMENTAL. This package is still a proof of concept of scalable topological data analysis support for Apache Spark, so it should not be considered ready for production use.

Examples

Mapper

  • 2-skeletons of the Reeb diagram of MNIST (40 intervals on the 1st principal component with 50% overlap): 60k images clustered in 784 dimensions without any projection loss
  • 2-skeletons of the Reeb diagram of MNIST (20 intervals on the 1st principal component with 50% overlap): 60k images clustered in 784 dimensions without any projection loss
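
The captions above describe a Mapper cover by the number of intervals on the filter value and the overlap between neighbouring intervals. As a rough illustration of what those two parameters mean, here is a minimal sketch in plain Scala that builds such a cover. It is not SparkTDA's API: the object name `CoverSketch`, the method `cover`, and the convention that overlap is a fraction of the interval length are all assumptions made for this example.

```scala
// Hypothetical helper, not part of SparkTDA: builds a uniform overlapping
// cover of the filter range [lo, hi].
object CoverSketch {

  /** Returns n equal-length intervals covering [lo, hi], where each pair of
    * neighbouring intervals shares `overlap` (a fraction in [0, 1)) of the
    * interval length.
    */
  def cover(lo: Double, hi: Double, n: Int, overlap: Double): Seq[(Double, Double)] = {
    require(hi > lo, "range must be non-empty")
    require(n >= 1, "need at least one interval")
    require(overlap >= 0.0 && overlap < 1.0, "overlap must be in [0, 1)")

    // n intervals of length len, each shifted by len * (1 - overlap),
    // must span exactly hi - lo: len + (n - 1) * len * (1 - overlap) = hi - lo.
    val len  = (hi - lo) / (1.0 + (n - 1) * (1.0 - overlap))
    val step = len * (1.0 - overlap)

    (0 until n).map { i =>
      val start = lo + i * step
      (start, start + len)
    }
  }

  def main(args: Array[String]): Unit = {
    // Small demo: 4 intervals with 50% overlap on [0, 1].
    cover(0.0, 1.0, 4, 0.5).foreach { case (a, b) => println(s"[$a, $b]") }
  }
}
```

With 4 intervals and 50% overlap on [0, 1], this yields (up to floating-point rounding) [0, 0.4], [0.2, 0.6], [0.4, 0.8] and [0.6, 1]. Under the same convention, 40 intervals with 50% overlap would each span 1/20.5 of the filter range.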

Requirements

This library requires Spark 2.0 or later.

Building and Running Unit Tests

To compile this project, run sbt package from the project home directory. This will also run the Scala unit tests. To run the unit tests, run sbt test from the project home directory.

This project uses the sbt-spark-package plugin, which provides the spPublish and spPublishLocal tasks. We recommend using this library with Apache Spark by supplying a comma-delimited list of Maven coordinates with --packages, so that the package and its dependencies are downloaded from the local repository or the official Spark Packages repository.

The package can be published locally with:

$ sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

$ sbt spPublish

Using with Spark Shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

Future Work

Mapper

  • [ ] Write Wiki
  • [ ] Implement Python APIs
  • [ ] Publish to Spark Packages
  • [ ] Benchmark
  • [ ] Consider using GraphFrames instead of plain GraphX
  • [ ] Implement some useful filter functions, e.g., Gaussian density, graph Laplacian, etc., as transformers
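
To give an idea of what a Gaussian density filter from the list above computes, here is a minimal sketch in plain Scala. It is only an illustration under stated assumptions, not SparkTDA code and not a Spark transformer: the names `GaussianDensitySketch`, `density` and `sigma` are hypothetical, the kernel is left unnormalized, and the all-pairs loop is quadratic in the number of points, so a distributed version would need a different approach.

```scala
// Hypothetical illustration, not part of SparkTDA: a Gaussian density
// filter evaluated on a small in-memory point set.
object GaussianDensitySketch {
  type Point = Array[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** For each point p, the mean over all points q of
    * exp(-|p - q|^2 / (2 * sigma^2)). Larger values mean denser regions.
    */
  def density(points: Seq[Point], sigma: Double): Seq[Double] = {
    require(points.nonEmpty, "need at least one point")
    require(sigma > 0.0, "sigma must be positive")
    val denom = 2.0 * sigma * sigma
    points.map { p =>
      points.map(q => math.exp(-squaredDistance(p, q) / denom)).sum / points.size
    }
  }
}
```

For example, `GaussianDensitySketch.density(points, sigma = 1.0)` returns one filter value per input point, which could then be used as the filter on which a cover is built.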

Related Software & Projects

  1. Python Mapper
  2. TDAMapper (R)
  3. Spark Mapper (Spark)
  4. KeplerMapper (Python with GUI)
