Clustering4Ever / Clustering4Ever

License: Apache-2.0
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

Clustering 4️⃣ Ever

Welcome to Clustering4️⃣Ever, a Big Data Clustering Library gathering clustering, unsupervised algorithms, and quality indices. Don't hesitate to check our Wiki, ask questions or make recommendations in our Gitter.

API documentation

Include it in your project

Add the following dependency to the libraryDependencies in your build.sbt:

  • "org.clustering4ever" % "clustering4ever_2.11" % "0.11.0"

If needed, add one of these resolvers:

  • resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")
  • resolvers += "mvnrepository" at "http://mvnrepository.com/artifact/"

You can also take specific parts (Core, ScalaClustering, ...) from Bintray or Maven.
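Put together, the dependency and resolver settings above could look like the following build.sbt fragment. This is a minimal sketch: the coordinates and the resolver are copied from the lines above, while the assumption that your project builds against Scala 2.11 is inferred only from the `_2.11` suffix of the artifact name.

```scala
// build.sbt (fragment)

// Coordinates as given above; the _2.11 suffix suggests a Scala 2.11 build,
// so the project's scalaVersion is assumed to be a 2.11.x release.
libraryDependencies += "org.clustering4ever" % "clustering4ever_2.11" % "0.11.0"

// Optional: only needed if the artifact is not found in the default repositories.
resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")
```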

Available algorithms

  • Emphasized algorithms are implemented in Scala.
  • Bold algorithms are implemented in Spark.
  • An algorithm can be available in both versions.

Clustering algorithms

  • Jenks Natural Breaks
  • Epsilon Proximity*
    • Scalar Epsilon Proximity*, Binary Epsilon Proximity*, Mixed Epsilon Proximity*, Any Object Epsilon Proximity*
  • K-Centers*
    • K-Means*, K-Modes*, K-Prototypes*, Any Object K-Centers*
  • Gaussian Mixture
  • Self Organizing Maps (Original project)
  • G-Stream (Original project)
  • PatchWork (Original project)
  • Random Local Area*
  • OPTICS*
  • Clusterwize
  • Tensor Biclustering algorithms (Original project)
    • Folding-Spectral, Unfolding-Spectral, Thresholding Sum Of Squared Trajectory Length, Thresholding Individuals Trajectory Length, Recursive Biclustering, Multiple Biclustering
  • Ant-Tree*
    • Continuous Ant-Tree, Binary Ant-Tree, Mixed Ant-Tree
  • DC-DPM (Original project) - Distributed Clustering based on Dirichlet Process Mixture
  • SG2Stream

Algorithms followed by a * can be executed by the benchmarking classes.

Preprocessing

  • UMAP
  • Gradient Ascent (Mean-Shift related)
    • Scalar Gradient Ascent, Binary Gradient Ascent, Mixed Gradient Ascent, Any Object Gradient Ascent
  • Rough Set Features Selection

Quality Indices

You can compute quality measures manually with the dedicated classes for local or distributed collections. The helpers ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed let you test indices on multiple clusterings at once.

  • Internal Indices
    • Davies Bouldin
    • Ball Hall
  • External Indices
    • Multiple Classification
      • Mutual Information, Normalized Mutual Information
      • Purity
      • Accuracy, Precision, Recall, fBeta, f1, RAND, ARAND, Matthews correlation coefficient, CzekanowskiDice, RogersTanimoto, FolkesMallows, Jaccard, Kulcztnski, McNemar, RusselRao, SokalSneath1, SokalSneath2
    • Binary Classification
      • Accuracy, Precision, Recall, fBeta, f1
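To make two of the external indices listed above concrete, here is a minimal sketch of purity and mutual information in plain Scala. It does not use the C4E API: the object name `ExternalIndicesSketch` and both function signatures are illustrative assumptions, and the code only follows the textbook definitions on small local collections.

```scala
// Illustrative only: plain-Scala versions of two external indices.
// These names are hypothetical and are not part of Clustering4Ever.
object ExternalIndicesSketch {

  /** Purity: fraction of points that fall in the majority true class of their cluster. */
  def purity(truth: Seq[Int], predicted: Seq[Int]): Double = {
    require(truth.nonEmpty && truth.size == predicted.size, "labelings must be non-empty and of equal length")
    val majoritySum = truth.zip(predicted)
      .groupBy { case (_, cluster) => cluster }                          // points per predicted cluster
      .values
      .map(pairs => pairs.groupBy(_._1).values.map(_.size).max)          // size of the dominant true class
      .sum
    majoritySum.toDouble / truth.size
  }

  /** Mutual information (natural logarithm) between the true and predicted labelings. */
  def mutualInformation(truth: Seq[Int], predicted: Seq[Int]): Double = {
    require(truth.nonEmpty && truth.size == predicted.size, "labelings must be non-empty and of equal length")
    val n = truth.size.toDouble
    val pTruth = truth.groupBy(identity).map { case (label, xs) => label -> xs.size / n }
    val pPred = predicted.groupBy(identity).map { case (label, xs) => label -> xs.size / n }
    truth.zip(predicted)
      .groupBy(identity)                                                 // joint counts per (class, cluster) pair
      .map { case ((t, p), pairs) =>
        val pJoint = pairs.size / n
        pJoint * math.log(pJoint / (pTruth(t) * pPred(p)))
      }
      .sum
  }
}
```

Both functions are invariant to how the clusters are numbered, which is why they are suited to comparing a clustering against ground-truth classes.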

Clustering benchmarking and analysis

The classes ClusteringChainingLocal, BigDataClusteringChaining, DistributedClusteringChaining, and the descendants of ChainingOneAlgorithm let you run multiple clustering algorithms, respectively: locally and in parallel; sequentially in a distributed way; in parallel on a distributed system; and locally and in parallel. They can also generate multiple vectorizations of the data while keeping track of information on each clustering, including the vectorization used, the clustering model, the clustering number, and the clustering arguments.

The classes ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed are devoted to the analysis of clustering indices.

The classes ClustersAnalysisLocal and ClustersAnalysisDistributed describe the obtained clusterings in terms of distributions, proportions of categorical features, and so on.
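The idea of chaining, running several clusterings in parallel over several vectorizations while recording what produced each result, can be sketched in plain Scala with Futures. This is not the C4E chaining API: `ChainingSketch`, `ClusteringRun`, `thresholdClustering`, and `runAll` are hypothetical names, and the "algorithm" is a toy one-dimensional threshold rule chosen only to keep the sketch short.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Illustrative only: none of these names belong to Clustering4Ever.
object ChainingSketch {

  // Bookkeeping kept for every clustering that was run.
  final case class ClusteringRun(
    clusteringNumber: Int,
    vectorizationId: Int,
    algorithm: String,
    arguments: Map[String, Double],
    labels: Vector[Int]
  )

  // Toy algorithm: values below the threshold go to cluster 0, the others to cluster 1.
  def thresholdClustering(data: Vector[Double], threshold: Double): Vector[Int] =
    data.map(x => if (x < threshold) 0 else 1)

  // Runs every (vectorization, threshold) combination in parallel on the local machine.
  def runAll(vectorizations: Map[Int, Vector[Double]], thresholds: Seq[Double]): Seq[ClusteringRun] = {
    val jobs = for {
      (vecId, data) <- vectorizations.toSeq.sortBy(_._1)
      threshold <- thresholds
    } yield (vecId, data, threshold)

    val futures = jobs.zipWithIndex.map { case ((vecId, data, threshold), number) =>
      Future {
        ClusteringRun(number, vecId, "threshold", Map("threshold" -> threshold), thresholdClustering(data, threshold))
      }
    }
    // Future.sequence preserves the order of the submitted jobs.
    Await.result(Future.sequence(futures), 30.seconds)
  }
}
```

Keeping the vectorization identifier and the arguments next to the labels is what makes it possible to compare runs afterwards, for example by feeding each `labels` vector to a quality index.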

Coming soon (developed by our team)

Citation

If you publish material based on information obtained from this repository, please acknowledge in your publication the assistance you received from this community work. This will help others obtain the same information and replicate your experiments, because having results is cool but being able to compare them with others is better.

Citation: @misc{C4E, url = "https://github.com/Clustering4Ever/Clustering4Ever", institution = "Paris 13 University, LIPN UMR CNRS 7030"}

C4E-Notebooks examples

Basic usage of the implemented algorithms is demonstrated in BeakerX and Jupyter notebooks, available through Binder.

The notebooks can also be downloaded directly from our Notebooks repository in different formats, such as Jupyter or SparkNotebook.

Miscellaneous

Helper functions to generate Clusterizable collections

You can easily generate collections of basic Clusterizable objects using the helpers in org.clustering4ever.util.{ArrayAndSeqTowardGVectorImplicit, ScalaCollectionImplicits, SparkImplicits}, or explore Clusterizable and EasyClusterizable for more advanced usage.

References

Which data structures are recommended for best performance

ArrayBuffer or ParArray are recommended as vector containers for local applications. If the data is too big for that, don't hesitate to switch to an RDD.
