ceja: PySpark phonetic and string matching algorithms
Stars: ✭ 24 (-52.94%)
strutil: Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+123.53%)
stringdistance: A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Stars: ✭ 60 (+17.65%)
tika-similarity: Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on metadata features.
Stars: ✭ 92 (+80.39%)
stringosim: String similarity functions and string distances: Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS (Longest Common Subsequence), Cosine similarity...
Stars: ✭ 47 (-7.84%)
Example Spark: Spark, Spark Streaming and Spark SQL unit testing strategies
Stars: ✭ 205 (+301.96%)
Dpark: Python clone of Spark, a MapReduce-like framework in Python
Stars: ✭ 2,668 (+5131.37%)
Spark Practice: Apache Spark (PySpark) practice on real data
Stars: ✭ 200 (+292.16%)
Azuredatabricksbestpractices: Version 1 of Technical Best Practices of Azure Databricks, based on real-world customer and technical SME inputs
Stars: ✭ 186 (+264.71%)
Azure Event Hubs: ☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs
Stars: ✭ 233 (+356.86%)
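Sparkstreaming: 💥 🚀 Wraps sparkstreaming to adjust the batch time dynamically (computation runs as soon as data arrives); 🚀 supports adding and removing topics at runtime; 🚀 wraps sparkstreaming 1.6 - kafka 010 to support SSL.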
Stars: ✭ 179 (+250.98%)
Sparkrdma: RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (+321.57%)
Hyperspace: An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Stars: ✭ 246 (+382.35%)
Text-Similarity: Text similarity computation using minhashing and Jaccard distance on the Reuters dataset
Stars: ✭ 15 (-70.59%)
Scanns: A scalable nearest neighbor search library in Apache Spark
Stars: ✭ 190 (+272.55%)
Roaringbitmap: A better compressed bitset in Java
Stars: ✭ 2,460 (+4723.53%)
Spark: Firely's open source FHIR server
Stars: ✭ 174 (+241.18%)
Deeplearning4j: Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes samediff: a PyTorch/TensorFlow-like library for running deep learni…
Stars: ✭ 12,277 (+23972.55%)
Gimel: Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+323.53%)
Data Accelerator: Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
Stars: ✭ 247 (+384.31%)
Spark Knn: k-Nearest Neighbors algorithm on Spark
Stars: ✭ 205 (+301.96%)
Neo4j Spark Connector: Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Stars: ✭ 245 (+380.39%)
Mmlspark: Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+5584.31%)
simetric: String similarity metrics for Elixir
Stars: ✭ 59 (+15.69%)
Ballista: Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Stars: ✭ 2,274 (+4358.82%)
Recommendationsystem: Book recommender system using collaborative filtering based on Spark
Stars: ✭ 244 (+378.43%)
Js Spark: Distributed real-time computation system, a.k.a. distributed lodash
Stars: ✭ 187 (+266.67%)
eddie: No description or website provided.
Stars: ✭ 18 (-64.71%)
Kotlin Spark Api: This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.
Stars: ✭ 183 (+258.82%)
Hadoop Docker: Docker-based Hadoop development and test environment, including Hadoop, Hive, HBase, and Spark
Stars: ✭ 238 (+366.67%)
Geopyspark: GeoTrellis for PySpark
Stars: ✭ 167 (+227.45%)
Ruby Spark: Ruby wrapper for Apache Spark
Stars: ✭ 221 (+333.33%)
Big Whale: Scheduling of offline (batch) jobs such as Spark and Flink, plus monitoring of real-time jobs
Stars: ✭ 163 (+219.61%)
Xsql: Unified SQL Analytics Engine Based on SparkSQL
Stars: ✭ 176 (+245.1%)
Kraps Rpc: An RPC framework leveraging the Spark RPC module
Stars: ✭ 175 (+243.14%)
Koalas: pandas API on Apache Spark
Stars: ✭ 3,044 (+5868.63%)
Spark Nlp: State of the Art Natural Language Processing
Stars: ✭ 2,518 (+4837.25%)
Mydatascienceportfolio: Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (+345.1%)
Transmogrifai: TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Stars: ✭ 2,084 (+3986.27%)
edits.cr: Edit distance algorithms including Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-68.63%)
Spark Workshop: Apache Spark™ and Scala Workshops
Stars: ✭ 224 (+339.22%)
Every Single Day I Tldr: A daily digest of articles and videos I've found interesting and want to share with you.
Stars: ✭ 249 (+388.24%)
visualize-data-with-python: A Jupyter notebook that uses standard data science and data engineering techniques to analyze data from the 2017 flooding in Houston, TX.
Stars: ✭ 60 (+17.65%)
Sagemaker Spark: A Spark library for Amazon SageMaker.
Stars: ✭ 219 (+329.41%)
Whylogs Java: Profile and monitor your ML data pipeline end-to-end
Stars: ✭ 164 (+221.57%)
Spark Fast Tests: Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Stars: ✭ 249 (+388.24%)
Spark Excel: A Spark plugin for reading Excel files via Apache POI
Stars: ✭ 216 (+323.53%)