zouzias / Spark Lucenerdd

Licence: apache-2.0
Spark RDD with Lucene's query and entity linkage capabilities

Projects that are alternatives of or similar to Spark Lucenerdd

splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+58.77%)
Mutual labels:  spark, deduplication
Roaringbitmap
A better compressed bitset in Java
Stars: ✭ 2,460 (+2057.89%)
Mutual labels:  spark, lucene
Elassandra
Elassandra = Elasticsearch + Apache Cassandra
Stars: ✭ 1,610 (+1312.28%)
Mutual labels:  spark, lucene
experiments
Code examples for my blog posts
Stars: ✭ 21 (-81.58%)
Mutual labels:  spark, lucene
Waterdrop
Production Ready Data Integration Product, documentation:
Stars: ✭ 1,856 (+1528.07%)
Mutual labels:  spark
Seldon Server
Machine Learning Platform and Recommendation Engine built on Kubernetes
Stars: ✭ 1,435 (+1158.77%)
Mutual labels:  spark
Spark On K8s Operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Stars: ✭ 1,780 (+1461.4%)
Mutual labels:  spark
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-7.89%)
Mutual labels:  spark
Spark Mllib Twitter Sentiment Analysis
🌟 ✨ Analyze and visualize Twitter Sentiment on a world map using Spark MLlib
Stars: ✭ 113 (-0.88%)
Mutual labels:  spark
Python Bigdata
Data science and Big Data with Python
Stars: ✭ 112 (-1.75%)
Mutual labels:  spark
Bigdataclass
Two-day workshop that covers how to use R to interact with databases and Spark
Stars: ✭ 110 (-3.51%)
Mutual labels:  spark
Flink Learning
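Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, source code analysis, and more. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, plus case studies of large production projects built on Flink (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Big Data Real-Time Computing Engine: Flink in Practice and Performance Optimization".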
Stars: ✭ 11,378 (+9880.7%)
Mutual labels:  spark
Lambda Arch
Applying Lambda Architecture with Spark, Kafka, and Cassandra.
Stars: ✭ 111 (-2.63%)
Mutual labels:  spark
Logigsk
A Linux-based software package to control LEDs on Logitech G910, G810, G610 and G410.
Stars: ✭ 107 (-6.14%)
Mutual labels:  spark
Ik Analyzer
Supports Lucene 5/6/7/8+ versions, with long-term maintenance.
Stars: ✭ 112 (-1.75%)
Mutual labels:  lucene
Sparktutorial
Source code for James Lee's Apache Spark with Java course
Stars: ✭ 105 (-7.89%)
Mutual labels:  spark
Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (-4.39%)
Mutual labels:  spark
Archivespark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
Stars: ✭ 111 (-2.63%)
Mutual labels:  spark
Distributed Dataset
A distributed data processing framework in Haskell.
Stars: ✭ 108 (-5.26%)
Mutual labels:  spark
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-5.26%)
Mutual labels:  spark

spark-lucenerdd

Spark RDD with Apache Lucene's query capabilities.

The main abstractions are special types of RDD called LuceneRDD, FacetedLuceneRDD and ShapeLuceneRDD, which instantiate a Lucene index on each Spark executor. These RDDs distribute search queries and aggregate search results between the Spark driver and its executors. Currently, the following queries are supported:

| Operation | Syntax | Description |
|-----------|--------|-------------|
| Term Query | LuceneRDD.termQuery(field, query, topK) | Exact term search |
| Fuzzy Query | LuceneRDD.fuzzyQuery(field, query, maxEdits, topK) | Fuzzy term search |
| Phrase Query | LuceneRDD.phraseQuery(field, query, topK) | Phrase search |
| Prefix Query | LuceneRDD.prefixSearch(field, prefix, topK) | Prefix search |
| Query Parser | LuceneRDD.query(queryString, topK) | Query parser search |
| Faceted Search | FacetedLuceneRDD.facetQuery(queryString, field, topK) | Faceted search |
| Record Linkage | LuceneRDD.link(otherEntity: RDD[T], linkageFct: T => searchQuery, topK) | Record linkage via Lucene queries |
| Circle Search | ShapeLuceneRDD.circleSearch((x,y), radius, topK) | Search within radius |
| Bbox Search | ShapeLuceneRDD.bboxSearch(lowerLeft, upperRight, topK) | Search within bounding box |
| Spatial Linkage | ShapeLuceneRDD.linkByRadius(RDD[T], linkage: T => (x,y), radius, topK) | Spatial radius linkage |

Using the query parser, you can perform prefix queries, fuzzy queries, etc., and any combination of those. For more information on using Lucene's query parser, see Query Parser.
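As a rough illustration of the operations listed above, the following sketch indexes a handful of strings and runs a term query, a fuzzy query and a query-parser query. It assumes spark-lucenerdd 0.3.9 and a compatible Spark version are on the classpath; the object name, the sample words and the local master setting are hypothetical. Plain strings are assumed to be indexed under the default field name "_1".

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._          // implicit conversions to Lucene documents
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical example object; not part of the library.
object LuceneRDDQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lucenerdd-queries")
      .master("local[*]")                     // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Each string becomes one Lucene document with a single field "_1".
    val words = sc.parallelize(Seq("Hello", "world", "help", "hold"))
    val luceneRDD = LuceneRDD(words)

    // Exact term search: top 10 hits for the term "hello" in field "_1".
    val exact = luceneRDD.termQuery("_1", "hello", 10)

    // Fuzzy term search: allow at most 1 edit from "helo".
    val fuzzy = luceneRDD.fuzzyQuery("_1", "helo", 1, 10)

    // Query parser search: Lucene query syntax, here a prefix query on "_1".
    val parsed = luceneRDD.query("_1:hel*", 10)

    // The responses are RDDs, so they can be collected on the driver.
    exact.collect().foreach(println)
    fuzzy.collect().foreach(println)
    parsed.collect().foreach(println)

    spark.stop()
  }
}
```

The element type of the collected results differs between library versions, so the sketch only prints them rather than relying on a particular result structure.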

Examples

Here are a few examples using LuceneRDD for full text search, spatial search and record linkage. All examples exploit Lucene's flexible query language. For spatial search, lucene-spatial and jts are required.

For more, check the wiki. Further examples are available in the examples repository, and performance evaluation examples on AWS are also available.
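Record linkage can be sketched along the following lines: each element of a second RDD is turned into a Lucene query string, and the top matches from the indexed RDD are returned per element. This is a minimal sketch assuming spark-lucenerdd 0.3.9 on the classpath; the object name and the sample data are hypothetical, and the query string assumes the default field name "_1" for plain strings.

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._          // implicit conversions to Lucene documents
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical example object; not part of the library.
object LuceneRDDLinkageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lucenerdd-linkage")
      .master("local[*]")                     // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Indexed side: full country names.
    val countries = sc.parallelize(Seq("Greece", "Germany", "Spain", "Italy"))
    val luceneRDD = LuceneRDD(countries)

    // Side to be linked: truncated, lowercase country names.
    val prefixes = sc.parallelize(Seq("gree", "germa", "spa", "ita"))

    // Linkage function: maps each element to a Lucene prefix query on field "_1".
    val linker: String => String = prefix => s"_1:${prefix}*"

    // For every element of `prefixes`, fetch up to 10 matching documents.
    val linked = luceneRDD.link(prefixes, linker, 10)

    linked.collect().foreach { case (prefix, matches) =>
      println(s"$prefix -> ${matches.length} candidate match(es)")
    }

    spark.stop()
  }
}
```

Keeping the linkage function a plain `T => String` means any Lucene query syntax (fuzzy, phrase, boolean combinations) can be used to express the matching rule.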

Presentations

For an overview of the library, check these ScalaIO 2016 Slides.

Linking

You can link against this library (for Spark 1.4+) in your program at the following coordinates:

Using SBT:

libraryDependencies += "org.zouzias" %% "spark-lucenerdd" % "0.3.9"

Using Maven:

<dependency>
    <groupId>org.zouzias</groupId>
    <artifactId>spark-lucenerdd_2.11</artifactId>
    <version>0.3.9</version>
</dependency>

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages org.zouzias:spark-lucenerdd_2.11:0.3.9

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.

Compatibility

The project has the following compatibility with Apache Spark:

| Artifact | Release Date | Spark compatibility | Notes | Status |
|----------|--------------|---------------------|-------|--------|
| 0.3.10-SNAPSHOT | | >= 3.x, JVM 8 | develop | Under Development |
| 0.3.9 | 2020-11-30 | >= 2.4.7, JVM 8 | tag v.0.3.9 | Released |
| 0.3.7 | 2019-04-26 | >= 2.4.2, JVM 8 | tag v.0.3.7 | Released |
| 0.2.8 | 2017-05-30 | 2.1.x, JVM 7 | tag v0.2.8 | Released |
| 0.1.0 | 2016-09-26 | 1.4.x, 1.5.x, 1.6.x | tag v0.1.0 | Cross-released with 2.10/2.11 |

Project Status and Limitations

Implicit conversions for the primitive types (Int, Float, Double, Long, String) are supported. Moreover, implicit conversions for all product types (i.e., tuples and case classes) of the above primitives are supported. Implicits for tuples default the field names to "_1", "_2", "_3", ... following Scala's naming conventions for tuples. In addition, implicits for most Spark DataFrame types are supported (MapType and boolean are missing).
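To make the tuple naming convention concrete, here is a small sketch that indexes pairs of strings and queries the second component through the field name "_2". It assumes spark-lucenerdd 0.3.9 on the classpath; the object name and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._          // implicit conversions for tuples
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical example object; not part of the library.
object LuceneRDDTupleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lucenerdd-tuples")
      .master("local[*]")                     // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // (name, city) pairs; the components are indexed as fields "_1" and "_2".
    val people = sc.parallelize(Seq(("alice", "athens"), ("bob", "berlin")))
    val luceneRDD = LuceneRDD(people)

    // Search the second tuple component by its default field name.
    val inAthens = luceneRDD.termQuery("_2", "athens", 10)
    inAthens.collect().foreach(println)

    spark.stop()
  }
}
```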

Custom Case Classes

You can use your own case class with LuceneRDD, provided that every member of the class is one of the primitive types (Int, Float, Double, Long, String).

For more details, see LuceneRDDCustomcaseClassImplicits under the tests directory.
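A minimal sketch of this, assuming spark-lucenerdd 0.3.9 on the classpath, could look as follows. The `Person` case class, the object name and the sample data are hypothetical; the case class is declared at the top level so that the implicit conversion can inspect its members, and its member names are assumed to become the Lucene field names.

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._          // implicit conversions for case classes
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical case class; all members are supported primitive types.
case class Person(name: String, age: Int, email: String)

// Hypothetical example object; not part of the library.
object LuceneRDDCaseClassExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lucenerdd-case-class")
      .master("local[*]")                     // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val people = Seq("fear", "death", "water").zipWithIndex.map {
      case (name, index) => Person(name, index, s"$name@gmail.com")
    }
    val luceneRDD = LuceneRDD(sc.parallelize(people))

    // Query by the case class member name instead of "_1", "_2", ...
    val hits = luceneRDD.termQuery("name", "water", 10)
    hits.collect().foreach(println)

    spark.stop()
  }
}
```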

Development

Docker

A Docker Compose script is set up with some preliminary notebooks in Zeppelin. To start it, run:

docker-compose up

For more LuceneRDD examples on Zeppelin, check these examples.

Build from Source

Install Java and SBT, then clone and build the project:

git clone https://github.com/zouzias/spark-lucenerdd.git
cd spark-lucenerdd
sbt compile assembly

The above will create an assembly jar containing spark-lucenerdd functionality under target/scala-*/spark-lucenerdd-assembly-*.jar
