
msarhan / lucene-arabic-analyzer

Licence: MIT license
Apache Lucene analyzer for Arabic language with root based stemmer.

Programming Languages

java
68154 projects - #9 most used programming language
shell
77523 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to lucene-arabic-analyzer

Roaringbitmap
A better compressed bitset in Java
Stars: ✭ 2,460 (+9011.11%)
Mutual labels:  lucene
lucene-demo
A full-text search demo built on lucene-5.5.4
Stars: ✭ 70 (+159.26%)
Mutual labels:  lucene
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+129.63%)
Mutual labels:  lucene
Jblog
🔱 A clean, good-looking Java blog 👉 built with Spring MVC + Hibernate + MySQL + Bootstrap + FreeMarker 🌈
Stars: ✭ 187 (+592.59%)
Mutual labels:  lucene
hermes
A library and microservice implementing the health and care terminology SNOMED CT with support for cross-maps, inference, fast full-text search, autocompletion, compositional grammar and the expression constraint language.
Stars: ✭ 131 (+385.19%)
Mutual labels:  lucene
LogiEM
A zero-intrusion, multi-tenant Elasticsearch GUI management platform built around clusters and indices, for Elasticsearch developers and operations engineers
Stars: ✭ 209 (+674.07%)
Mutual labels:  lucene
Code4java
Repository for my java projects.
Stars: ✭ 164 (+507.41%)
Mutual labels:  lucene
jease
Jease is a Java CMS framework based on Object Database
Stars: ✭ 25 (-7.41%)
Mutual labels:  lucene
cloud-note
Wudao Cloud Note, a Youdao Cloud Note clone written in plain JSP
Stars: ✭ 66 (+144.44%)
Mutual labels:  lucene
solr
Apache Solr open-source search software
Stars: ✭ 651 (+2311.11%)
Mutual labels:  lucene
Examine
A .NET indexing and search engine powered by Lucene.Net
Stars: ✭ 208 (+670.37%)
Mutual labels:  lucene
Clavin
CLAVIN (Cartographic Location And Vicinity INdexer) is an open source software package for document geoparsing and georesolution that employs context-based geographic entity resolution.
Stars: ✭ 237 (+777.78%)
Mutual labels:  lucene
lqt
Lucene Query Tool
Stars: ✭ 19 (-29.63%)
Mutual labels:  lucene
Smartstorenet
Open Source ASP.NET MVC Enterprise eCommerce Shopping Cart Solution
Stars: ✭ 2,363 (+8651.85%)
Mutual labels:  lucene
luceneappengine
This project provides a directory useful to build Lucene and Google App Engine powered applications
Stars: ✭ 16 (-40.74%)
Mutual labels:  lucene
Eclipse Instasearch
Eclipse plug-in for fast code search
Stars: ✭ 165 (+511.11%)
Mutual labels:  lucene
RedisDirectory
🔒 A simple Redis-based index storage engine for Lucene - Star me if you like it!
Stars: ✭ 18 (-33.33%)
Mutual labels:  lucene
lucene-postings-format
At-a-glance overview diagrams of Apache Lucene's default PostingsFormat (inverted index binary format).
Stars: ✭ 65 (+140.74%)
Mutual labels:  lucene
IndexWikipedia
A simple utility to index wikipedia dumps using Lucene.
Stars: ✭ 20 (-25.93%)
Mutual labels:  lucene
Valley-eCommerce-prototype
An eCommerce website prototype with a layered architecture and MVC using Spring Boot v1.2, Spring Security, Hibernate, and Apache Lucene for full-text searching. for front-end: Bootstrap, Typeahead.js and Graph.js using Thymeleaf as RE.
Stars: ✭ 28 (+3.7%)
Mutual labels:  lucene


lucene-arabic-analyzer

Apache Lucene analyzer for Arabic language with root based stemmer.

Introduction

Stemming algorithms are used in information retrieval systems, text classifiers, indexers and text mining to extract roots of different words, so that words derived from the same stem or root are grouped together.

ArabicRootExtractorAnalyzer performs the following steps:

  1. Normalize input text by removing diacritics: e.g. "الْعَالَمِينَ" will be converted to "العالمين".
  2. Extract each word's root: e.g. "العالمين" will be converted to "علم".
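
The normalization step can be illustrated with a minimal, standalone sketch. It is not the analyzer's actual implementation: the class name DiacriticsDemo, the method stripDiacritics, and the exact character range (the Arabic short-vowel and tanwin marks U+064B to U+0652, plus the superscript alef U+0670) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class DiacriticsDemo {
    // Assumed set of marks to remove: fathatan..sukun (U+064B..U+0652)
    // and the superscript alef (U+0670). The real analyzer may use a different set.
    private static final Pattern DIACRITICS =
            Pattern.compile("[\\u064B-\\u0652\\u0670]");

    // Returns the text with every matched diacritic removed; base letters are untouched.
    static String stripDiacritics(String text) {
        return DIACRITICS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints the undiacritized form of the word from step 1.
        System.out.println(stripDiacritics("الْعَالَمِينَ"));
    }
}
```

An explicit character range is used instead of Unicode decomposition so that letters carrying a hamza are not split into a base letter plus a combining mark.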

This way, documents are indexed by the roots of their words, so a search for "علم" or "عالم" returns all documents containing "الْعَالَمِينَ".
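
To make the effect of root-based indexing concrete, here is a toy inverted index keyed by root. It is only a sketch under stated assumptions: RootIndexDemo and its methods are hypothetical, the hand-written ROOTS table stands in for the real root extractor, and the input is assumed to be already stripped of diacritics. Lucene's actual index structures are far more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RootIndexDemo {
    // Hypothetical lookup table standing in for the real root extractor.
    private static final Map<String, String> ROOTS = Map.of(
            "العالمين", "علم",
            "عالم", "علم",
            "علم", "علم");

    // Maps each root to the ids of documents containing a word with that root.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexes every whitespace-separated word of the text under its root.
    void add(int docId, String text) {
        for (String word : text.trim().split("\\s+")) {
            postings.computeIfAbsent(rootOf(word), r -> new TreeSet<>()).add(docId);
        }
    }

    // Reduces the query word to its root and returns the matching document ids.
    Set<Integer> search(String word) {
        return postings.getOrDefault(rootOf(word), Set.of());
    }

    // Falls back to the word itself when the table has no entry for it.
    private static String rootOf(String word) {
        return ROOTS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        RootIndexDemo index = new RootIndexDemo();
        index.add(1, "العالمين");
        System.out.println(index.search("علم"));
        System.out.println(index.search("عالم"));
    }
}
```

Because the indexed word and both query words reduce to the same root, both lookups find document 1, even though neither query string appears literally in the document.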

Installation

Maven

<dependency>
  <groupId>com.github.msarhan</groupId>
  <artifactId>lucene-arabic-analyzer</artifactId>
  <version>[VERSION]</version>
</dependency>

Usage

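//Imports needed by the snippets below
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
//Package assumed from the Maven groupId; check the project's Javadoc
import com.github.msarhan.lucene.ArabicRootExtractorAnalyzer;

//Initialize the index
//RAMDirectory is deprecated since Lucene 8 and removed in Lucene 9, where ByteBuffersDirectory replaces it
Directory index = new RAMDirectory();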
Analyzer analyzer = new ArabicRootExtractorAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);

Document doc = new Document();
doc.add(new StringField("number", "1", Field.Store.YES));
doc.add(new TextField("title", "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ", Field.Store.YES));
writer.addDocument(doc);

doc = new Document();
doc.add(new StringField("number", "2", Field.Store.YES));
doc.add(new TextField("title", "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ", Field.Store.YES));
writer.addDocument(doc);

doc = new Document();
doc.add(new StringField("number", "3", Field.Store.YES));
doc.add(new TextField("title", "الرَّحْمَنِ الرَّحِيمِ", Field.Store.YES));
writer.addDocument(doc);
writer.close();
//~

//Query the index
//The query text passes through the same analyzer, so it is reduced to its root before matching
String queryStr = "راحم";
Query query = new QueryParser("title", analyzer)
    .parse(queryStr);

int hitsPerPage = 5;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(query, hitsPerPage, Sort.INDEXORDER);

ScoreDoc[] hits = docs.scoreDocs;
//~

//Print results
System.out.println("Found " + hits.length + " hits:");
for (ScoreDoc hit : hits) {
    int docId = hit.doc;
    Document d = searcher.doc(docId);
    System.out.printf("\t(%s): %s\n", d.get("number"), d.get("title"));
}
reader.close();
//~

Usage of ArabicRootExtractorStemmer

//stem(...) returns the candidate roots of a word; assertTrue is JUnit's static import
ArabicRootExtractorStemmer stemmer = new ArabicRootExtractorStemmer();

assertTrue(stemmer.stem("الرَّحْمَنِ").stream().anyMatch(s -> s.equals("رحم")));
assertTrue(stemmer.stem("الْعَالَمِينَ").stream().anyMatch(s -> s.equals("علم")));
assertTrue(stemmer.stem("الْمُؤْمِنِينَ").stream().anyMatch(s -> s.equals("ءمن")));
assertTrue(stemmer.stem("يَتَنَازَعُونَ").stream().anyMatch(s -> s.equals("نزع")));

Integration with Elasticsearch

To use this analyzer with Elasticsearch, use the elasticsearch-arabic-analyzer plugin.

Building

# Install AlKhalil jar files in your local maven repository
cd alkhalil && ./maven-install.sh

# The resulting jar file will include the AlKhalil dependencies
mvn package