All Projects → lemire → IndexWikipedia

lemire / IndexWikipedia

Licence: Apache-2.0 license
A simple utility to index wikipedia dumps using Lucene.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to IndexWikipedia

explicit-semantic-analysis
Wikipedia-based Explicit Semantic Analysis, as described by Gabrilovich and Markovitch
Stars: ✭ 34 (+70%)
Mutual labels:  lucene, wikipedia-dump
lqt
Lucene Query Tool
Stars: ✭ 19 (-5%)
Mutual labels:  lucene
Fxdesktopsearch
A JavaFX based desktop search application.
Stars: ✭ 147 (+635%)
Mutual labels:  lucene
Clavin
CLAVIN (Cartographic Location And Vicinity INdexer) is an open source software package for document geoparsing and georesolution that employs context-based geographic entity resolution.
Stars: ✭ 237 (+1085%)
Mutual labels:  lucene
Eclipse Instasearch
Eclipse plug-in for fast code search
Stars: ✭ 165 (+725%)
Mutual labels:  lucene
cloud-note
无道云笔记,原生JSP的仿有道云笔记项目
Stars: ✭ 66 (+230%)
Mutual labels:  lucene
Elastiknn
Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
Stars: ✭ 139 (+595%)
Mutual labels:  lucene
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+210%)
Mutual labels:  lucene
LogiEM
面向Elasticsearch研发与运维人员,围绕集群、索引构建的零侵入、多租户的Elasticsearch GUI管控平台
Stars: ✭ 209 (+945%)
Mutual labels:  lucene
Lucene
lucene技术细节
Stars: ✭ 233 (+1065%)
Mutual labels:  lucene
Examine
A .NET indexing and search engine powered by Lucene.Net
Stars: ✭ 208 (+940%)
Mutual labels:  lucene
Roaringbitmap
A better compressed bitset in Java
Stars: ✭ 2,460 (+12200%)
Mutual labels:  lucene
lucene-demo
基于lucene-5.5.4实现的全文检索demo
Stars: ✭ 70 (+250%)
Mutual labels:  lucene
Code4java
Repository for my java projects.
Stars: ✭ 164 (+720%)
Mutual labels:  lucene
Valley-eCommerce-prototype
An eCommerce website prototype with a layered architecture and MVC using Spring Boot v1.2, Spring Security, Hibernate, and Apache Lucene for full-text searching. for front-end: Bootstrap, Typeahead.js and Graph.js using Thymeleaf as RE.
Stars: ✭ 28 (+40%)
Mutual labels:  lucene
Everywhere
🔧 A tool can really search everywhere for you.
Stars: ✭ 147 (+635%)
Mutual labels:  lucene
Jblog
🔱一个简洁漂亮的java blog 👉基于Spring /MVC+ Hibernate + MySQL + Bootstrap + freemarker. 实现 🌈
Stars: ✭ 187 (+835%)
Mutual labels:  lucene
hermes
A library and microservice implementing the health and care terminology SNOMED CT with support for cross-maps, inference, fast full-text search, autocompletion, compositional grammar and the expression constraint language.
Stars: ✭ 131 (+555%)
Mutual labels:  lucene
luceneappengine
This project provides a directory useful to build Lucene and Google App Engine powered applications
Stars: ✭ 16 (-20%)
Mutual labels:  lucene
solr
Apache Solr open-source search software
Stars: ✭ 651 (+3155%)
Mutual labels:  lucene

IndexWikipedia

A simple utility to index wikipedia dumps using Lucene.

This tool can be used to quickly create an index. It is then expected that a programmer will write some code to use the index. This project does not aim to build an end-user index.

It is useful as part of research projects.

Usage:

  • install java (JDK) if needed
  • install maven if needed
  • grap your wikipedia dump: you might be grab quickly part of the dump by typing a command like wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2. (Sorry the database dumps are not at a fixed location so we cannot provide a precise URI.) Be mindful that there are many types of Wikipedia dumps and not all of them contain the articles: when in doubt, read the documentation.
  • mvn compile
  • Create a directory where your index will reside, such as WikipediaIndex. E.g., you might be able to type mkdir WikipediaIndex. Be mindful not to reuse the same directory for different projects or different Lucene versions.
  • mvn exec:java -Dexec.args="yourdump someoutputdirectory

Actual example:

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Note that this precise example may fail unless you adjust the URI https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 since Wikipedia dumps are not guaranteed to stay at the same URI.

The documents have title, name, docid and body fields, all of which are stored with the index.

To see how you might then query the index, see the class file 'Query.java' for a working example.

Extracting word-frequency pairs

There is also a poorly named utility to extract all word-frequency pairs called me.lemire.lucene.CreateFreqSortedDictionary. Deliberately, it is currently undocumented.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].