All Projects → pisa-engine → Pisa

pisa-engine / Pisa

Licence: apache-2.0
PISA: Performant Indexes and Search for Academia

Projects that are alternatives of or similar to Pisa

Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+762.37%)
Mutual labels:  search, search-engine, information-retrieval
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+597.14%)
Mutual labels:  search, search-engine, information-retrieval
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (-72.6%)
Mutual labels:  search, search-engine, information-retrieval
Resin
Hardware-accelerated vector-based search engine. Available as a HTTP service or as an embedded library.
Stars: ✭ 529 (+8.18%)
Mutual labels:  search, search-engine, information-retrieval
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (-25.97%)
Mutual labels:  search, search-engine, information-retrieval
Search Engine
A math-aware search engine.
Stars: ✭ 278 (-43.15%)
Mutual labels:  search-engine, information-retrieval
Toshi
A full-text search engine in rust
Stars: ✭ 3,373 (+589.78%)
Mutual labels:  search, search-engine
Bitfunnel
A signature-based search engine
Stars: ✭ 313 (-35.99%)
Mutual labels:  search, search-engine
Hexo Generator Search
A plugin to generate search data for Hexo.
Stars: ✭ 318 (-34.97%)
Mutual labels:  search, search-engine
Xapiand
Xapiand: A RESTful Search Engine
Stars: ✭ 347 (-29.04%)
Mutual labels:  search, search-engine
Minsql
High-performance log search engine.
Stars: ✭ 356 (-27.2%)
Mutual labels:  search, search-engine
Go Cyber
Your 🔵 Superintelligence
Stars: ✭ 270 (-44.79%)
Mutual labels:  search, search-engine
Redisearch
A query and indexing engine for Redis, providing secondary indexing, full-text search, and aggregations.
Stars: ✭ 3,393 (+593.87%)
Mutual labels:  search, search-engine
Alfred Npms
Alfred 3 workflow to search for npm packages with npms.io
Stars: ✭ 312 (-36.2%)
Mutual labels:  search, search-engine
Searchcode Server
The offical home of searchcode-server where you can run searchcode locally. Note that master is generally unstable in the sense that it is not a release. Check releases for release versions https://github.com/boyter/searchcode-server/releases
Stars: ✭ 262 (-46.42%)
Mutual labels:  search, search-engine
Darksearch
🔍 Search engine for hidden material. Scraping dark web onions, irc logs, deep web etc...
Stars: ✭ 260 (-46.83%)
Mutual labels:  search, search-engine
Jivesearch
A search engine that doesn't track you.
Stars: ✭ 364 (-25.56%)
Mutual labels:  search, search-engine
Awesome Search
Awesome Search - this is all about the (e-commerce) search and its awesomeness
Stars: ✭ 361 (-26.18%)
Mutual labels:  search, search-engine
Maryam
Maryam: Open-source Intelligence(OSINT) Framework
Stars: ✭ 371 (-24.13%)
Mutual labels:  search, search-engine
see
Search Engine in Erlang
Stars: ✭ 27 (-94.48%)
Mutual labels:  search-engine, information-retrieval

PISA: Performant Indexes and Search for Academia v0.8.2

Build and test clang-tidy Documentation Status codecov

Codacy Badge GitHub issues GitHub forks GitHub stars PRs Welcome DOI

Join us on Slack

Get in touch via Slack: Slack

Overview

PISA is a text search engine able to run on large-scale collections of documents. It allows researchers to experiment with state-of-the-art techniques, allowing an ideal environment for rapid development.

Some features of PISA are listed below:

  • Written in C++ for performance;
  • Parsing, Indexing, and Sharding capabilities;
  • Many index compression methods implemented;
  • Many query processing algorithms implemented;
  • Implementation of document reordering;
  • Free and open-source with permissive license;

About PISA

What is PISA?

PISA is a text search engine, though the "PISA Project" is a set of tools that help experiment with indexing and query processing. Given a text collection, PISA can build an inverted index over this corpus, allowing the corpus to be searched. The inverted index, put simply, is an efficient data structure that represents the document corpus by storing a list of documents for each unique term (see here). At query time, PISA stores its index in main memory for rapid retrieval.

What does that all mean?

In very simple terms, PISA is a text search engine. Starting with a corpus of documents, for example, Wikipedia, PISA can build an inverted index which allows us to rapidly search the Wikipedia collection. At the most basic level, Boolean AND and OR queries are supported. Imagine we wanted to find all of the Wikipedia documents matching the query "oolong tea" - we could run a Boolean conjunction (oolong AND tea). We might instead be interested in finding documents containing either oolong or tea (or both), in which case we can run a Boolean disjunction (oolong OR tea).

Beyond simple Boolean matching, as discussed above, we can actually rank documents. Without going into details, documents are ranked by functions that assume the more rare a term is, the more important the word is. These rankers also assume that the more often a word appears in a document, the more likely the document is to be about that word. Finally, longer documents contain more words, and are therefore more likely to get higher scores than shorter documents, so normalization is conducted to ensure all documents are treated equally. The interested reader may wish to examine the TF/IDF Wikipedia article to learn more about this method of ranking.

Search vs pattern matching

This is not the same type of search grep for example. This is more closely related to the popular Lucene search engine, though we don't currently support as many query types as Lucene. As discussed previously, the main underlying data structure in PISA is the inverted index. The inverted index stores, for each term, a list of documents that contain that term. These lists can be rapidly traversed to find documents that match the query terms, and these documents can then be scored, ranked, and returned to the user.

Who should use PISA?

The primary use-case for PISA is to conduct experiments to further the understanding of the field of Information Retrieval (IR). Within the field of IR, there are various important research directions that are focused on, from improving results quality (effectiveness), to improving the scalability and efficiency of search systems. PISA is focused mostly on the scalability and efficiency side of IR research, and is why PISA stands for "Performant Indexes and Search for Academia". In short, PISA is a platform for developing new innovations in efficient search.

What if I just want to play with a search engine?

While PISA is focused on being a base for experimentation, it is also perfectly suitable for use as a simple general purpose indexing and search system.

What sort of scale can PISA handle?

PISA can handle large text collections. For example, PISA can easily index the ClueWeb09B or ClueWeb12B corpora, which each contain over 50 million web documents and close to 500 GiB of compressed textual data, resulting in indexes that are in the range of 10-40 GiB, depending on the compression codec used. These indexes, depending on some details, can be built from scratch in around 10-20 hours. In addition, larger collections can be handled via index sharding which breaks large corpora into smaller subsets (shards). We do note, however, that PISA is an in-memory system, which aims to serve queries based on an index residing entirely in main memory. As such, the possible scale will be limited by main memory.

How fast is PISA, really?

A few recent works have benchmarked PISA. For example, An Experimental Study of Index Compression and DAAT Query Processing Methods by Antonio Mallia, Michal Siedlaczek, and Torsten Suel, which appeared in ECIR 2019, showed PISA to be capable of returning the top 10 and top 1000 documents with an average latency in the range of 10-40 and 20-50 milliseconds respectively, on a collection containing 50 million web documents.

PISA is also included in Tantivy's search engine benchmark game, which also has Tantivy, Lucene, and Rucene as competitors.

How did PISA begin?

PISA is a fork of the ds2i project started by Giuseppe Ottaviano. The ds2i project contained the source code for a number of important efficiency innovations in IR, including the "Partitioned Elias-Fano" compression method.

Getting Started

For those interested in working with PISA, we suggest examining the following resources:

  1. The Open Source Information Retrieval Replicability Challenge (OSIRRC) paper describing PISA, including end-to-end experimentation.
  2. The accompanying Docker image which allows the experiments from above to be replicated.
  3. The documentation.
  4. Drop in to our Slack channel and say hi!

If you want to get involved with PISA, please check out our Contributing page.

Reference

If you use PISA in a research paper, please cite the following reference:

@inproceedings{MSMS2019,
  author    = {Antonio Mallia and Michal Siedlaczek and Joel Mackenzie and Torsten Suel},
  title     = {{PISA:} Performant Indexes and Search for Academia},
  booktitle = {Proceedings of the Open-Source {IR} Replicability Challenge co-located
               with 42nd International {ACM} {SIGIR} Conference on Research and Development
               in Information Retrieval, [email protected] 2019, Paris, France, July 25,
               2019.},
  pages     = {50--56},
  year      = {2019},
  url       = {http://ceur-ws.org/Vol-2409/docker08.pdf}
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].