pisa-engine / Pisa

Licence: apache-2.0

PISA: Performant Indexes and Search for Academia

Labels: search, search-engine, information-retrieval

PISA: Performant Indexes and Search for Academia v0.8.2

Join us on Slack

Overview

PISA is a text search engine able to run on large-scale collections of documents. It allows researchers to experiment with state-of-the-art techniques, providing an ideal environment for rapid development.

Some features of PISA are listed below:

Written in C++ for performance;
Parsing, Indexing, and Sharding capabilities;
Many index compression methods implemented;
Many query processing algorithms implemented;
Implementation of document reordering;
Free and open-source with a permissive license.
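
To give a feel for what "index compression" means in the list above, here is a minimal Python sketch of one classic approach: store the gaps between consecutive document IDs and encode each gap with a variable number of bytes (VByte). This is an illustration only, not PISA code and not necessarily one of the codecs PISA ships; the function names encode and decode are hypothetical.

```python
def encode(doc_ids):
    """Delta-encode a strictly increasing ID list, then VByte-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)      # high bit marks the last byte of a value
    return bytes(out)


def decode(data):
    """Invert encode(): rebuild the original document IDs from the byte stream."""
    doc_ids, value, shift, previous = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:
            value |= (byte & 0x7F) << shift
            previous += value       # undo the delta encoding
            doc_ids.append(previous)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return doc_ids


postings = [3, 7, 150, 100000]
packed = encode(postings)
print(len(packed))                 # 7
print(decode(packed) == postings)  # True
```

Four document IDs fit in 7 bytes here, compared with 16 bytes as fixed 32-bit integers. Small gaps, which are common in long posting lists, cost a single byte each.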

About PISA

What is PISA?

PISA is a text search engine, and the broader "PISA Project" is a set of tools for experimenting with indexing and query processing. Given a text collection, PISA can build an inverted index over it, allowing the corpus to be searched. The inverted index, put simply, is an efficient data structure that represents the document corpus by storing a list of documents for each unique term. At query time, PISA keeps its index in main memory for rapid retrieval.
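
As a rough illustration of the data structure described above, the following Python sketch builds a tiny inverted index in memory. It is not PISA's implementation: the function name build_inverted_index, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive in increasing order
    return dict(index)


docs = [
    "oolong tea is a traditional tea",
    "green tea and black tea",
    "coffee is not tea",
]
index = build_inverted_index(docs)
print(index["tea"])     # [0, 1, 2]
print(index["oolong"])  # [0]
```

Each list of document IDs is usually called a posting list. Keeping the lists sorted is what makes fast merging and compact encoding possible.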

What does that all mean?

In very simple terms, PISA is a text search engine. Starting with a corpus of documents, for example, Wikipedia, PISA can build an inverted index which allows us to rapidly search the Wikipedia collection. At the most basic level, Boolean AND and OR queries are supported. Imagine we wanted to find all of the Wikipedia documents matching the query "oolong tea" - we could run a Boolean conjunction (oolong AND tea). We might instead be interested in finding documents containing either oolong or tea (or both), in which case we can run a Boolean disjunction (oolong OR tea).
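
The two Boolean queries above can be sketched as merges over sorted posting lists. This is a minimal Python sketch under assumed data: the document IDs are made up, and the helper names intersect and union are hypothetical rather than part of PISA.

```python
def intersect(a, b):
    """Boolean AND: doc IDs present in both sorted posting lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Boolean OR: doc IDs present in either sorted posting list."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


oolong = [2, 7, 19]       # made-up IDs of documents containing "oolong"
tea = [1, 2, 5, 7, 11]    # made-up IDs of documents containing "tea"
print(intersect(oolong, tea))  # [2, 7]
print(union(oolong, tea))      # [1, 2, 5, 7, 11, 19]
```

Both functions make a single pass over each list, so their cost grows with the length of the posting lists rather than with the size of the whole collection.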

Beyond the simple Boolean matching discussed above, we can also rank documents. Without going into details, documents are ranked by functions that assume that the rarer a term is, the more important it is. These rankers also assume that the more often a word appears in a document, the more likely the document is to be about that word. Finally, longer documents contain more words and are therefore more likely to get higher scores than shorter documents, so scores are normalized by document length to ensure all documents are treated equally. The interested reader may wish to examine the TF/IDF Wikipedia article to learn more about this method of ranking.
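
The three assumptions above (rare terms matter more, repeated terms matter more, long documents need normalizing) can be sketched with a basic TF-IDF scorer in Python. This is one textbook variant chosen for brevity, not the ranking function PISA uses; the function name rank and the sample documents are assumptions.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n_docs / df[term])        # rarer term, larger weight
                score += (tf[term] / len(tokens)) * idf  # length-normalized frequency
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "oolong tea is a traditional tea",
    "green tea and black tea",
    "coffee is not tea",
]
for doc_id, score in rank("oolong tea", docs):
    print(doc_id, round(score, 3))
```

In this toy collection "tea" appears in every document, so its weight is log(3/3) = 0 and only the rare term "oolong" separates document 0 from the rest.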

Search vs pattern matching

This is not the same type of search that grep, for example, performs. It is more closely related to the popular Lucene search engine, though we don't currently support as many query types as Lucene. As discussed previously, the main underlying data structure in PISA is the inverted index. The inverted index stores, for each term, a list of documents that contain that term. These lists can be rapidly traversed to find documents that match the query terms, and these documents can then be scored, ranked, and returned to the user.

Who should use PISA?

The primary use case for PISA is to conduct experiments that further the understanding of the field of Information Retrieval (IR). Within IR, there are various important research directions, from improving result quality (effectiveness) to improving the scalability and efficiency of search systems. PISA is focused mostly on the scalability and efficiency side of IR research, which is why PISA stands for "Performant Indexes and Search for Academia". In short, PISA is a platform for developing innovations in efficient search.

What if I just want to play with a search engine?

While PISA is focused on being a base for experimentation, it is also perfectly suitable for use as a simple general purpose indexing and search system.

What sort of scale can PISA handle?

PISA can handle large text collections. For example, PISA can easily index the ClueWeb09B or ClueWeb12B corpora, which each contain over 50 million web documents and close to 500 GiB of compressed textual data, resulting in indexes in the range of 10-40 GiB, depending on the compression codec used. These indexes, depending on some details, can be built from scratch in around 10-20 hours. In addition, larger collections can be handled via index sharding, which breaks large corpora into smaller subsets (shards). We do note, however, that PISA is an in-memory system, which aims to serve queries from an index residing entirely in main memory. As such, the possible scale is limited by the available main memory.
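
The sharding idea mentioned above can be sketched as follows: split the document IDs into shards, find the best results in each shard independently, then merge the partial results. This Python sketch is illustrative only. The round-robin assignment, the function names, and the precomputed scores dictionary (standing in for a real per-query scoring step) are all assumptions, not a description of how PISA shards an index.

```python
import heapq


def make_shards(doc_ids, num_shards):
    """Assign documents to shards round-robin (one simple, hypothetical policy)."""
    shards = [[] for _ in range(num_shards)]
    for position, doc_id in enumerate(doc_ids):
        shards[position % num_shards].append(doc_id)
    return shards


def search_shard(shard, scores, k):
    """Top-k (score, doc_id) pairs within a single shard."""
    return heapq.nlargest(k, ((scores[d], d) for d in shard))


def search_all(shards, scores, k):
    """Query every shard independently, then merge the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, scores, k)]
    return heapq.nlargest(k, partial)


# Made-up query scores for six documents.
scores = {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.7, 4: 0.1, 5: 0.8}
shards = make_shards(sorted(scores), num_shards=2)
print(shards)                         # [[0, 2, 4], [1, 3, 5]]
print(search_all(shards, scores, 3))  # [(0.9, 1), (0.8, 5), (0.7, 3)]
```

Because each shard returns its own top k, the merged top k is the same as it would be over the unsharded collection, while each shard only needs to hold its own part of the index.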

How fast is PISA, really?

A few recent works have benchmarked PISA. For example, "An Experimental Study of Index Compression and DAAT Query Processing Methods" by Antonio Mallia, Michal Siedlaczek, and Torsten Suel, which appeared in ECIR 2019, showed PISA to be capable of returning the top 10 and top 1000 documents with an average latency in the range of 10-40 and 20-50 milliseconds respectively, on a collection containing 50 million web documents.

PISA is also included in Tantivy's search engine benchmark game, which also has Tantivy, Lucene, and Rucene as competitors.

How did PISA begin?

PISA is a fork of the ds2i project started by Giuseppe Ottaviano. The ds2i project contained the source code for a number of important efficiency innovations in IR, including the "Partitioned Elias-Fano" compression method.

Getting Started

For those interested in working with PISA, we suggest examining the following resources:

The Open Source Information Retrieval Replicability Challenge (OSIRRC) paper describing PISA, including end-to-end experimentation.
The accompanying Docker image which allows the experiments from above to be replicated.
The documentation.
Drop in to our Slack channel and say hi!

If you want to get involved with PISA, please check out our Contributing page.

Reference

If you use PISA in a research paper, please cite the following reference:

@inproceedings{MSMS2019,
  author    = {Antonio Mallia and Michal Siedlaczek and Joel Mackenzie and Torsten Suel},
  title     = {{PISA:} Performant Indexes and Search for Academia},
  booktitle = {Proceedings of the Open-Source {IR} Replicability Challenge co-located
               with 42nd International {ACM} {SIGIR} Conference on Research and Development
               in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25,
               2019.},
  pages     = {50--56},
  year      = {2019},
  url       = {http://ceur-ws.org/Vol-2409/docker08.pdf}
}
