parpalak / Rose

Licence: MIT
Simple PHP search engine that supports Russian and English morphology

Projects that are alternatives to or similar to Rose

Vscode Es7 Javascript React Snippets
Extension for Javascript/React snippets with search supporting ES7 and babel features
Stars: ✭ 435 (+291.89%)
Mutual labels:  search, snippets
Lh Cpp
C&C++ ftplugins suite for Vim
Stars: ✭ 108 (-2.7%)
Mutual labels:  snippets
Node Sonic Channel
🦉 Sonic Channel integration for Node. Used in pair with Sonic, the fast, lightweight and schema-less search backend.
Stars: ✭ 101 (-9.01%)
Mutual labels:  search
Ds2i
A library of inverted index data structures
Stars: ✭ 104 (-6.31%)
Mutual labels:  search
Redux Search
Redux bindings for client-side search
Stars: ✭ 1,377 (+1140.54%)
Mutual labels:  search
Glsnip
copy and paste across machines
Stars: ✭ 107 (-3.6%)
Mutual labels:  snippets
Simpleaudioindexer
Searching for the occurrence seconds of words/phrases or arbitrary regex patterns within audio files
Stars: ✭ 100 (-9.91%)
Mutual labels:  search
Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+1372.97%)
Mutual labels:  search
Regex Snippets
Organized list of useful RegEx snippets
Stars: ✭ 109 (-1.8%)
Mutual labels:  snippets
Ghost Search
A simple but powerful search library for Ghost Blogging Platform.
Stars: ✭ 104 (-6.31%)
Mutual labels:  search
Sublime Robot Framework Assistant
Robot Framework plugin for Sublime Text3
Stars: ✭ 103 (-7.21%)
Mutual labels:  snippets
Cloudboost
Realtime JavaScript Backend.
Stars: ✭ 1,378 (+1141.44%)
Mutual labels:  search
Java
All Algorithms implemented in Java
Stars: ✭ 42,893 (+38542.34%)
Mutual labels:  search
Search
PHP search-systems made possible
Stars: ✭ 101 (-9.01%)
Mutual labels:  search
Unitylibrary
📚 Library of all kind of scripts, snippets & shaders for Unity
Stars: ✭ 1,968 (+1672.97%)
Mutual labels:  snippets
Atom Turbo Javascript
Commands and snippets for faster Javascript and Typescript with the Atom Editor
Stars: ✭ 100 (-9.91%)
Mutual labels:  snippets
Dotfiles
This is a mirror from https://gitlab.com/andreyorst/dotfiles
Stars: ✭ 103 (-7.21%)
Mutual labels:  snippets
Sketchup Ruby Api Tutorials
SketchUp Ruby API Tutorials and Examples
Stars: ✭ 105 (-5.41%)
Mutual labels:  snippets
Laravel 5 Snippets
Laravel 5 Snippets for Sublime Text
Stars: ✭ 110 (-0.9%)
Mutual labels:  snippets
Go Sonic
Sonic driver written in Go.
Stars: ✭ 110 (-0.9%)
Mutual labels:  search

Rose

Rose is a simple search engine for content sites with simplified English and Russian morphology support. It indexes your content and provides full-text search.

Requirements

  1. PHP 5.6 or later.
  2. A relational database if the amount of content is significant (only MySQL is supported for now).

Installation

composer require s2/rose

If you do not use Composer, download the archive, unpack it, and make sure the PHP files from the src/ directory are loaded according to a PSR-0/4 scheme. Using Composer is strongly recommended, though.

Usage

Preparing Storage

The index can be stored in a database or in a file. A storage is an abstraction layer that hides the implementation details. In most cases you will need the database storage, PdoStorage.

Both indexing and searching require the storage.

$pdo = new \PDO('mysql:host=127.0.0.1;dbname=s2_rose_test;charset=utf8', 'username', 'passwd');
$pdo->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);

use S2\Rose\Storage\Database\PdoStorage;

$storage = new PdoStorage($pdo, 'table_prefix_');

To rebuild the index, call the PdoStorage::erase() method:

$storage->erase();

It drops the index tables (if they exist) and creates new ones from scratch. Calling this method is enough to upgrade to a new version of Rose that breaks backward compatibility of the index.

Morphology

For natural language processing, Rose uses stemmers. A stemmer cuts off the changing part of a word (its inflectional ending), and Rose works with the resulting stems. Rose has no built-in dictionaries but contains heuristic stemmers developed by Porter. You can integrate any other algorithm by implementing the StemmerInterface.

use S2\Rose\Stemmer\PorterStemmerEnglish;
use S2\Rose\Stemmer\PorterStemmerRussian;

// For optimization primary language goes first (in this case Russian)
$stemmer = new PorterStemmerRussian(new PorterStemmerEnglish());
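
As a rough illustration of what cutting off the changing part of a word means, the sketch below strips a few English suffixes in plain PHP. It is not Porter's algorithm and not part of Rose: the function name naiveStem() and the suffix list are assumptions made up for this example, and it only handles ASCII words.

```php
<?php
/**
 * Hypothetical helper (not part of Rose): strips a few common English
 * suffixes so that different forms of a word share one stem.
 */
function naiveStem($word)
{
    $word = strtolower($word);
    // Longer suffixes are tried first so that "ing" wins over "s".
    foreach (['ings', 'ing', 'ed', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least three characters so short words stay intact.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}
```

With this sketch, 'indexed', 'indexes' and 'indexing' all reduce to the stem 'index', which is why a query for one form can match pages containing another. A real stemmer such as Porter's applies many more rules.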

Indexing

Indexer builds the search index. It depends on a stemmer and a storage.

use S2\Rose\Indexer;

$indexer = new Indexer($storage, $stemmer);

Indexer accepts your data in a special format. The data must be wrapped in the Indexable class:

use S2\Rose\Entity\Indexable;

// constructor params (the instance ID is optional)
$indexable = new Indexable(
	'id_1',            // External ID - an identifier in your system
	'Test page title', // Title
	'This is the first page to be indexed. I have to make up a content.', // Content
	1                  // Instance ID - an optional ID of your subsystem
);

// optional params
$indexable
	->setKeywords('singlekeyword, multiple keywords')       // The same as Meta Keywords
	->setDescription('Description can be used in snippets') // The same as Meta Description
	->setDate(new \DateTime('2016-08-24 00:00:00'))
	->setUrl('url1')
;

$indexer->index($indexable);

$indexable = new Indexable(
	'id_2',
	'Test page title 2',
	'This is the second page to be indexed. Let\'s compose something new.'
);
$indexable->setKeywords('content, page');

$indexer->index($indexable);

The constructor of Indexable accepts 4 arguments; the first three are required:

  • external ID - an arbitrary string ID that is sufficient for your code to identify the page;
  • page title;
  • page content;
  • instance ID - an optional int ID of the page source (e.g. for multi-site services).

You may also provide some optional parameters: keywords, description, date and URL. Keywords affect the relevance. The description can be used for building a snippet (see below). It's a good idea to use the content of the "keywords" and "description" meta tags for this purpose (if you have any, of course). The URL can be an arbitrary string.

The Indexer::index() method is used both for adding content to the index and for updating it. If the content has not changed, this method skips the job. Otherwise, the old content is removed and the page is indexed again.

When you remove a page from the site, just call:

$indexer->removeById($externalId, $instanceId);

Searching

Full-text search results can be obtained via the Finder class. $resultSet->getItems() returns all the information about the content items and their relevance.

use S2\Rose\Finder;
use S2\Rose\Entity\Query;

$finder    = new Finder($storage, $stemmer);
$resultSet = $finder->find(new Query('content'));

foreach ($resultSet->getItems() as $item) {
	                         // first iteration:          second iteration:
	$item->getId();          // 'id_2'                    'id_1'
	$item->getInstanceId();  // null                      1
	$item->getTitle();       // 'Test page title 2'       'Test page title'
	$item->getUrl();         // ''                        'url1'
	$item->getDescription(); // ''                        'Description can be used in snippets'
	$item->getDate();        // null                      new \DateTime('2016-08-24 00:00:00')
	$item->getRelevance();   // 31.0                      1.0
	$item->getSnippet();     // ''                        'Description can be used in snippets'
}

Set a limit and an offset on the Query object to paginate the results:

$query = new Query('content');
$query
	->setLimit(10)  // 10 results per page
	->setOffset(20) // third page
;
$resultSet = $finder->find($query);
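
The offset in the example above follows from the page number: offset = (page - 1) × limit, so page 3 with 10 results per page starts at offset 20. A minimal helper for that arithmetic might look like this; offsetForPage() is a hypothetical name, not a Rose function.

```php
<?php
/**
 * Hypothetical helper (not part of Rose): converts a 1-based page number
 * into the offset expected by Query::setOffset().
 */
function offsetForPage($page, $perPage)
{
    if ($page < 1 || $perPage < 1) {
        throw new \InvalidArgumentException('Page and page size must be positive.');
    }
    return ($page - 1) * $perPage;
}

// Usage with the Query object shown above:
// $query->setLimit(10)->setOffset(offsetForPage(3, 10)); // offset 20
```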

Adjust the relevance for favorite and popular pages:

use S2\Rose\Entity\ExternalId;

$resultSet = $finder->find(new Query('content'));
$externalId1 = $resultSet->getFoundExternalIds()->toArray()[0];
var_dump($externalId1->getId(), $externalId1->getInstanceId()); // id_1 1
$resultSet->setRelevanceRatio($externalId1, 3.14);
$resultSet->setRelevanceRatio(new ExternalId('id_2', null), 2);

foreach ($resultSet->getItems() as $item) {
	                         // first iteration:          second iteration:
	$item->getId();          // 'id_2'                    'id_1'
	$item->getRelevance();   // 62.0                      3.14
}
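
Judging by the numbers in the comments above, the ratio acts as a multiplier on the original relevance (31.0 × 2 = 62.0, 1.0 × 3.14 = 3.14), and the results are then ordered by the adjusted value. The sketch below reproduces that arithmetic on plain arrays; applyRelevanceRatios() is a hypothetical function written for illustration and does not reflect how Rose implements this internally.

```php
<?php
/**
 * Hypothetical illustration (not Rose code): multiplies each relevance by its
 * ratio (1 when no ratio is given) and sorts the result in descending order.
 *
 * @param float[] $relevanceById relevance keyed by page ID
 * @param float[] $ratioById     ratios keyed by page ID
 * @return float[]
 */
function applyRelevanceRatios(array $relevanceById, array $ratioById)
{
    $adjusted = [];
    foreach ($relevanceById as $id => $relevance) {
        $ratio = isset($ratioById[$id]) ? $ratioById[$id] : 1;
        $adjusted[$id] = $relevance * $ratio;
    }
    arsort($adjusted); // highest relevance first, keys preserved
    return $adjusted;
}

$adjusted = applyRelevanceRatios(
    ['id_2' => 31.0, 'id_1' => 1.0],
    ['id_1' => 3.14, 'id_2' => 2]
);
// $adjusted is ['id_2' => 62.0, 'id_1' => 3.14]
```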

Provide an instance ID to limit the scope of the search to a subsystem:

$resultSet = $finder->find((new Query('content'))->setInstanceId(1));
$resultSet->setRelevanceRatio(new ExternalId('id_1', 1), 3.14);

foreach ($resultSet->getItems() as $item) {
	                         // first iteration:
	$item->getId();          // 'id_1'
	$item->getInstanceId();  // 1
}

Highlighting and Snippets

It's a common practice to highlight the found words in the search results. You can obtain the highlighted title:

$resultSet = $finder->find(new Query('title'));
$resultSet->getItems()[0]->getHighlightedTitle($stemmer); // 'Test page <i>title</i>'

This method requires the stemmer since it takes into account the morphology and highlights all the word forms. By default, words are highlighted with italics. You can change the highlight template by calling $finder->setHighlightTemplate('<b>%s</b>').
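
The %s in the template suggests a printf-style placeholder that is replaced with the matched word. Assuming that is the case, the following sketch shows the general idea with plain sprintf(); highlightWords() is a hypothetical function, and unlike Rose it matches exact words only and ignores morphology.

```php
<?php
/**
 * Hypothetical illustration (not Rose code): wraps every whole-word,
 * case-insensitive occurrence of the given words in a printf-style template.
 */
function highlightWords($text, array $words, $template)
{
    if (count($words) === 0) {
        return $text;
    }
    // Escape each word so it is matched literally inside the pattern.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/iu';

    return preg_replace_callback($pattern, function (array $match) use ($template) {
        return sprintf($template, $match[0]);
    }, $text);
}

echo highlightWords('Test page title', ['title'], '<b>%s</b>'), "\n";
// prints: Test page <b>title</b>
```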

Snippets are small text fragments that contain the found words and are displayed in the search results. SnippetBuilder processes the source text and selects the best-matching sentences. Attach the snippets just before calling $resultSet->getItems():

use S2\Rose\Entity\ExternalContent;
use S2\Rose\SnippetBuilder;

$snippetBuilder = new SnippetBuilder($stemmer);
$snippetBuilder->setSnippetLineSeparator(' &middot; '); // Set snippet line separator. Default is '... '.
$snippetBuilder->attachSnippets($resultSet, static function (array $externalIds) {
	/** @var \S2\Rose\Entity\ExternalId[] $externalIds */

	$result = new ExternalContent();
	foreach ($externalIds as $externalId) {
		if ($externalId->getId() === 'id_1') {
			$result->attach($externalId, 'This page is to be indexed. I have to make up a content.');
		}
		else {
			$result->attach($externalId, 'This is the second page to be indexed. Let\'s compose something new.');
		}
	}
	return $result;
});

$resultSet->getItems()[0]->getSnippet(); // 'I have to make up a <i>content</i>.'

Words in snippets are highlighted the same way as in titles.

Building snippets is a fairly heavy operation. Combine it with pagination to reduce the snippet generation time.