All Projects → matehat → Mhtextsearch

matehat / Mhtextsearch

Licence: mit
A fast full-text search library for Objective-C

Projects that are alternatives of or similar to Mhtextsearch

auctus
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Stars: ✭ 34 (-56.96%)
Mutual labels:  search-engine, index
Srchx
A standalone lightweight full-text search engine built on top of blevesearch and Go with multiple storage (scorch, boltdb, leveldb, badger)
Stars: ✭ 118 (+49.37%)
Mutual labels:  search-engine, leveldb
Riot
Go Open Source, Distributed, Simple and efficient Search Engine; Warning: This is V1 and beta version, because of big memory consume, and the V2 will be rewrite all code.
Stars: ✭ 6,025 (+7526.58%)
Mutual labels:  search-engine, index
Sonic
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
Stars: ✭ 12,347 (+15529.11%)
Mutual labels:  search-engine, index
indexd
An external bitcoind index management service module
Stars: ✭ 50 (-36.71%)
Mutual labels:  leveldb, index
flow-indexer
Flow-Indexer indexes flows found in chunked log files from bro,nfdump,syslog, or pcap files
Stars: ✭ 43 (-45.57%)
Mutual labels:  search-engine, index
Blast
Blast is a full text search and indexing server, written in Go, built on top of Bleve.
Stars: ✭ 934 (+1082.28%)
Mutual labels:  search-engine, index
Elasticsearch Spring Boot Spring Data
🏆 Starter example for using Elasticsearch repository with Springboot
Stars: ✭ 65 (-17.72%)
Mutual labels:  search-engine
100projectsofcode
A list of practical knowledge-building projects.
Stars: ✭ 1,183 (+1397.47%)
Mutual labels:  search-engine
Learned Indexes
Implementation of BTree part for paper 'The Case for Learned Index Structures'
Stars: ✭ 64 (-18.99%)
Mutual labels:  index
Goshare
Go Share your TimeSeries/NameSpace/KeyVal DataStore (using leveldb) over HTTP &/or ZeroMQ
Stars: ✭ 59 (-25.32%)
Mutual labels:  leveldb
Mailpile
A free & open modern, fast email client with user-friendly encryption and privacy features
Stars: ✭ 8,533 (+10701.27%)
Mutual labels:  search-engine
Shodanz
🔭 A modern, async Ruby gem for Shodan, the world's first search engine for Internet-connected devices.
Stars: ✭ 74 (-6.33%)
Mutual labels:  search-engine
Tinspin Indexes
Spatial index library with R*Tree, STR-Tree, Quadtree, CritBit, KD-Tree, CoverTree
Stars: ✭ 64 (-18.99%)
Mutual labels:  index
Xinahn Socket
一个开源,高隐私,自架自用的聚合搜索引擎。 https://xinahn.com
Stars: ✭ 77 (-2.53%)
Mutual labels:  search-engine
Awesome Cn
awesome项目中文翻译,提升查阅效率
Stars: ✭ 62 (-21.52%)
Mutual labels:  index
Cssdbpy
Fastest SSDB client written on Cython. Production ready
Stars: ✭ 79 (+0%)
Mutual labels:  leveldb
Flask Ponywhoosh
A Flask full-text search engine
Stars: ✭ 76 (-3.8%)
Mutual labels:  search-engine
Gnes
GNES is Generic Neural Elastic Search, a cloud-native semantic search system based on deep neural network.
Stars: ✭ 1,178 (+1391.14%)
Mutual labels:  search-engine
Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (-10.13%)
Mutual labels:  search-engine

MHTextSearch

A fast & minimal embedded full-text indexing library, written in Objective-C, built on top of Objective-LevelDB.

Installation

By far, the easiest way to integrate this library in your project is by using CocoaPods.

  1. Have CocoaPods installed, if you don't already

  2. In your Podfile, add the line

     pod 'MHTextSearch'
    
  3. Run pod install

  4. Add the libc++.dylib Framework to your project.

Simple API

Create a embedded textual index
MHTextIndex *index = [MHTextIndex textIndexInLibraryWithName:@"my.awesome.index"];
Index any objects

You can tell a MHTextIndex instance to index your objects (any object)

[index indexObject:anyObjectYouWant];
[index updateIndexForObject:anotherPreviousIndexedObject];
[index removeIndexForObject:anotherPreviousIndexedObject];

But for this to work, you need to tell us what identifier as NSData * can be used to uniquely refer to this object.

[index setIdentifier:^NSData *(MyCustomObject *object){
    return object.indexID; // a NSData instance
}];

You also need to give us details about the objects, like what are the pieces of text to index

[index setIndexer:^MHIndexedObject *(MyCustomObject *object, NSData *identifier){
    MHIndexedObject *indx = [MHIndexedObject new];
    indx.strings = @[ object.title, object.description ]; // Indexed strings
    indx.weight = object.awesomenessLevel;                // Weight given to this object, when sorting results
    indx.context = @{@"title": object.title};             // A NSDictionary that will be given alongside search results
    return indx;
}];

Finally, if you want to be able to get easy reference to your original object when you get search results, you can tell us how to do that for you

[index setObjectGetter:^MyCustomObject *(NSData *identifier){
    return [MyCustomObject customObjectFromIdentifier:identifier];
}];

and that's it! That's all you need to get a full-text index going. MHTextSearch takes care of splitting text into words, factoring out diacritics and capitalization, all with respect to locale, as you'd expect (well, Foundation does most of the job here).

You can then start searching:

[index enumerateObjectsForKeyword:@"duck" options:0 withBlock:^(MHSearchResultItem *item, 
                                                                NSUInteger rank, 
                                                                NSUInteger count, 
                                                                BOOL *stop){
                                                                    
    item.weight;      // As provided by you earlier
    item.rank;        // The effective rank in the search result
    item.object;      // The first time it is used, it will use the block
                            // you provided earlier to get the object
    item.context;     // The dictionary you provided in the "indexer" block
    item.identifier;  // The object identifier you provided in the "identifier" block

    NSIndexPath *token = item.resultTokens[0]; 
    /* This is an NSArray of NSIndexPath instances, each containing 3 indices:
     *   - mh_string : the string in which the token occured 
     *                 (here, 0 for the object's title)
     *   - mh_word : the position in the string where the word containing
     *               the token occured
     *   - mh_token : the position in the word where the token occured
     */
                        
    NSRange tokenRange = [item rangeOfToken:token];
    /* This gives the exact range of the matched token in the string where it was found.
     *
     * So, according to the example setup I've been giving from the start,  
     * if token.mh_string == 0, that means the token was found in the object's "title",
     * and [item.object.title substringWithRange:tokenRange] would yield "duck" (minus 
     * capitalization and diacritics).
     */
}];

You can also fetch the whole array of MHSearchResultItem instances at once using

NSArray *resultSet = [index searchResultForKeyword:@"duck"
                                           options:NSEnumerationReverse];
Subclassing

If giving blocks for specifying behavior is not your thing, you can also override the following methods:

  • -[MHTextIndex getIdentifierForObject:] which, by default uses the identifier block
  • -[MHTextIndex getIndexInfoForObject:andIdentifier:] which, by default uses the indexer block
  • -[MHTextIndex compareResultItem:withItem:reversed:] which is used to order the search result set

Using with Core Data

You can use NSManagedObject lifecycle methods to trigger changes to the text index. The following example was taken from http://www.adevelopingstory.com/blog/2013/04/adding-full-text-search-to-core-data.html and adapted to use with this project:


- (void)prepareForDeletion
{
    [super prepareForDeletion];

    if (self.indexID.length) {
        [textindex deleteIndexForObject:self.indexID];
    }
}

+ (NSData *)createIndexID {
    NSUUID *uuid = [NSUUID UUID];
    uuid_t uuidBytes;
    [uuid getUUIDBytes:uuidBytes];
    return [NSData dataWithBytes:uuidBytes length:16];
}

- (void)willSave
{
    [super willSave];

    if (self.indexID.length) {
        [textindex updateIndexForObject:self.indexID];
    } else {
        self.indexID = [[self class] createIndexID];
        [textindex indexObject:self.indexID];
    }
}

Operations & Queues

MHTextIndex uses a NSOperationQueue under the hood to coordinate indexing operations. It is exposed as a property named indexingQueue. You can thus set its maxConcurrentOperationCount property to control how concurrent the indexing can be. Since the underlying database library performing I/O is thread-safe, concurrency is not a problem. This also means you can explicitly wait for indexing operations to finish using:

[index.indexingQueue waitUntilAllOperationsAreFinished];

The three indexing methods -[MHTextIndex indexObject:], -[MHTextIndex updateIndexForObject:], -[MHTextIndex removeIndexForObject:] all return NSOperation instances, that you can take advantage of, if you need, using its completionBlock property or -[NSOperation waitUntilFinished] method.

Searching is also concurrent, but it uses a dispatch_queue_t (not yet exposed or tunable).

Performance & Fine tuning

There are a few knobs you can play with to make MHTextSeach better fit your needs.

  • A MHTextIndex instance has a skipStopWords boolean property that is true by default and that avoids indexing very common english words. (TODO: make that work with other languages)

  • It also has a minimalTokenLength that is equal to 2 by default. This sets a minimum for the number of letters that a token needs to be for it to be indexed. This also greatly minimizes the size of the index, as well as the indexing and searching time. It skips indexing single-letter words and the last letter of every word, when set to 2.

  • When indexing long form texts (documents rather than, say, simple names), you can turn on the discardDuplicateTokens boolean property on MHTextIndex. This makes the index only consider the first occurence of every indexed token for a given piece of texts. If you are okay with only knowing if a token appears in a text, rather than where does every occurence appear, you can gain a speed bump in indexing time, by a factor of 3 to 5.

The following graphs show the indexing and searching time (in seconds), as a function of the size of text indexed, ranging from 500 KB to about 10 MB. The benchmarks were run on an iPhone 5.

Testing

If you want to run the tests, you will need Xcode 5, as the test suite uses the new XCTest.

Clone this repository and, once in it,

$ cd MHTextSearch\ iOS\ Tests
$ pod install
$ cd .. && open *.xcworkspace

Currently, all tests were setup to work with the iOS test suite.

License

Distributed under the MIT license

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].