
jermp / autocomplete

License: MIT License
Efficient and effective query auto-completion in C++.

Programming Languages

C++, Python, CMake, Shell, JavaScript, HTML

Projects that are alternatives of or similar to autocomplete

extensions
Angular Material Extensions Library.
Stars: ✭ 203 (+625%)
Mutual labels:  autocomplete
prosemirror-autocomplete
Autocomplete suggestions for ProseMirror
Stars: ✭ 23 (-17.86%)
Mutual labels:  autocomplete
clui
A command system with extra steps
Stars: ✭ 18 (-35.71%)
Mutual labels:  autocomplete
information retrieval system
The goal of this project is to implement a basic information retrieval system using Python, NLTK and GenSIM.
Stars: ✭ 25 (-10.71%)
Mutual labels:  information-retrieval
see
Search Engine in Erlang
Stars: ✭ 27 (-3.57%)
Mutual labels:  information-retrieval
textgo
Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!
Stars: ✭ 33 (+17.86%)
Mutual labels:  text-search
react-native-dropdown-autocomplete
Autocomplete input with dropdown modal component for React native. Useful for pages with multiple autocomplete's.
Stars: ✭ 97 (+246.43%)
Mutual labels:  autocomplete
nama
Namespaced Aliases: organize and easily find aliases using autocomplete & namespacing
Stars: ✭ 26 (-7.14%)
Mutual labels:  autocomplete
wsdm-digg-2020
No description or website provided.
Stars: ✭ 15 (-46.43%)
Mutual labels:  information-retrieval
cherche
📑 Neural Search
Stars: ✭ 196 (+600%)
Mutual labels:  information-retrieval
td-completes-me
A Simple Auto Completion Engine for Touch Designer
Stars: ✭ 85 (+203.57%)
Mutual labels:  autocomplete
lsp signature.nvim
LSP signature hint as you type
Stars: ✭ 915 (+3167.86%)
Mutual labels:  autocomplete
frontal
An Angular select/dropdown component
Stars: ✭ 20 (-28.57%)
Mutual labels:  autocomplete
evildork
Evildork targeting your fiancee👁️
Stars: ✭ 46 (+64.29%)
Mutual labels:  information-retrieval
awesome-semantic-search
A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.
Stars: ✭ 161 (+475%)
Mutual labels:  information-retrieval
tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Stars: ✭ 92 (+228.57%)
Mutual labels:  information-retrieval
SWDM
SIGIR 2017: Embedding-based query expansion for weighted sequential dependence retrieval model
Stars: ✭ 35 (+25%)
Mutual labels:  information-retrieval
config
Fig's integrations with bash, zsh, fish, ssh, and tmux. Also contains Fig's installation and update scripts. Finally, this repo is the root of the .fig folder that is installed on your computer when you download Fig!
Stars: ✭ 44 (+57.14%)
Mutual labels:  autocomplete
tabnine-vim
Vim client for TabNine. https://vimawesome.com/plugin/tabnine-vim
Stars: ✭ 652 (+2228.57%)
Mutual labels:  autocomplete
autocomplete
Simple accessible autocomplete for vanilla JavaScript with support for remote & local data, ~3KB gzip
Stars: ✭ 38 (+35.71%)
Mutual labels:  autocomplete


Autocomplete

A Query Auto-Completion system based on the paper Efficient and Effective Query Auto-Completion, by Simon Gog, Giulio Ermanno Pibiri, and Rossano Venturini, published in ACM SIGIR 2020.

Table of contents
  1. Installation and quick start
  2. Compiling the code
  3. Input data format
  4. Running the unit tests
  5. Building an index
  6. Benchmarks
  7. Live demo

Installation and quick start

Just run

bash ./install.sh

from the parent directory. The script builds the code, prepares the test data in the folder test_data/trec_05_efficiency_queries for indexing, and executes the unit tests.

After that, to see a minimal working example, just run

bash ./example.sh

and then access the service from localhost.

Or you can use a prebuilt Docker image

The following command pulls a prebuilt Docker image and runs it locally.

docker pull jermp/autocomplete
docker run -p 8000:8000 -d jermp/autocomplete

And then access the service from localhost.

Compiling the code

The code has been tested on Linux with gcc 7.4.0, 8.3.0, 9.0.0 and on macOS 10.14 with clang 10.0.0. To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/autocomplete.git

If you have cloned the repository without --recursive, you will need to perform the following commands before compiling:

git submodule init
git submodule update

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make

Hint: Use make -j to compile the library in parallel using all available threads.

For best performance, we recommend compiling with:

cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SANITIZERS=Off -DUSE_INTRINSICS=On -DUSE_PDEP=On

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=On
make

Input data format

The input file should list all completions in lexicographical order. For example, see the file test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions.

The first column represents the ID of the completion; the other columns contain the tokens of the completion, separated by whitespace. (The IDs for the file trec_05_efficiency_queries.completions are fake, i.e., they do not take into account any particular assignment.)
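
For illustration only, a tiny and entirely made-up completions file following this format could look like the following, with completions sorted lexicographically by their tokens and arbitrary IDs in the first column:

1 amazon
3 amazon prime
2 amazon prime video
5 barnes and noble
4 beats headphones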

The script preprocess.sh in the directory test_data helps prepare the data for indexing. From within the directory test_data, it is sufficient to run:

bash preprocess.sh <test_collection> <num_queries>

For our example with trec_05_efficiency_queries, this becomes:

bash preprocess.sh trec_05_efficiency_queries/trec_05_efficiency_queries.completions 300

The second argument in the example, i.e., 300, represents the number of completions (per completion size) that are drawn at random and could be used to query the indexes.

If you run the script, you will get:

  • trec_05_efficiency_queries.completions.dict: lists all the distinct tokens in the completions sorted in lexicographical order.

  • trec_05_efficiency_queries.completions.mapped: lists all completions whose tokens have been mapped to integer ids, as assigned by a lexicographically-sorted string dictionary (which should be built from the tokens listed in trec_05_efficiency_queries.completions.dict). Each completion terminates with the id 0; see the sketch after this list.

  • trec_05_efficiency_queries.completions.mapped.stats contains some statistics about the datasets, needed to build the data structures more efficiently.

  • trec_05_efficiency_queries.completions.inverted is the inverted file.

  • trec_05_efficiency_queries.completions.forward is the forward file. Note that each list is not sorted; the lists are the same as those contained in trec_05_efficiency_queries.completions.mapped, but arranged in docID order.
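
To make the relationship between the .dict and .mapped files concrete, here is a minimal Python sketch of the mapping described above. It is not the project's preprocess.sh: the actual id assignment and file layout are determined by the scripts in test_data, and the completions below are made up.

# Sketch only (not the project's preprocess.sh): shows how completions could be
# mapped to integer ids using a lexicographically-sorted token dictionary, with
# each mapped completion terminated by the reserved id 0.
completions = [
    (1, ["amazon"]),
    (3, ["amazon", "prime"]),
    (2, ["amazon", "prime", "video"]),
]

# Distinct tokens, sorted lexicographically (the role of the ".dict" file).
tokens = sorted({t for _, toks in completions for t in toks})

# Assign ids starting from 1, reserving 0 as the terminator.
token_id = {t: i + 1 for i, t in enumerate(tokens)}

# Map each completion to its token ids followed by 0 (the role of the ".mapped" file).
for completion_id, toks in completions:
    print(completion_id, *[token_id[t] for t in toks], 0)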

Running the unit tests

The unit tests are written using doctest.

After compilation and preparation of the data for indexing (see Section Input data format), it is advised to run the unit tests with:

make test

Building an index

After compiling the code, run the program ./build to build an index. You can specify the type of the index and the name of the file where the index will be written.

For example, with

./build ef_type1 ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions -o trec05.ef_type1.bin

we can build an index of type ef_type1 from the test file ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions, which will be serialized to the file trec05.ef_type1.bin.

Possible types are ef_type1, ef_type2, ef_type3 and ef_type4.

Note: the type ef_type4 requires an extra parameter, c, to be specified. Use, for example: -c 0.0001.

Benchmarks

To run the top-k benchmarks in the /benchmark directory, we first need some query logs. They should already have been created if you have run the script preprocess.sh; otherwise, you can use

python partition_queries_by_length.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries 300

to partition the input completions by number of query terms and retain 300 queries at random. Query files are placed in the output directory trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries. (By default, 7 shards will be created: the ones having [1,6] query terms and the one collecting all completions with at least 7 query terms).
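
If you just need the idea rather than the exact script, the following Python sketch shows how such a partition can be computed. It is only an illustration under assumed input/output conventions, not the repository's partition_queries_by_length.py.

# Sketch only: group completions by their number of terms and sample up to
# num_queries of each length, bucketing lengths >= 7 into a single shard.
# The real script's command-line interface and output files differ.
import random
from collections import defaultdict

def partition_by_length(lines, num_queries, max_length=7, seed=0):
    shards = defaultdict(list)
    for line in lines:
        terms = line.split()[1:]              # drop the leading completion ID
        if not terms:
            continue
        length = min(len(terms), max_length)  # max_length collects all longer queries
        shards[length].append(" ".join(terms))
    rng = random.Random(seed)
    return {k: rng.sample(v, min(num_queries, len(v))) for k, v in sorted(shards.items())}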

Then the command

./benchmark_topk ef_type1 10 trec05.ef_type1.bin 3 300 0.25 < ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries/queries.length=3.shuffled

will execute 300 top-10 queries with 3 terms, where only the first 25% of the last token is retained as the query prefix.
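
The last argument (0.25 above) controls how much of the last token is kept as the typed prefix. As a rough illustration only (the benchmark's exact rounding and handling may differ), it can be thought of as:

# Sketch only: keep the first 25% (rounded up, at least one character) of the
# last token of a query; the remaining tokens are kept whole.
import math

def truncate_last_token(query, percentage=0.25):
    tokens = query.split()
    keep = max(1, math.ceil(len(tokens[-1]) * percentage))
    tokens[-1] = tokens[-1][:keep]
    return " ".join(tokens)

print(truncate_last_token("university of michigan"))  # -> "university of mi"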

We automated the collection of results with the script script/collect_results_by_varying_percentage.py. From within the /build directory, run

python ../script/collect_results_by_varying_percentage.py ef_type1 topk trec05.ef_type1.bin ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions 10 300

To benchmark the dictionaries (Front-Coding and trie), just run the following script from within the script directory:

bash benchmark_dictionaries.sh

Live demo

Start the web server with the program ./web_server <port> <index_filename> and access the demo at localhost:<port>.
