
o19s / elasticsearch-ltr-demo

Licence: other

Programming Languages

HTML, Python, JavaScript, Dockerfile, Shell

Projects that are alternatives of or similar to elasticsearch-ltr-demo

Spotlight
Deep recommender models using PyTorch.
Stars: ✭ 2,623 (+7614.71%)
Mutual labels:  learning-to-rank
Ranking
Learning to Rank in TensorFlow
Stars: ✭ 2,362 (+6847.06%)
Mutual labels:  learning-to-rank
Lightfm
A Python implementation of LightFM, a hybrid recommendation algorithm.
Stars: ✭ 3,884 (+11323.53%)
Mutual labels:  learning-to-rank
SERank
An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow implementation of SERank model. The code is developed based on TF-Ranking.
Stars: ✭ 42 (+23.53%)
Mutual labels:  learning-to-rank
recsys2019
The complete code and notebooks used for the ACM Recommender Systems Challenge 2019
Stars: ✭ 26 (-23.53%)
Mutual labels:  learning-to-rank
stringsifter
A machine learning tool that ranks strings based on their relevance for malware analysis.
Stars: ✭ 567 (+1567.65%)
Mutual labels:  learning-to-rank
Ranked-List-Loss-for-DML
CVPR 2019: Ranked List Loss for Deep Metric Learning, with extension for TPAMI submission
Stars: ✭ 56 (+64.71%)
Mutual labels:  learning-to-rank
ltr-tools
Set of command line tools for Learning To Rank
Stars: ✭ 13 (-61.76%)
Mutual labels:  learning-to-rank
src
tools for fast reading of docs
Stars: ✭ 40 (+17.65%)
Mutual labels:  learning-to-rank
FastAP-metric-learning
Code for CVPR 2019 paper "Deep Metric Learning to Rank"
Stars: ✭ 93 (+173.53%)
Mutual labels:  learning-to-rank
fastrank
My most frequently used learning-to-rank algorithms ported to rust for efficiency. Try it: "pip install fastrank".
Stars: ✭ 43 (+26.47%)
Mutual labels:  learning-to-rank
EMNLP2020
This is official Pytorch code and datasets of the paper "Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News", EMNLP 2020.
Stars: ✭ 55 (+61.76%)
Mutual labels:  learning-to-rank

Learning to Rank Demo

This demo uses data from TheMovieDB (TMDB) to demonstrate using Ranklib learning to rank models with Elasticsearch.

You can go through the individual steps, or if you want to just skip to the end, you can use Docker:

docker-compose up

And browse to http://localhost:8000

Install Dependencies and prep data...

This demo requires

  • Python 3+
  • Python elasticsearch and requests libraries
pip3 install requests elasticsearch5 parse jinja

Download the TMDB Data & Ranklib Jar

The first time you run this demo, fetch RankyMcRankFace.jar (used to train the models) and tmdb.json (the movie dataset):

cd train
./prepare.sh

Start Elasticsearch/install plugin

Start a supported version of Elasticsearch and follow the instructions to install the learning to rank plugin.

docker run -d -p 9201:9200 -p 9301:9300 -e "discovery.type=single-node" --name elasticsearch5 elasticsearch:5.6.4

Index to Elasticsearch

This script will create a 'tmdb' index with default/simple mappings. You can edit this file to play with mappings.

python indexMlTmdb.py
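
For reference, the indexing step boils down to a bulk load of tmdb.json into the 'tmdb' index. A minimal sketch, assuming the dockerized Elasticsearch above (port 9201), a 'movie' type, and that tmdb.json is a dict keyed by movie id:

import json
from elasticsearch5 import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9201")

with open("tmdb.json") as f:
    movies = json.load(f)  # assumed: a dict keyed by TMDB movie id

# bulk index every movie into the 'tmdb' index
actions = ({"_index": "tmdb", "_type": "movie", "_id": movie_id, "_source": movie}
           for movie_id, movie in movies.items())
helpers.bulk(es, actions)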

Onto the machine learning...

TLDR

If you're actually going to build a learning to rank system, read past this section. But to sum up, the full movie demo can be run with

python train.py

Then you can search using

python search.py Rambo

and the search results will be printed to the console.

More on how all this actually works below:

Create and upload features (loadFeatures.py)

A "feature" in ES LTR corresponds to an Elasticsearch query. The score yielded by the query is used to train and evaluate the model. For example, if you feel that a TF*IDF title score corresponds to higher relevance, then that's a feature you'd want to train on! Other features might include how old a movie is, the number of keywords in a query, or whatever else you suspect might correlate to your user's sense of relevance.

If you examine loadFeatures.py you'll see how we create features. We first initialize the default feature store (PUT /_ltr). We create a feature set (POST /_ltr/_featureset/movie_features). Now we have a place to create features for both logging & use by our models!

In the demo, the files 1.json ... n.json are Mustache templates that correspond to the features. In this case, the features are identified by ordinal (feature 1 is in 1.json), and they are uploaded to Elasticsearch Learning to Rank with these ordinals as the feature names. In eachFeature, you'll see a loop where we read each Mustache template from the file system and return a JSON body for adding the feature to Elasticsearch.
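
Pulled together, loadFeatures.py boils down to something like this sketch (the port comes from the docker command above; the file layout and the feature count are assumptions based on the description here):

import json
import requests

ES = "http://localhost:9201"

requests.put(ES + "/_ltr")  # initialize the default feature store

def each_feature(n):
    # 1.json ... n.json hold the Mustache-templated queries; the ordinal becomes the feature name
    for ordinal in range(1, n + 1):
        with open("%d.json" % ordinal) as f:
            yield {"name": str(ordinal), "params": ["keywords"], "template": json.load(f)}

requests.post(ES + "/_ltr/_featureset/movie_features",
              json={"featureset": {"features": list(each_feature(2))}})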

For traditional Ranklib models, the ordinal is the only way features are identified. Other models use feature names which make developing, logging, and managing features more maintainable.

Gather Judgments (movie_judgments.txt)

The first part of the training data is the judgment list. We've provided one in movie_judgments.txt.

What's a judgment list? A judgment list tells us how relevant a document is for a search query. In other words, a three-tuple of

<grade>,<docId>,<keywords>

Quality comes in the form of grades. For example, if the movie "First Blood" is considered extremely relevant for the query Rambo, we give it a grade of 4 ('exactly relevant'), while the movie Bambi would receive a '0'. Instead of the notional CSV format above, Ranklib and other learning to rank systems use a format borrowed from LibSVM, shown below:

# qid:1: rambo
#
#
# grade (0-4)	queryid	 # docId	title
4	qid:1 #	7555	Rambo

You'll notice we bastardize this syntax to add comments identifying the keywords associated with each query id, and append metadata to each line. Code provided in judgments.py handles this syntax.
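
For a sense of how little machinery that takes, a toy parser for one of these annotated lines might look like this (judgments.py in the demo is the real implementation):

# Illustrative only
def parse_judgment_line(line):
    # e.g. "4\tqid:1 #\t7555\tRambo" -> (grade=4, qid=1, doc_id='7555')
    data, comment = line.split("#", 1)
    grade, qid = data.split()
    return int(grade), int(qid.replace("qid:", "")), comment.split()[0]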

Log features (collectFeatures.py)

You saw above how we created features; the next step is to log feature values for each judgment 3-tuple. This code is in collectFeatures.py. Logging features can be done in several different contexts. Of course, in a production system, you may wish to log features as users search. In other contexts, you may have a hand-created judgment list (as we do) and wish to simply ask Elasticsearch Learning to Rank for feature values for query/document pairs.

In collectFeatures.py, you'll see an sltr query is included. This query points to a featureset, not a model, so it does not influence the score. We filter down to the needed document ids for each keyword and allow this sltr query to run.

You'll also notice an ext component in the request. This search extension is part of the Elasticsearch Learning to Rank plugin and allows you to configure feature logging. You'll notice it refers to the sltr query by name, allowing it to pluck out that query and log the feature values associated with the feature set.
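
Roughly, the request collectFeatures.py builds looks like the sketch below (the document ids, query names, and keyword are placeholders):

import requests

log_query = {
    "query": {
        "bool": {
            "filter": [
                {"terms": {"_id": ["7555", "1370"]}},     # only the docs judged for this keyword
                {"sltr": {
                    "_name": "logged_featureset",          # named so the ext below can find it
                    "featureset": "movie_features",
                    "params": {"keywords": "rambo"}}},
            ]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {"name": "log_entry", "named_query": "logged_featureset"}
        }
    },
}
resp = requests.post("http://localhost:9201/tmdb/_search", json=log_query).json()
# each returned hit now carries a log entry with one value per feature in the set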

Once features are gathered, the judgment list is fleshed out with feature values; the ordinals below correspond to the features in our 1..n.json files.

4	qid:1	1:12.318446	2:9.8376875 # 7555	rambo

Train (train.py and RankLib.jar)

With training data in place, it's time to ask RankLib to train a model and output it to a text file. RankLib supports linear models, ListNet, and several tree-based models such as LambdaMART. In train.py you'll notice how RankLib is called with command line arguments. A test_N model is created in our feature store for each type of RankLib model. In the saveModel function, you can see how the model is uploaded to our "movie_features" feature set.
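
In spirit, that train-then-upload flow looks like this sketch (the RankLib command line flags are standard, but the training file name here is hypothetical):

import subprocess
import requests

subprocess.check_call([
    "java", "-jar", "RankyMcRankFace.jar",
    "-ranker", "6",                              # 6 = LambdaMART
    "-train", "movie_judgments_wfeatures.txt",   # hypothetical name for the logged judgments
    "-save", "model.txt",
    "-metric2t", "NDCG@10",
])

# upload the trained model against the movie_features feature set
with open("model.txt") as f:
    definition = f.read()

requests.post(
    "http://localhost:9201/_ltr/_featureset/movie_features/_createmodel",
    json={"model": {"name": "test_6",
                    "model": {"type": "model/ranklib", "definition": definition}}},
)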

Search using the model (search.py)

See what sort of search results you get! In search.py you'll see we execute the sltr query referring to a test_N model in the rescore phase. By default test_6 is used (corresponding to LambdaMART), but you can pick a different model on the command line.

Search with default LambdaMART:

python search.py rambo

Try a different model:

python search.py rambo test_8
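
Under the hood, the request search.py issues is along these lines (the baseline match field and window size are assumptions; the sltr rescore is the important part):

import requests

search = {
    "query": {"match": {"title": "rambo"}},       # a cheap baseline query
    "rescore": {
        "window_size": 1000,
        "query": {
            "rescore_query": {
                "sltr": {"params": {"keywords": "rambo"}, "model": "test_6"}
            }
        }
    },
}
hits = requests.post("http://localhost:9201/tmdb/_search", json=search).json()["hits"]["hits"]
for hit in hits[:10]:
    print(hit["_source"]["title"])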

Run the HTTP demo

To run the search page so you can poke and prod, run this from the /app directory:

./srv.sh

Browse to http://localhost:8000
