
thunlp / Openmatch

License: MIT
An Open-Source Package for Information Retrieval.

Programming Languages

Python (139,335 projects; #7 most used programming language)

Projects that are alternatives of or similar to Openmatch

Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+1732.8%)
Mutual labels:  information-retrieval
Tutorial Utilizing Kg
Resources for Tutorial on "Utilizing Knowledge Graphs in Text-centric Information Retrieval"
Stars: ✭ 148 (-20.43%)
Mutual labels:  information-retrieval
Books
Books worth spreading
Stars: ✭ 161 (-13.44%)
Mutual labels:  information-retrieval
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course taught by Dan Jurafsky and Chris Manning in Winter 2012.
Stars: ✭ 124 (-33.33%)
Mutual labels:  information-retrieval
Entityduetneuralranking
Entity-Duet Neural Ranking Model
Stars: ✭ 137 (-26.34%)
Mutual labels:  information-retrieval
Terrier Core
Terrier IR Platform
Stars: ✭ 156 (-16.13%)
Mutual labels:  information-retrieval
Pytrec eval
pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
Stars: ✭ 114 (-38.71%)
Mutual labels:  information-retrieval
Neuralqa
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
Stars: ✭ 185 (-0.54%)
Mutual labels:  information-retrieval
Invoicenet
Deep neural network to extract intelligent information from invoice documents.
Stars: ✭ 1,886 (+913.98%)
Mutual labels:  information-retrieval
Bm25
A Python implementation of the BM25 ranking function.
Stars: ✭ 159 (-14.52%)
Mutual labels:  information-retrieval
Foundry
The Cognitive Foundry is an open-source Java library for building intelligent systems using machine learning
Stars: ✭ 124 (-33.33%)
Mutual labels:  information-retrieval
Easyocr
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Stars: ✭ 13,379 (+7093.01%)
Mutual labels:  information-retrieval
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+6761.83%)
Mutual labels:  information-retrieval
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-34.95%)
Mutual labels:  information-retrieval
Ranking
Learning to Rank in TensorFlow
Stars: ✭ 2,362 (+1169.89%)
Mutual labels:  information-retrieval
Scilla
🏴‍☠️ Information Gathering tool 🏴‍☠️ DNS / Subdomains / Ports / Directories enumeration
Stars: ✭ 116 (-37.63%)
Mutual labels:  information-retrieval
Pyserini
Python interface to the Anserini IR toolkit built on Lucene
Stars: ✭ 148 (-20.43%)
Mutual labels:  information-retrieval
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (+1.08%)
Mutual labels:  information-retrieval
K Nrm
K-NRM: End-to-End Neural Ad-hoc Ranking with Kernel Pooling
Stars: ✭ 183 (-1.61%)
Mutual labels:  information-retrieval
Sf1r Lite
Search Formula-1: a distributed, high-performance, massive-data engine for enterprise/vertical search
Stars: ✭ 158 (-15.05%)
Mutual labels:  information-retrieval

OpenMatch

An Open-Source Package for Information Retrieval.

😃 What's New

  • Top Spot on the TREC-COVID Challenge (May 2020, Round 2)

    The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.
    >> Reproduce Our Submission >> About the COVID-19 Dataset >> Our Paper

Overview

OpenMatch integrates state-of-the-art neural methods and IR technologies into a complete pipeline for deep text matching and understanding.

1/ Document Retrieval

Document Retrieval extracts a set of candidate documents from a large-scale collection in response to a user query.

* Sparse Retrieval

The Sparse Retriever is a bag-of-words retrieval model that scores documents over sparse term vectors (e.g., BM25; see the sketch after this list).

* Dense Retrieval

The Dense Retriever encodes documents and queries into dense, low-dimensional vectors and selects the documents with the highest inner product with the query, as sketched below.
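
To make the two retrieval styles concrete, here is a minimal, self-contained sketch (illustrative only, not OpenMatch's API; the toy corpus, the random stand-in encoder, and all parameter values are hypothetical):

import math
import numpy as np

docs = [
    "lymphocyte count predicts covid-19 prognosis",
    "bm25 is a bag of words ranking function",
]
query = "covid-19 prognosis"

# Sparse (bag-of-words) scoring: a toy BM25 over whitespace tokens.
def bm25(query, doc, corpus, k1=0.9, b=0.4):
    doc_terms = doc.split()
    n = len(corpus)
    avgdl = sum(len(d.split()) for d in corpus) / n
    score = 0.0
    for t in set(query.split()):
        tf = doc_terms.count(t)
        if tf == 0:
            continue
        df = sum(t in d.split() for d in corpus)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

sparse_scores = [bm25(query, d, docs) for d in docs]

# Dense scoring: inner product of low-dimensional embeddings.
rng = np.random.default_rng(0)

def encode(texts, dim=128):
    # Stand-in for a trained bi-encoder (e.g., BERT); random vectors here.
    return rng.normal(size=(len(texts), dim))

doc_vecs = encode(docs)              # corpus vectors, computed offline
query_vec = encode([query])[0]
dense_scores = doc_vecs @ query_vec  # higher inner product = more relevant

print(sparse_scores, dense_scores)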

2/ Document Reranking

Document Reranking further matches the user query against the documents retrieved in the previous stage to produce a final ranked list of relevant documents.

* Neural Ranker

The Neural Ranker uses a neural network to reorder the candidate documents.

* Feature Ensemble

Feature Ensemble fuses features learned by the neural ranker with features from non-neural methods (e.g., BM25 scores) for more robust performance; a sketch follows this list.
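
As a concrete illustration (a sketch with made-up scores and fixed weights, not OpenMatch's Coordinate Ascent implementation), a linear feature ensemble combines the neural and non-neural signals like this:

import numpy as np

# Per-document feature vectors: [neural ranker score, BM25 score].
# The values are invented for illustration.
features = np.array([
    [0.82, 11.4],
    [0.45, 17.9],
    [0.91,  3.2],
])

# In practice the weights are tuned (e.g., by coordinate ascent) to
# maximize a ranking metric on held-out data; fixed here for brevity.
weights = np.array([0.7, 0.3])

fused = features @ weights    # linear fusion of the two signals
ranking = np.argsort(-fused)  # rerank documents by the fused score
print(ranking)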

3/ Domain Transfer Learning

Domain Transfer Learning leverages external knowledge graphs or weak-supervision data to guide the ranker and help it overcome data scarcity.

* Knowledge Enhancement

Knowledge Enhancement incorporates entity semantics from external knowledge graphs to strengthen the neural ranker.

* Data Augmentation

Data Augmentation leverages weak-supervision data to improve ranking accuracy in domains that lack large-scale relevance labels; a toy illustration follows this list.
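
The following toy sketch shows the flavor of weak supervision for ranking: document titles serve as pseudo queries paired with their own bodies as positives (hypothetical data; methods like ReInfoSelect additionally learn to select which weak pairs actually help the ranker):

# Build weak (query, document, label) triples without human labels:
# a document's title acts as a pseudo query for its own body.
corpus = [
    {"title": "lymphocyte count in covid-19 prognosis", "body": "..."},
    {"title": "bm25 ranking function overview", "body": "..."},
]

weak_pairs = []
for i, doc in enumerate(corpus):
    positive = doc["body"]
    negative = corpus[(i + 1) % len(corpus)]["body"]  # naive negative sample
    weak_pairs.append((doc["title"], positive, 1))
    weak_pairs.append((doc["title"], negative, 0))

print(weak_pairs[0])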

Stage                    | Model             | Paper
1/ Sparse Retrieval      | BM25              | Best Match 25 (Tool)
1/ Dense Retrieval       | ANN               | Approximate Nearest Neighbor (Tool)
2/ Neural Ranker         | K-NRM             | End-to-End Neural Ad-hoc Ranking with Kernel Pooling (Paper)
2/ Neural Ranker         | Conv-KNRM         | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search (Paper)
2/ Neural Ranker         | TK                | Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking (Paper)
2/ Neural Ranker         | BERT              | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Paper)
2/ Feature Ensemble      | Coordinate Ascent | Linear Feature-Based Models for Information Retrieval (Paper)
3/ Knowledge Enhancement | EDRM              | Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval (Paper)
3/ Data Augmentation     | ReInfoSelect      | Selective Weak Supervision for Neural Information Retrieval (Paper)

Note that the BERT model follows Hugging Face's Transformers implementation, so other BERT-like models (e.g., ELECTRA, SciBERT) are also available in the toolkit.
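
For example, swapping in ELECTRA could look like the sketch below; the checkpoint name is a standard Hugging Face hub one, and passing it to om.models.Bert is an assumption based on the note above:

from transformers import AutoTokenizer
import torch
import OpenMatch as om

query = "Classification treatment COVID-19"   # any query/document pair
doc = "lymphocyte count is a reliable indicator for COVID-19 prognosis"

# "google/electra-base-discriminator" is a standard Hugging Face hub
# checkpoint; that the Bert wrapper accepts any BERT-like model name
# is assumed here, per the note above.
tokenizer = AutoTokenizer.from_pretrained("google/electra-base-discriminator")
input_ids = tokenizer.encode(query, doc)
model = om.models.Bert("google/electra-base-discriminator")
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))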

Installation

* Via pip (from GitHub)

pip install git+https://github.com/thunlp/OpenMatch.git

* From Source

git clone https://github.com/thunlp/OpenMatch.git
cd OpenMatch
python setup.py install

* From Docker

To build an OpenMatch Docker image from the provided Dockerfile:

docker build -t <image_name> .

To run the image you just built as a container:

docker run --gpus all --name=<container_name> -it -v /:/all/ --rm <image_name>:<TAG>

Quick Start

* Detailed examples are available here.

import torch
import OpenMatch as om

# A sample query and a candidate document to score against it.
query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."

* For BERT-like models:

from transformers import AutoTokenizer

# Encode the query-document pair into a single input sequence.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
input_ids = tokenizer.encode(query, doc)
# Load a SciBERT-based ranker and score the pair; unsqueeze(0) adds
# the batch dimension.
model = om.models.Bert("allenai/scibert_scivocab_uncased")
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))

* For other models:

# Tokenize with a word-level tokenizer backed by pretrained GloVe
# embeddings; pad/truncate the query and document to fixed lengths.
tokenizer = om.data.tokenizers.WordTokenizer(pretrained="./data/glove.6B.300d.txt")
query_ids, query_masks = tokenizer.process(query, max_len=16)
doc_ids, doc_masks = tokenizer.process(doc, max_len=128)
# Build a K-NRM ranker on top of the GloVe vocabulary and embeddings.
model = om.models.KNRM(vocab_size=tokenizer.get_vocab_size(),
                       embed_dim=tokenizer.get_embed_dim(),
                       embed_matrix=tokenizer.get_embed_matrix())
# ranking_score is the relevance score for the (query, doc) pair;
# ranking_features are intermediate matching features, usable by the
# feature ensemble stage.
ranking_score, ranking_features = model(torch.tensor(query_ids).unsqueeze(0),
                                        torch.tensor(query_masks).unsqueeze(0),
                                        torch.tensor(doc_ids).unsqueeze(0),
                                        torch.tensor(doc_masks).unsqueeze(0))

* The GloVe embeddings can be downloaded with:

wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data
unzip ./data/glove.6B.zip -d ./data

* Evaluation

metric = om.Metric()
# trec_eval-style measure names, e.g. NDCG with a rank cutoff of 20.
res = metric.get_metric(qrels, ranking_list, 'ndcg_cut_20')
# Mean reciprocal rank with a rank cutoff of 10.
res = metric.get_mrr(qrels, ranking_list, 'mrr_cut_10')
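
The qrels and ranking_list arguments are not constructed in this snippet. A plausible shape for them, assuming the pytrec_eval-style dict-of-dicts convention common to trec_eval tooling (an assumption, not confirmed by this README):

# Plausible input shapes for om.Metric, assuming the pytrec_eval-style
# dict-of-dicts convention (an assumption; check the OpenMatch docs):
qrels = {
    "q1": {"d1": 1, "d2": 0},          # human relevance labels per query
}
ranking_list = {
    "q1": {"d1": 12.3, "d2": 7.8},     # model scores per retrieved document
}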

Experiments

* Ad-hoc Search

Retriever | Reranker     | Coor-Ascent | ClueWeb09 | Robust04 | ClueWeb12
SDM       | KNRM         | -           | 0.1880    | 0.3016   | 0.0968
SDM       | Conv-KNRM    | -           | 0.1894    | 0.2907   | 0.0896
SDM       | EDRM         | -           | 0.2015    | 0.2993   | 0.0937
SDM       | TK           | -           | 0.2306    | 0.2822   | 0.0966
SDM       | BERT Base    | -           | 0.2701    | 0.4168   | 0.1183
SDM       | ELECTRA Base | -           | 0.2861    | 0.4668   | 0.1078

* MS MARCO Passage Ranking

Retriever | Reranker      | Coor-Ascent | dev   | eval
BM25      | BERT Base     | -           | 0.349 | 0.345
BM25      | ELECTRA Base  | -           | 0.352 | 0.344
BM25      | RoBERTa Large | -           | 0.386 | 0.375
BM25      | ELECTRA Large | -           | 0.388 | 0.376

* MS MARCO Document Ranking

Retriever Reranker Coor-Ascent dev eval
ANCE FirstP - - 0.373 0.334
ANCE MaxP - - 0.383 0.342
ANCE FirstP+BM25 BERT Base FirstP + 0.431 0.380
ANCE MaxP BERT Base MaxP + 0.432 0.391

Contribution

Thanks to all the people who contributed to OpenMatch!

Kaitao Zhang, Si Sun, Zhenghao Liu, Aowei Lu

Project Organizers

  • Zhiyuan Liu
  • Chenyan Xiong
  • Maosong Sun