yolanda93 / information_retrieval_system

Licence: other
The goal of this project is to implement a basic information retrieval system using Python, NLTK and GenSIM.

Information Retrieval System

The goal of this project is to implement an information retrieval system using Python, NLTK and GenSIM.

To build this system, a plain-text file, MED.ALL, is provided. It contains many documents related to the life sciences. Each document consists of two fields, .I and .W: the .I field holds a numeric ID that identifies the document, and the .W field holds the text of the document.
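A minimal sketch of how such a file could be read is shown here. It assumes each record starts with a line of the form `.I <id>`, followed by a line containing only `.W`, followed by the text lines; that layout, and the function name `parse_collection`, are assumptions for illustration rather than the project's actual loader.

```python
import re


def parse_collection(text):
    """Return {doc_id: body} from text made of '.I <id>' / '.W' records.

    Assumed layout (not confirmed by the project):
        .I 1
        .W
        first line of text
        second line of text
    """
    documents = {}
    doc_id = None
    in_body = False
    for line in text.splitlines():
        id_match = re.match(r"^\.I\s+(\d+)\s*$", line)
        if id_match:
            # A new record starts: remember its ID and wait for '.W'.
            doc_id = int(id_match.group(1))
            documents[doc_id] = []
            in_body = False
        elif line.strip() == ".W":
            in_body = True
        elif in_body and doc_id is not None:
            documents[doc_id].append(line.strip())
    # Join the body lines of each document into a single string.
    return {i: " ".join(part for part in lines if part)
            for i, lines in documents.items()}


sample = ".I 1\n.W\nfirst document,\nsecond line.\n.I 2\n.W\nanother document.\n"
print(parse_collection(sample))
```

For the real collection, the same function could be called on the contents of MED.ALL, for example `parse_collection(open("MED.ALL").read())`.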

Eight versions of the IR system are implemented (numbered 0 to 7 below), each using a different weighting scheme or model to build the vectors that represent documents and queries.

The program provides an entry point that lets the user launch queries and choose which of the implemented models to use.

Methods used to build the vectors:

(0) Standard Boolean

(1) TF weights

(2) TF-IDF weights

(3) Latent Dirichlet Allocation

(4) Latent Dirichlet Allocation Multicore

(5) Latent Semantic Indexing

(6) Random Projections

(7) Log Entropy Model
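To make methods (1) and (2) concrete, here is a small standard-library sketch of TF and TF-IDF vectors ranked by cosine similarity. The project itself builds its models with Gensim; this sketch does not use Gensim, and the function names, the `log(N / df)` form of the IDF, and the cosine ranking are illustrative assumptions.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Raw term-frequency weights: term -> number of occurrences."""
    return dict(Counter(tokens))


def idf_table(tokenized_docs):
    """Inverse document frequency log(N / df) for every term in the collection."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """TF-IDF weights; terms that never occur in the collection are dropped."""
    return {t: f * idf[t] for t, f in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, tokenized_docs):
    """Return (doc_index, score) pairs sorted by decreasing TF-IDF cosine score."""
    idf = idf_table(tokenized_docs)
    query = tfidf_vector(query_tokens, idf)
    scores = [(i, cosine(query, tfidf_vector(doc, idf)))
              for i, doc in enumerate(tokenized_docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [["blood", "pressure"], ["breast", "neoplasms"], ["blood", "steroids"]]
print(rank(["breast"], docs))
```

Tokenisation, stop-word removal and stemming (which NLTK would typically handle) are left out to keep the example short.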

Usage:

Script parameters:

Usage example:


![Alt text](https://github.com/yolanda93/information_retrieval_system/blob/master/documents/images/Boolean%20Model.png "Usage example")

Implementation

Standard Boolean Model

The Standard Boolean Model is among the most widely adopted information retrieval models. It is based on Boolean logic and classical set theory.

The implementation follows classical set theory: the query text is divided into phrases, and each phrase is then searched for "or" operators.

Algorithm steps

  0. Check whether the query contains "or" operators. If not, go to step 5.
  1. Split the query text into phrases at the "." tokens. The resulting phrases form the AND operator set.
  2. Within each phrase, search for the "or" token and split again. The resulting pieces form the OR operator set.
  3. If there are several AND operator sets to query (AND_operator_sets > 1), go to step 5. (An AND operator set is treated as a single query in which all terms must appear in the text.)
  4. For each element of the OR operator set, go to step 5. (The element is split using "or" as the separator, and each resulting phrase is executed as a single query.)
  5. Execute the query and append the matching documents to the final result if they are not already in it.

Example queries

"blood or urinary steroids in human breast or prostatic neoplasms."

Algorithm result

  1. S; S = "blood or urinary steroids in human breast or prostatic neoplasms."; length of AND_operator_set == 1
  2. S = AvBvC; A = "blood", B = "urinary steroids in human breast", C = "prostatic neoplasms."
  3. AND_operator_set == 1
  4. Look for document matches that are not already in the result.
  5. Result = A_doc_matches + B_doc_matches + C_doc_matches
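The splitting and merging described above can be sketched as follows. This is a simplified illustration, not the project's code: the names `split_query`, `matches` and `boolean_search` are hypothetical, every sub-query is treated as a conjunctive query (all of its terms must occur in the document), and results from multiple phrases are simply merged, which is an assumption about a case the steps above do not spell out.

```python
import re


def split_query(query):
    """Split a query into phrases on '.', then each phrase into 'or' alternatives."""
    phrases = [p.strip() for p in query.split(".") if p.strip()]
    return [
        [alt.strip() for alt in re.split(r"\s+or\s+", phrase, flags=re.IGNORECASE)
         if alt.strip()]
        for phrase in phrases
    ]


def matches(sub_query, text):
    """True if every term of the sub-query occurs as a word in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    terms = re.findall(r"[a-z0-9]+", sub_query.lower())
    return all(term in words for term in terms)


def boolean_search(query, documents):
    """Run each sub-query and collect matching document IDs without duplicates."""
    results = []
    for alternatives in split_query(query):
        for sub_query in alternatives:
            for doc_id, text in documents.items():
                if doc_id not in results and matches(sub_query, text):
                    results.append(doc_id)
    return results


query = "blood or urinary steroids in human breast or prostatic neoplasms."
print(split_query(query))
# [['blood', 'urinary steroids in human breast', 'prostatic neoplasms']]
```

Unlike the worked example, the sketch drops the trailing "." from the last alternative, and it does not remove stop words such as "in".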

Evaluating IR Systems

The performance of the generated information retrieval models is evaluated by comparing their average precision/recall curves for the 30 queries provided in the file MED.QRY.

A third file, MED.REL, is provided. It contains the relevance assessments for each query in MED.QRY. Each line of this file has 4 columns, of which only the first and the third are relevant for our purpose: the first column identifies a query, and the third column identifies a document. Therefore the line:

1 0 13 1

indicates that the document with ID 13 is relevant to query 1 (the second and fourth columns must be ignored). Another example:

2 0 296 1

indicates that document #296 is relevant to query #2.
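A short sketch of reading lines in this format is given here. It keeps only the first and third columns, as described above; the function name `parse_relevance` is a hypothetical choice.

```python
from collections import defaultdict


def parse_relevance(lines):
    """Map each query ID to the set of document IDs judged relevant to it.

    Only columns 1 (query ID) and 3 (document ID) are used; columns 2 and 4
    are ignored. Lines with fewer than four columns are skipped.
    """
    relevant = defaultdict(set)
    for line in lines:
        columns = line.split()
        if len(columns) < 4:
            continue
        query_id, doc_id = int(columns[0]), int(columns[2])
        relevant[query_id].add(doc_id)
    return dict(relevant)


print(parse_relevance(["1 0 13 1", "2 0 296 1"]))
# {1: {13}, 2: {296}}
```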

For further information on how to create the average Precision/Recall curves please see the document Evaluation_Measures.pdf
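As a rough illustration of how such a curve could be computed, the sketch below derives precision/recall points from a ranking and averages interpolated precision over several queries. The 11-point interpolation used here is a common convention and an assumption on my part; the exact procedure the project follows is the one in Evaluation_Measures.pdf, which is not reproduced here. All function names are hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) after each relevant document in the ranking."""
    points = []
    hits = 0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    return points


def interpolated_precision(points, levels=11):
    """Interpolated precision at evenly spaced recall levels 0.0 .. 1.0.

    At each level, take the highest precision observed at any recall
    greater than or equal to that level (0.0 if there is none).
    """
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates, default=0.0))
    return curve


def average_curve(curves):
    """Average several per-query curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


points = precision_recall_points([13, 5, 14, 7], {13, 14})
print(points)
print(interpolated_precision(points))
```

With one curve per query in MED.QRY, `average_curve` would give a single averaged curve per model, which is what the comparison between models needs.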

Rocchio's relevance feedback scheme

Rocchio's relevance feedback scheme lets the user improve the system's performance by incrementally reformulating the query based on relevance assessments that the user provides.

The scheme is described in the paper "Relevance Feedback in Information Retrieval" (1965) (Documentation)

Steps:

  1. The user launches a query to the system.

  2. The system returns a ranking of the documents according to the query.

  3. Generate a P/R curve that characterises the performance of the system with respect to the query (using the relevance assessments provided in MED.REL).

  4. The user takes the first X documents in the ranking (e.g. X=20) and marks each as relevant or non-relevant according to the relevance assessments in MED.REL.

  5. According to these relevance judgements, the system updates the original query using Rocchio's formula.

  6. The system launches the new query and presents a new ranking.

  7. A new P/R curve is generated and compared to the previous one. Is the system improving in precision and/or recall?

  8. While the user is not satisfied, go to step 4.

Rocchio's formula

![Alt text](https://github.com/yolanda93/information_retrieval_system/blob/master/documents/images/rocchio-formula.png "Rocchio formula")

| Variable | Meaning |
| --- | --- |
| Q_m | Modified query vector |
| Q_o | Original query vector |
| D_j | Related document vector |
| D_k | Non-related document vector |
| a | Original query weight |
| b | Related documents weight |
| c | Non-related documents weight |
| D_r | Set of related documents |
| D_nr | Set of non-related documents |
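One round of feedback can be sketched as follows. The sketch assumes the formula image shows the standard Rocchio update, which is consistent with the variables listed above: Q_m = a·Q_o + b·(1/|D_r|)·ΣD_j − c·(1/|D_nr|)·ΣD_k. The default weights, the dropping of non-positive term weights, and the names `judge_top` and `rocchio` are illustrative assumptions, not values or names taken from the project.

```python
def judge_top(ranking, relevant_ids, x=20):
    """Split the first x ranked document IDs into relevant and non-relevant lists."""
    top = ranking[:x]
    related = [d for d in top if d in relevant_ids]
    non_related = [d for d in top if d not in relevant_ids]
    return related, non_related


def rocchio(query, related_docs, non_related_docs, a=1.0, b=0.75, c=0.15):
    """Return the modified query vector Q_m.

    Vectors are sparse dicts (term -> weight). Assumed form of the update:
        Q_m = a*Q_o + b * mean(related vectors) - c * mean(non-related vectors)
    The defaults for a, b and c are placeholders, not the project's settings.
    """
    modified = {term: a * weight for term, weight in query.items()}
    for docs, factor in ((related_docs, b), (non_related_docs, -c)):
        if not docs:
            continue  # an empty set contributes nothing
        for doc in docs:
            for term, weight in doc.items():
                modified[term] = modified.get(term, 0.0) + factor * weight / len(docs)
    # Assumption: terms whose weight is not positive are dropped from the query.
    return {term: weight for term, weight in modified.items() if weight > 0}


new_query = rocchio(
    {"blood": 1.0},
    related_docs=[{"blood": 1.0, "steroids": 2.0}],
    non_related_docs=[{"urine": 1.0}],
    a=1.0, b=0.5, c=0.25,
)
print(new_query)
# {'blood': 1.5, 'steroids': 1.0}
```

In the loop described in the steps above, `judge_top` would stand in for the user's marking of the first X documents, and the vector returned by `rocchio` would be launched as the new query.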

Contact information

Yolanda de la Hoz Simón. [email protected]
