salesforce / ml4ir

License: Apache-2.0
Machine Learning for Information Retrieval

Programming Languages: Jupyter Notebook, Python, Scala, Java, Makefile, Shell

ml4ir: Machine Learning for Information Retrieval

Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe

ml4ir is an open source library for training and deploying deep learning models for search applications. ml4ir is built on top of python3 and tensorflow 2.x for training and evaluation. It also comes packaged with scala utilities for JVM inference.

ml4ir is designed as modular subcomponents which can each be combined and customized to build a variety of search ML models such as:

  • Learning to Rank
  • Query Auto Completion
  • Document Classification
  • Query Classification
  • Named Entity Recognition
  • Top Results
  • Query2SQL
  • add your application here

Motivation

Search is a complex data space with many different types of ML tasks working on a combination of structured and unstructured data sources. There was no single library that

  • provides an end-to-end training and serving solution for a variety of search applications
  • allows training of models with limited coding expertise
  • allows easy customization to build complex models to tackle the search domain
  • focuses on performance and robustness
  • enables fast prototyping

So, we built ml4ir to do all of the above.

Guiding Principles

Customizable Library

Firstly, we want ml4ir to be an easy-to-use and highly customizable library so that you can build the search application you need. ml4ir allows each of its subcomponents to be overridden, or mixed and matched with other custom modules, to create and deploy models.

Configurable Toolkit

While ml4ir can be used as a library, it also comes prepackaged with all the popular search-based losses, metrics, embeddings, layers, etc. to enable someone with limited tensorflow expertise to quickly load their training data and train models for the task of interest. ml4ir achieves this by following a hybrid approach which allows each subcomponent to be completely controlled through configurations alone. Most search-based ML applications can be built this way.
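
To make the configuration-driven idea concrete, here is a minimal sketch in plain Python. It is not ml4ir's actual API: the `METRICS` registry, the `register` decorator, the `build_metrics` helper, and the config keys are all hypothetical names chosen for illustration. It only shows how a name in a config file could select a prebuilt search metric without the user writing any model code.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping config names to metric functions.
# These names are illustrative and are not ml4ir identifiers.
METRICS: Dict[str, Callable[[List[int]], float]] = {}


def register(name: str):
    """Decorator that adds a metric function to the registry under `name`."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap


@register("mrr")
def reciprocal_rank(ranked_labels: List[int]) -> float:
    """Reciprocal rank of the first relevant (label == 1) result, else 0.

    Averaging this value over many queries gives the mean reciprocal rank.
    """
    for position, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / position
    return 0.0


@register("precision_at_1")
def precision_at_1(ranked_labels: List[int]) -> float:
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return float(bool(ranked_labels) and ranked_labels[0] == 1)


def build_metrics(config: dict) -> Dict[str, Callable[[List[int]], float]]:
    """Look up every metric named in the (hypothetical) config."""
    return {name: METRICS[name] for name in config["metrics"]}


# A config that would normally live in a YAML or JSON file.
config = {"metrics": ["mrr", "precision_at_1"]}
metrics = build_metrics(config)

# Relevance labels of one query's results, in ranked order.
ranked = [0, 0, 1, 0]
scores = {name: fn(ranked) for name, fn in metrics.items()}
print(scores)
```

In this sketch, swapping or adding a metric is a one-line config change, which is the kind of workflow the paragraph above describes.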

Performance First

ml4ir is built using the TFRecord data pipeline, which is the recommended data format for tensorflow data loading. We combine ml4ir's high configurability with out-of-the-box tensorflow data optimization utilities to define model features and build a data pipeline that easily allows training on huge amounts of data. ml4ir also comes packaged with utilities to convert data from CSV and libsvm format to TFRecord.
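
As a rough illustration of the input side of such a conversion, the sketch below parses libsvm-style ranking lines and groups them by query using only the Python standard library. It is not ml4ir's converter. It assumes the common ranking variant of the libsvm format (`<label> qid:<id> <index>:<value> ...` with an optional `#` comment), and the function names are hypothetical. Grouping rows per query before serialization is also an assumption made for this example, and the TFRecord serialization step itself is omitted because it requires tensorflow.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[float, Dict[int, float]]  # (relevance label, sparse features)


def parse_libsvm_line(line: str) -> Tuple[str, Row]:
    """Parse one line of the form '<label> qid:<id> <index>:<value> ...'."""
    content = line.split("#", 1)[0].strip()  # drop any trailing comment
    tokens = content.split()
    label = float(tokens[0])
    qid = None
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":
            qid = value
        else:
            features[int(key)] = float(value)
    if qid is None:
        raise ValueError(f"missing qid in line: {line!r}")
    return qid, (label, features)


def group_by_query(lines: List[str]) -> Dict[str, List[Row]]:
    """Group parsed rows by query id, skipping blank and comment-only lines."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for line in lines:
        if line.split("#", 1)[0].strip():
            qid, row = parse_libsvm_line(line)
            grouped[qid].append(row)
    return dict(grouped)


sample = [
    "1 qid:1 1:0.5 2:0.1 # doc a",
    "0 qid:1 1:0.2 2:0.9",
    "# a comment-only line",
    "1 qid:2 1:0.7",
]
queries = group_by_query(sample)
print(queries)
```

Each per-query group could then be serialized as one record by whichever TFRecord writer is in use.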

Training-Serving Handshake

Because ml4ir is a common library for both training and serving deep learning models, we can build tight integration and fault tolerance into the models that are trained. ml4ir also uses the same configuration files for both training and inference, keeping the end-to-end handshake clean. This allows users to easily plug any feature store (or solr) into ml4ir's serving utilities to deploy models in their production environments.
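
The value of sharing one configuration between training and inference can be sketched as follows. This is a simplified illustration and not ml4ir's feature configuration format: `FEATURE_CONFIG`, its keys, and `prepare_features` are hypothetical. The idea shown is that the same feature definitions used at training time can coerce and default-fill raw values arriving from a feature store at serving time.

```python
from typing import Any, Dict

# Hypothetical feature config shared by training and serving.
# The feature names, dtypes, and defaults are illustrative only.
FEATURE_CONFIG = {
    "query_text": {"dtype": str, "default": ""},
    "page_views": {"dtype": float, "default": 0.0},
    "rank": {"dtype": int, "default": 0},
}


def prepare_features(raw: Dict[str, Any], config: Dict[str, dict]) -> Dict[str, Any]:
    """Coerce raw feature-store values to the configured dtypes.

    Missing or null features fall back to the configured default, and
    features that are not named in the config are dropped.
    """
    prepared = {}
    for name, spec in config.items():
        value = raw.get(name)
        if value is None:
            prepared[name] = spec["default"]
        else:
            prepared[name] = spec["dtype"](value)
    return prepared


# A record as it might arrive from a feature store at inference time.
raw_record = {"query_text": "running shoes", "page_views": "12", "unused": 1}
features = prepare_features(raw_record, FEATURE_CONFIG)
print(features)
```

Because both sides read the same definitions, a feature that is missing at serving time gets a predictable default instead of causing a failure.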

Search Model Hub

The goal of ml4ir is to form a common hub for the most popular deep learning layers, losses, metrics, and embeddings used in the search domain. We've built ml4ir with a focus on quick prototyping with a wide variety of network architectures and optimizations. We encourage contributors to add to ml4ir's arsenal of search deep learning utilities as we continue to do so ourselves.

Continuous Integration

We use CircleCI for running tests. Both jvm and python tests run on each commit and pull request. You can find both the CI pipelines here.

Unit tests can be run from the Python/Java IDEs directly or with the dedicated mvn or python commands.

For integration tests, run one of the following in the jvm directory:

  • mvn verify -Pintegration_tests after enabling your Python environment as described in the python README.md
  • or, if you prefer running the Python training in Docker, mvn verify -Pintegration_tests -DuseDocker

Alternatively, you can repurpose the e2e test to check jvm inference against a custom directory with this command: mvn test -Dtest=TensorFlowInferenceIT#testRankingSavedModelBundleWithCSVData -DbundleLocation=/path/to/my/trained/model -DrunName=myRunName

Documentation

We use sphinx for ml4ir documentation. The documentation is hosted using Read the Docs at ml4ir.readthedocs.io/en/latest.

For python doc strings, please use the numpy docstring format specified here.
