Hironsan / Bertsearch

Licence: MIT
Elasticsearch with BERT for advanced document search.

Programming Languages

python

Elasticsearch meets BERT

Below is a job search example:

[Image: An example of bertsearch]

System architecture

[Image: System architecture]

Requirements

  • Docker
  • Docker Compose >= 1.22.0

Getting Started

1. Download a pretrained BERT model

List of released pretrained BERT models:

  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

The following commands download and unpack BERT-Base, Cased:

$ wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
$ unzip cased_L-12_H-768_A-12.zip

2. Set environment variables

Set the path to the pretrained BERT model and the Elasticsearch index name as environment variables:

$ export PATH_MODEL=./cased_L-12_H-768_A-12
$ export INDEX_NAME=jobsearch

3. Run Docker containers

$ docker-compose up

CAUTION: If possible, assign more than 8GB of memory in Docker's memory configuration, because the BERT container needs a lot of memory.

4. Create index

You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:

  • Settings for the index
  • Mappings for fields in the index
  • Index aliases

For example, to create a jobsearch index with title, text and text_vector fields, run the following command:

$ python example/create_index.py --index_file=example/index.json --index_name=jobsearch

# index.json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {
      "enabled": "true"
    },
    "properties": {
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}

CAUTION: The dims value of text_vector must match the hidden size of the pretrained BERT model (768 for the BERT-Base models listed above, 1024 for BERT-Large).
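
The dims requirement can be checked before the index is created. The sketch below assumes the unzipped model directory contains a bert_config.json file with a hidden_size key, as the released BERT archives do. The function names check_vector_dims and check_files are hypothetical and not part of bertsearch.

```python
import json
from pathlib import Path


def check_vector_dims(index_body: dict, bert_config: dict,
                      field: str = "text_vector") -> int:
    """Return the shared dimension, or raise ValueError on a mismatch."""
    dims = index_body["mappings"]["properties"][field]["dims"]
    hidden = bert_config["hidden_size"]
    if dims != hidden:
        raise ValueError(f"{field} dims={dims} but BERT hidden_size={hidden}")
    return dims


def check_files(index_file: str, model_dir: str) -> int:
    """Load the index definition and <model_dir>/bert_config.json, then compare."""
    index_body = json.loads(Path(index_file).read_text(encoding="utf-8"))
    config_path = Path(model_dir) / "bert_config.json"  # assumed file name
    bert_config = json.loads(config_path.read_text(encoding="utf-8"))
    return check_vector_dims(index_body, bert_config)
```

For the files in this walkthrough, the call would be check_files("example/index.json", "./cased_L-12_H-768_A-12").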

5. Create documents

Once you have created an index, you're ready to index some documents. The key step is to convert each document into a vector using BERT; the resulting vector is stored in the text_vector field. Convert your data into JSON documents with the following command:

$ python example/create_documents.py --data=example/example.csv --index_name=jobsearch

# example/example.csv
"Title","Description"
"Saleswoman","lorem ipsum"
"Software Developer","lorem ipsum"
"Chief Financial Officer","lorem ipsum"
"General Manager","lorem ipsum"
"Network Administrator","lorem ipsum"

When the script finishes, you get JSON documents like the following:

# documents.jsonl
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Saleswoman", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Software Developer", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Chief Financial Officer", "text_vector": [...]}
...
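
As a rough illustration of the CSV-to-JSON conversion, the sketch below maps each CSV row to one action with the keys shown above. It is not the code of example/create_documents.py. The names rows_to_actions and fake_embed are hypothetical, fake_embed is a stand-in that returns a zero vector instead of a real BERT embedding, and embedding the Description column is an assumption.

```python
import csv
import io
import json
from typing import Callable, Iterable, Iterator, List


def rows_to_actions(rows: Iterable[dict], index_name: str,
                    embed: Callable[[str], List[float]]) -> Iterator[dict]:
    """Turn CSV rows into index actions carrying a text_vector field."""
    for row in rows:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "text": row["Description"],
            "title": row["Title"],
            # Assumption: the vector is computed from the description text.
            "text_vector": embed(row["Description"]),
        }


def fake_embed(text: str) -> List[float]:
    """Placeholder only: a real pipeline would return the BERT vector."""
    return [0.0] * 768


sample = io.StringIO('"Title","Description"\n"Saleswoman","lorem ipsum"\n')
actions = list(rows_to_actions(csv.DictReader(sample), "jobsearch", fake_embed))
# One JSON object per line, matching the documents.jsonl layout.
lines = [json.dumps(action) for action in actions]
```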

6. Index documents

After converting your data into JSON, you can add the JSON documents to the specified index and make them searchable:

$ python example/index_documents.py
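
A minimal sketch of how a JSON Lines file like documents.jsonl can be read back as a stream of actions. The function name load_actions is hypothetical, and whether example/index_documents.py works this way is an assumption.

```python
import json
from typing import Iterator


def load_actions(path: str) -> Iterator[dict]:
    """Yield one action per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

With the official elasticsearch Python client, a generator like this can be passed to elasticsearch.helpers.bulk(client, load_actions("documents.jsonl")), which understands the _op_type and _index keys in each action.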

7. Open browser

Go to http://127.0.0.1:5000.
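
The query sent by the web application is not shown here. One plausible way to search a dense_vector field is Elasticsearch's script_score query with the built-in cosineSimilarity function, sketched below. The function name build_query is hypothetical. The script passes the field name as a string, which is the form used by newer Elasticsearch releases; early 7.x releases use doc['text_vector'] instead.

```python
from typing import List


def build_query(query_vector: List[float], size: int = 10) -> dict:
    """Build a search body that ranks documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative, which
                    # Elasticsearch requires of script scores.
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
        # Return only the readable fields, not the stored vector.
        "_source": {"includes": ["title", "text"]},
    }
```

The query_vector would be the BERT embedding of the user's search text, with the same length as the dims value of the text_vector field.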
