All Projects → md-experiments → elastic_transformers

md-experiments / elastic_transformers

Licence: Apache-2.0 license
Making BERT stretchy. Semantic Elasticsearch with Sentence Transformers

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to elastic transformers

Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+2128.1%)
Mutual labels:  transformers, semantic-search
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-85.62%)
Mutual labels:  transformers, sentence-transformers
Introduction-to-Deep-Learning-and-Neural-Networks-Course
Code snippets and solutions for the Introduction to Deep Learning and Neural Networks Course hosted in educative.io
Stars: ✭ 33 (-78.43%)
Mutual labels:  transformers
mlconjug3
A Python library to conjugate verbs in French, English, Spanish, Italian, Portuguese and Romanian (more soon) using Machine Learning techniques.
Stars: ✭ 47 (-69.28%)
Mutual labels:  nlp-machine-learning
code-transformer
Implementation of the paper "Language-agnostic representation learning of source code from structure and context".
Stars: ✭ 130 (-15.03%)
Mutual labels:  transformers
ShortText-Fasttext
ShortText classification
Stars: ✭ 12 (-92.16%)
Mutual labels:  nlp-machine-learning
Chinese-Minority-PLM
CINO: Pre-trained Language Models for Chinese Minority (少数民族语言预训练模型)
Stars: ✭ 133 (-13.07%)
Mutual labels:  transformers
Naive-Bayes-Evening-Workshop
Companion code for Introduction to Python for Data Science: Coding the Naive Bayes Algorithm evening workshop
Stars: ✭ 23 (-84.97%)
Mutual labels:  nlp-machine-learning
transformer generalization
The official repository for our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers". We significantly improve the systematic generalization of transformer models on a variety of datasets using simple tricks and careful considerations.
Stars: ✭ 58 (-62.09%)
Mutual labels:  transformers
lidtk
Language Identification Toolkit
Stars: ✭ 17 (-88.89%)
Mutual labels:  nlp-machine-learning
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+49.67%)
Mutual labels:  transformers
awesome-huggingface
🤗 A list of wonderful open-source projects & applications integrated with Hugging Face libraries.
Stars: ✭ 436 (+184.97%)
Mutual labels:  transformers
Ask2Transformers
A Framework for Textual Entailment based Zero Shot text classification
Stars: ✭ 102 (-33.33%)
Mutual labels:  transformers
Conditional-SeqGAN-Tensorflow
Conditional Sequence Generative Adversarial Network trained with policy gradient, Implementation in Tensorflow
Stars: ✭ 47 (-69.28%)
Mutual labels:  nlp-machine-learning
kex
Kex is a python library for unsupervised keyword extraction from a document, providing an easy interface and benchmarks on 15 public datasets.
Stars: ✭ 46 (-69.93%)
Mutual labels:  nlp-machine-learning
text-classification-transformers
Easy text classification for everyone : Bert based models via Huggingface transformers (KR / EN)
Stars: ✭ 32 (-79.08%)
Mutual labels:  transformers
ginza-transformers
Use custom tokenizers in spacy-transformers
Stars: ✭ 15 (-90.2%)
Mutual labels:  transformers
transformers-lightning
A collection of Models, Datasets, DataModules, Callbacks, Metrics, Losses and Loggers to better integrate pytorch-lightning with transformers.
Stars: ✭ 45 (-70.59%)
Mutual labels:  transformers
BottleneckTransformers
Bottleneck Transformers for Visual Recognition
Stars: ✭ 231 (+50.98%)
Mutual labels:  transformers
pysentimiento
A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks
Stars: ✭ 274 (+79.08%)
Mutual labels:  transformers

ElasticTransformers

Semantic Elasticsearch with Sentence Transformers. We will use the power of Elastic and the magic of BERT to index a million articles and perform lexical and semantic search on them.

The purpose is to provide an ease-of-use way of setting up your own Elasticsearch with near state of the art capabilities of contextual embeddings / semantic search using NLP transformers.

Overview

The above setup works as follows

  • Set up an Elasticsearch server with Dockers
  • Collect the dataset
  • Use sentence-transformers to index them onto Elastic (takes about 3 hrs on 4 CPU cores)
  • Look at some comparison examples between lexical and semantic search

Setup

Set up your environment

My environment is called et and I use conda for this. Navigate inside the project directory

conda create --name et python=3.7  
conda install -n et nb_conda_kernels
conda activate et
pip install -r requirements.txt

Get the data

For this tutorial I am using A Million News Headlines by Rohk and place it in the data folder inside the project dir.

    elastic_transformers/
    ├── data/

You will find that the steps are otherwise pretty abstracted so you can also do this with your dataset of choice

Elasticsearch with Docker

Follow the instructions on setting up Elastic with Docker from Elastic's page here For this tutorial, you only need to run the two steps:

Features

The repo introduces the ElasiticTransformers class. Utilities which help create, index and query Elasticsearch indices which include embeddings

Initiate the connection links as well as (optionally) the name of the index to work with

et=ElasticTransformers(url='http://localhost:9300',index_name='et-tiny')

create_index_spec define mapping for the index. Lists of relevant fields can be provided for keyword search or semantic (dense vector) search. It also has parameters for the size of the dense vector as those can vary create_index - uses the spec created earlier to create an index ready for search

et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
et.create_index()

write_large_csv - breaks up a large csv file into chunks and iteratively uses a predefined embedding utility to create the embeddings list for each chunk and subsequently feed results to the index

et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')

search - allows to select either keyword (‘match’ in Elastic) or semantic (dense in Elastic) search. Notably it requires the same embedding function used in write_large_csv

et.search(query='search these terms',
          field='headline_text',
          type='match',
          embedder=embed_wrapper, 
          size = 1000)

Usage

After successful setup, use the folling notebooks to make this all work

References

This repo combines together the following amazing works by brilliant people. Please check out their work if you haven't done so yet...

The ML part

The engineering part

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].