DiceTechJobs / Conceptualsearch

License: Apache-2.0
Train a Word2Vec or LSA model and implement conceptual search / semantic search in Solr/Lucene - Simon Hughes, Dice.com (Dice Tech Jobs)

Projects that are alternatives to, or similar to, Conceptualsearch

Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+47.76%)
Mutual labels:  search-engine, information-retrieval, solr
Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (-71.02%)
Mutual labels:  search-engine, information-retrieval, solr
Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+1621.22%)
Mutual labels:  search-engine, information-retrieval, solr
solr
Apache Solr open-source search software
Stars: ✭ 651 (+165.71%)
Mutual labels:  search-engine, information-retrieval, solr
Relevancyfeedback
Dice.com's relevancy feedback solr plugin created by Simon Hughes (Dice). Contains request handlers for doing MLT style recommendations, conceptual search, semantic search and personalized search
Stars: ✭ 19 (-92.24%)
Mutual labels:  search-engine, information-retrieval, solr
Imageclassification
Deep Learning: Image classification, feature visualization and transfer learning with Keras
Stars: ✭ 83 (-66.12%)
Mutual labels:  jupyter-notebook, search-engine
Solrplugins
Dice Solr Plugins from Simon Hughes Dice.com
Stars: ✭ 86 (-64.9%)
Mutual labels:  information-retrieval, solr
Ik Analyzer
Supports Lucene 5/6/7/8+ versions; maintained long-term.
Stars: ✭ 112 (-54.29%)
Mutual labels:  search-engine, solr
Querqy
Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)
Stars: ✭ 122 (-50.2%)
Mutual labels:  search-engine, solr
Covid 19 Bert Researchpapers Semantic Search
BERT semantic search engine for searching literature research papers for coronavirus covid-19 in google colab
Stars: ✭ 23 (-90.61%)
Mutual labels:  jupyter-notebook, search-engine
Srchx
A standalone lightweight full-text search engine built on top of blevesearch and Go with multiple storage (scorch, boltdb, leveldb, badger)
Stars: ✭ 118 (-51.84%)
Mutual labels:  search-engine, solr
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (-45.31%)
Mutual labels:  search-engine, information-retrieval
Textrank Keyword Extraction
Keyword extraction using TextRank algorithm after pre-processing the text with lemmatization, filtering unwanted parts-of-speech and other techniques.
Stars: ✭ 79 (-67.76%)
Mutual labels:  jupyter-notebook, information-retrieval
Readingbricks
A structured collection of tagged notes about machine learning theory and practice endowed with search infrastructure that allows users to read requested info only.
Stars: ✭ 90 (-63.27%)
Mutual labels:  jupyter-notebook, search-engine
Awesome Solr
A curated list of Awesome Apache Solr links and resources.
Stars: ✭ 69 (-71.84%)
Mutual labels:  search-engine, solr
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+1291.43%)
Mutual labels:  search-engine, information-retrieval
Sf1r Lite
Search Formula-1: a distributed, high-performance engine for massive data in enterprise/vertical search
Stars: ✭ 158 (-35.51%)
Mutual labels:  search-engine, information-retrieval
Tis Solr
An enterprise search engine based on Apache Solr
Stars: ✭ 158 (-35.51%)
Mutual labels:  search-engine, solr
Bm25
A Python implementation of the BM25 ranking function.
Stars: ✭ 159 (-35.1%)
Mutual labels:  search-engine, information-retrieval
Resin
Hardware-accelerated vector-based search engine. Available as an HTTP service or as an embedded library.
Stars: ✭ 529 (+115.92%)
Mutual labels:  search-engine, information-retrieval

DiceTechJobs - Conceptual Search

Dice Tech Jobs - Dice.com's repository for building a 'Conceptual Search Engine', by Simon Hughes (Dice Data Scientist). This repository contains Python code for training Tomas Mikolov's Word2Vec model on a set of documents. The output of this process can then be embedded in Solr (or some other search engine) using synonym files combined with some Solr plug-ins to provide conceptual search functionality within the search engine. The output could also be used within other search engines, provided they support synonym files. Conceptual search is also known as semantic search; it learns to match on the concepts in a domain rather than on keywords, in order to improve recall.

Please also check out my 'Vectors in Search' repo, which extends this work. It also contains links to the slides and video from that talk.

Description

The scripts include code to pre-process and tokenize documents, extract common terms and phrases based on document frequency, train a Word2Vec model using the gensim implementation, and cluster the resulting word vectors using scikit-learn's clustering libraries. The Python scripts output a number of Solr synonym files which, combined with some custom Dice Solr plugins, enable conceptual search functionality within Solr.

See https://github.com/DiceTechJobs/SolrPlugins for Solr plugins that utilize the learned vectors and synonym files within an Apache Solr search engine.

See https://github.com/DiceTechJobs/SolrConfigExamples for example Solr configuration entries for configuring conceptual search within Solr, including setting up the plugins.

The scripts are provided both as Jupyter Python notebooks, to be run in order (1, 2, 3 and any of the 4's), and as separate command-line scripts (see below) if you don't want to use Jupyter. The Python scripts are cleaner, share common config files containing all required settings, and are designed to be run from the shell, so they are probably the easier place to start. The notebooks and scripts pre-process the documents and train the Word2Vec model. The ./Settings folder contains example config files for each script, with a description of each setting in the comments. To call a command-line script, pass in the related config file as the only parameter, e.g.

python pre_process_documents.py ./Settings/pre_process_documents.cfg

The command line scripts should be run in order:

  1. pre_process_documents.py - strips out unwanted punctuation characters (commas, hyphens, etc.), parses HTML if needed, and separates out the sentences in each document (see the pre-processing sketch below this list). If you wish to skip this step and move straight to 2 or 3, provide steps 2 and 3 with a set of files that already have any punctuation you want removed stripped out, and with every sentence on a separate line.

  2. extract_keywords.py - (optional) If you don't have a good, extensive set of key phrases from your domain (e.g. your top 5,000 search keywords and phrases - the phrases being the important part), or you want to increase coverage beyond such a list, run this script to extract all keywords and phrases above a specified document-frequency threshold.

  3. train_word2vec_model.py - trains and saves the Word2Vec model on the pre-processed documents from step 1, using a set of keywords and phrases such as those output by step 2 (see the training sketch below this list). Please note - training is very fast, but it requires a C compiler to be available and pre-installed so that the optimized C version can be used under the covers; otherwise the much slower pure-Python implementation is used. If a compiler is unavailable, you will get a run-time warning when the model is first trained.

  4. This step consists of multiple scripts, depending on the desired solution (see my talk):

     • Vector output - COMING SOON! See Jupyter notebook 4.a.

     • generate_topn_synonyms_file.py - generates the top n synonyms for each target keyword or phrase (see the synonym-generation sketch below this list). It produces two files: one with payloads and one without. The simplest use case is the file without payloads. Better relevancy can be gained by using the payloads file to weight each synonym by its similarity, which can be done at query time using the queryboost parser. Note that to do this you need to tokenize on commas as well as whitespace at query time, as whitespace is replaced with commas to get around the multi-word synonym issue. Alternatively (and recommended), use synonym expansion at index time, along with the PayloadEdismax query parser and the PayloadAwareDefaultSimilarity class (set as the default similarity, or use schema similarity to configure it per field), and ensure the fieldType for these fields contains the term 'payload' or 'vector'.

     • generate_cluster_synonyms_file.py - generates k clusters from the word vectors produced in the previous steps (see the clustering sketch below this list). These can be embedded directly in Solr via a synonym file - no special plugins needed. I'd recommend generating several sets of synonym clusters of varying sizes, configuring each as a separate field, and applying higher field weights to the smaller clusters (i.e. those generated with a larger k value).
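
To make step 1 concrete, here is a minimal sketch of the kind of pre-processing involved - strip any HTML, drop unwanted punctuation, and write one sentence per line. This is an illustration, not the actual pre_process_documents.py; the directory names and the punctuation set are assumptions:

```python
import os
import re

from bs4 import BeautifulSoup             # pip install beautifulsoup4
from nltk.tokenize import sent_tokenize   # requires nltk's 'punkt' data

PUNCT_RE = re.compile(r"[,;:()\"']")      # characters to strip - adjust to taste

def pre_process(raw_dir, out_dir):
    for name in os.listdir(raw_dir):
        with open(os.path.join(raw_dir, name), encoding='utf-8') as f:
            text = f.read()
        # Parse out any HTML tags; plain text passes through unchanged.
        text = BeautifulSoup(text, 'html.parser').get_text(separator=' ')
        text = PUNCT_RE.sub(' ', text).lower()
        with open(os.path.join(out_dir, name), 'w', encoding='utf-8') as out:
            out.write('\n'.join(sent_tokenize(text)))  # one sentence per line
```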
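
The training in step 3 boils down to a standard gensim call. A minimal sketch against the gensim 4.x API (older versions used size= rather than vector_size=); the file paths and hyper-parameter values are assumptions, not the script's defaults:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-processed sentence per line, tokens separated by whitespace.
# Multi-word phrases from step 2 should already be joined into single
# tokens (e.g. 'machine learning' -> 'machine_learning') at this point.
sentences = LineSentence('processed/sentences.txt')

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=5,      # ignore rare terms
    workers=4,        # parallel training threads
    sg=1,             # 1 = skip-gram, 0 = CBOW
)
model.save('models/word2vec.model')
```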
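
The synonym-generation sketch: the core idea of generate_topn_synonyms_file.py is to look up each target keyword's nearest neighbours in the model and emit them as Solr synonym mappings. This is my illustration rather than the script itself; the file names are assumptions, and the term|weight payload syntax should be checked against the Dice plugins' documentation:

```python
from gensim.models import Word2Vec

model = Word2Vec.load('models/word2vec.model')
keywords = [line.strip() for line in open('keywords.txt', encoding='utf-8')]

with open('synonyms_topn.txt', 'w', encoding='utf-8') as plain, \
     open('synonyms_topn_payloads.txt', 'w', encoding='utf-8') as payloads:
    for word in keywords:
        token = word.replace(' ', '_')  # phrases were trained as single tokens
        if token not in model.wv:
            continue
        top = model.wv.most_similar(token, topn=10)
        # Commas in place of whitespace - the multi-word workaround above.
        plain.write('%s=>%s\n' % (
            word, ','.join(s.replace('_', ',') for s, _ in top)))
        payloads.write('%s=>%s\n' % (
            word, ','.join('%s|%.4f' % (s, sim) for s, sim in top)))
```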
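
Finally, the clustering sketch: k-means over the learned vectors, then a synonym file mapping every term to an artificial token for its cluster, so that terms in the same cluster match each other at search time. Again, the file names are assumptions:

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

model = Word2Vec.load('models/word2vec.model')
terms = model.wv.index_to_key        # the model's vocabulary (gensim 4.x)
vectors = model.wv[terms]            # matrix of shape (vocab size, vector dim)

k = 500  # generate several files with different k values (see above)
labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)

with open('synonyms_clusters_%d.txt' % k, 'w', encoding='utf-8') as out:
    for term, label in zip(terms, labels):
        out.write('%s=>cluster_%d\n' % (term, label))
```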

Required Python libraries:

  • nltk (for the sentence tokenizer in the pre-processing script)
  • beautifulsoup4 (for HTML parsing in the pre-processing script)
  • numpy
  • gensim (for the Word2Vec implementation)
  • scikit-learn (only needed for clustering)
  • jupyter (to use the notebooks; Jupyter is the successor to the IPython notebook)

Built using Python 2.7.10. Untested with Python 3.

Word2Vec

The Word2Vec implementation is that of the excellent gensim package, which also contains fast implementations of LSA, LDA and several other machine learning algorithms.

https://radimrehurek.com/gensim/models/word2vec.html

This is a great package for topic modelling and for learning semantic representations of documents and words.

Using Pre-Trained Word Vectors

Google released a set of pre-trained word vectors, trained on about 100 billion words of the Google News corpus. If you are not focused on a specialized domain but rather on a very broad set of documents - for example, companies building a news search engine (Reuters, Bloomberg, governmental agencies, etc.) - you can just use this pre-trained model instead. You can then skip the first 3 steps and go directly to one of the step 4 scripts above, which take a pre-trained model and compute the output synonym files; that's all you should need. This post describes where to get the pre-trained vectors: https://groups.google.com/forum/#!topic/gensim/_XLEbmoqVCg. A model saved in gensim's own format can be loaded with Word2Vec.load:

model = Word2Vec.load(MODEL_FILE)
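
Note, however, that the Google News vectors are distributed in the original C word2vec binary format, not gensim's native save format, so they load differently. A minimal sketch with a recent gensim (the file name is the one Google distributes; adjust the path as needed):

```python
from gensim.models import KeyedVectors

# The C-format binary loads as KeyedVectors, not via Word2Vec.load().
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
print(vectors.most_similar('engineer', topn=5))
```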

A Note on Synonym File Sizes

If you are using SolrCloud, ZooKeeper does not like any config file to be over 1MB in size. So if your resulting synonym files are larger than this, you will have to either 1) change the default ZooKeeper file size limit (the jute.maxbuffer Java system property, which needs to be set on both ZooKeeper and Solr), 2) split the synonym file into multiple files and apply the synonym filters in sequence, or 3) load synonyms from a database using a plugin (e.g. https://github.com/shopping24/solr-jdbc-synonyms).
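
If you go with option 2, the split is easy to script. A minimal sketch (file names are assumptions) that keeps each chunk safely under the 1MB default, so each chunk can be applied by its own synonym filter in sequence:

```python
MAX_BYTES = 900_000  # stay safely under ZooKeeper's 1MB default

def flush(chunk, n):
    # Write one chunk of synonym lines to its own numbered file.
    with open('synonyms_%d.txt' % n, 'w', encoding='utf-8') as out:
        out.writelines(chunk)

chunk, size, n = [], 0, 0
with open('synonyms.txt', encoding='utf-8') as f:
    for line in f:
        b = len(line.encode('utf-8'))
        if size + b > MAX_BYTES and chunk:
            flush(chunk, n)
            chunk, size, n = [], 0, n + 1
        chunk.append(line)
        size += b
if chunk:
    flush(chunk, n)
```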

Stanford's GloVe Vectors

Stanford's NLP boffins developed a competing word-vector learning algorithm, GloVe, with similar accuracy to Word2Vec. If you want to experiment with it, this Python package will allow you to do so: https://github.com/hans/glove.py. However, I haven't tried it, so I can't vouch for it at this time.

Input File Format

The initial scripts expect a folder containing raw *.txt or HTML files. If you have HTML content, there is logic inside the scripts to parse the HTML, but beautiful soup can be a bit flaky, so you may be better off pre-parsing the files before pushing them through the pipeline. Note that there is no special file format, which seems to be the issue most people have when trying to run these scripts. If the code errors out loading the files, I'd suggest using the Python script rather than the notebook and cracking open the debugger to see what's happening. Also, make sure you set config.file_mask to match the files you want to load, in https://github.com/DiceTechJobs/ConceptualSearch/blob/master/Settings/pre_process_documents.cfg. This defaults to .*.txt (it's a regex, not a file glob), so you will need to change it if your files are not .txt files.
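
The regex-vs-glob distinction trips people up: the glob *.txt corresponds to the regex .*\.txt. A minimal illustration of how a regex file mask selects files (my own sketch, not the script's actual loading code; the folder name is an assumption):

```python
import os
import re

file_mask = re.compile(r'.*\.txt$')  # a regex - the equivalent glob would be '*.txt'
files = [f for f in os.listdir('docs') if file_mask.match(f)]
print(files)
```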

Troubleshooting / Errors

Please post any questions, bugs or feature requests to the issues list, and include an @mention - @simonhughes22 - so that I get a timely email with your question. A few people have emailed me directly with questions about this repo. While I don't mind responding to emails, please submit a GitHub issue instead and @mention me. That way, everyone else can see the question and my response for future reference.

Other Tools / Repos

I recently gave a talk on vector search; the code for it is in my 'Vectors in Search' repo mentioned above, which is a natural extension of this conceptual search work.

Please also check out our Solr plugins: https://github.com/DiceTechJobs/SolrPlugins
