
DiceTechJobs / Vectorsinsearch

Licence: apache-2.0
Dice.com repo to accompany the Dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015.

Programming Languages

python

Projects that are alternatives of or similar to Vectorsinsearch

Ik Analyzer
Supports Lucene 5/6/7/8+ versions, maintained long-term.
Stars: ✭ 112 (+57.75%)
Mutual labels:  search-engine, solr, lucene, elasticsearch
Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+5839.44%)
Mutual labels:  search-engine, information-retrieval, solr, lucene
solr
Apache Solr open-source search software
Stars: ✭ 651 (+816.9%)
Mutual labels:  search-engine, information-retrieval, solr, lucene
Querqy
Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)
Stars: ✭ 122 (+71.83%)
Mutual labels:  search-engine, solr, lucene
Relevancyfeedback
Dice.com's relevancy feedback solr plugin created by Simon Hughes (Dice). Contains request handlers for doing MLT style recommendations, conceptual search, semantic search and personalized search
Stars: ✭ 19 (-73.24%)
Mutual labels:  search-engine, information-retrieval, solr
Srchx
A standalone lightweight full-text search engine built on top of blevesearch and Go with multiple storage (scorch, boltdb, leveldb, badger)
Stars: ✭ 118 (+66.2%)
Mutual labels:  search-engine, solr, elasticsearch
Springboot Templates
Spring Boot integration with Dubbo and Netty; NoSQL templates for Redis and MongoDB; MQ templates for Kafka, RocketMQ and RabbitMQ; Solr, SolrCloud and Elasticsearch query engines
Stars: ✭ 100 (+40.85%)
Mutual labels:  solr, lucene, elasticsearch
Conceptualsearch
Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs
Stars: ✭ 245 (+245.07%)
Mutual labels:  search-engine, information-retrieval, solr
Rated Ranking Evaluator
Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures
Stars: ✭ 134 (+88.73%)
Mutual labels:  search-engine, information-retrieval, elasticsearch
navec
Compact high quality word embeddings for Russian language
Stars: ✭ 118 (+66.2%)
Mutual labels:  word2vec, glove, quantization
SolrConfigExamples
Examples of Solr configuration entries for Solr plugins and Conceptual Search\Semantic Search from Simon Hughes Dice.com
Stars: ✭ 26 (-63.38%)
Mutual labels:  information-retrieval, solr, lucene
RelevancyTuning
Dice.com tutorial on using black box optimization algorithms to do relevancy tuning on your Solr Search Engine Configuration from Simon Hughes Dice.com
Stars: ✭ 28 (-60.56%)
Mutual labels:  information-retrieval, solr, lucene
Fast Elasticsearch Vector Scoring
Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Stars: ✭ 304 (+328.17%)
Mutual labels:  vector, lucene, elasticsearch
Solrplugins
Dice Solr Plugins from Simon Hughes Dice.com
Stars: ✭ 86 (+21.13%)
Mutual labels:  information-retrieval, solr, lucene
Haystack
🔍 Haystack is an open source NLP framework that leverages Transformer models. It enables developers to implement production-ready neural search, question answering, semantic document search and summarization for a wide range of applications.
Stars: ✭ 3,409 (+4701.41%)
Mutual labels:  search-engine, information-retrieval, elasticsearch
Code4java
Repository for my java projects.
Stars: ✭ 164 (+130.99%)
Mutual labels:  solr, lucene, elasticsearch
lucene
Apache Lucene open-source search software
Stars: ✭ 1,009 (+1321.13%)
Mutual labels:  search-engine, information-retrieval, lucene
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+409.86%)
Mutual labels:  search-engine, information-retrieval, solr
Fess
Fess is very powerful and easily deployable Enterprise Search Server.
Stars: ✭ 561 (+690.14%)
Mutual labels:  search-engine, lucene, elasticsearch
Funpyspidersearchengine
Word2vec personalized search (results tailored to each user) + Scrapy 2.3.0 (data crawling) + ElasticSearch 7.9.1 (stores the data and exposes a RESTful API) + Django 3.1.1 search
Stars: ✭ 782 (+1001.41%)
Mutual labels:  search-engine, elasticsearch

Vectors in Search

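Dice.com code for implementing the ideas discussed in the following talks:

  • 'Vectors in Search', Simon Hughes, Activate 2018 search conference
  • 'Searching with Vectors', Haystack 2019 (US)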

This extends my earlier work on 'Conceptual Search', which can be found at https://github.com/DiceTechJobs/ConceptualSearch (including slides and video links). In these talks, I present a number of different approaches for searching vectors at scale using an inverted index. This repo implements several approaches to approximate k-nearest neighbor search, including:

  • LSH (using the Sim Hash)
  • K-Means Tree
  • Vector Thresholding

and describes how these ideas can be implemented and queried efficiently within an inverted index.
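As a rough illustration of how two of these approaches can turn a dense vector into tokens for an inverted index, here is a minimal sketch. It is not the repo's code: the function names, the token formats (`b3_1`, `d2_n`) and the use of random Gaussian hyperplanes for the sim hash are all assumptions made for this example.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (assumed sim hash variant)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def sim_hash_tokens(vector, hyperplanes):
    """Emit one token per hyperplane: the bit position plus the side the vector falls on."""
    tokens = []
    for i, plane in enumerate(hyperplanes):
        dot = sum(v * p for v, p in zip(vector, plane))
        bit = 1 if dot >= 0 else 0
        tokens.append(f"b{i}_{bit}")  # hypothetical token format
    return tokens

def threshold_tokens(vector, threshold):
    """Keep only components with |value| >= threshold; the token encodes index and sign."""
    tokens = []
    for i, v in enumerate(vector):
        if abs(v) >= threshold:
            sign = "p" if v > 0 else "n"
            tokens.append(f"d{i}_{sign}")  # hypothetical token format
    return tokens

planes = make_hyperplanes(num_bits=8, dim=4)
doc_vector = [0.9, -0.05, -0.7, 0.1]
hash_tokens = sim_hash_tokens(doc_vector, planes)           # 8 tokens, one per bit
sparse_tokens = threshold_tokens(doc_vector, threshold=0.5)  # ['d0_p', 'd2_n']
```

Vectors that point in similar directions share most of their sim hash tokens, so an OR query over the query vector's tokens ranks near neighbors highly by token overlap. The query vector must be hashed with the same hyperplanes as the indexed documents.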

UPDATE: Trey Grainger and Erik Hatcher from LucidWorks recommended using term frequencies in place of payloads for the solutions where I embed term weights in the index. This would also remove the need for the special payload-aware similarity function. Payloads incur a significant performance penalty. The challenge with this approach is negative weights: I assume it is not possible to encode negative term frequencies, but this can be worked around by using different tokens for positively and negatively weighted terms and making matching adjustments at query time (where negative boosts can be applied in Solr as needed).
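The workaround for negative weights can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: the `pos`/`neg` token prefixes, the `scale` factor used to turn float weights into integer frequencies, and the `token|tf` field format (the default delimiter of Lucene's delimited term frequency filter) are choices made for this example, and `emulate_score` only mimics a similarity that is linear in term frequency.

```python
def encode_weights_as_tf(weights, scale=100, delimiter="|"):
    """Encode each non-zero weight as '<sign-prefixed token>|<integer tf>'."""
    parts = []
    for i, w in enumerate(weights):
        tf = int(round(abs(w) * scale))  # term frequencies must be positive integers
        if tf == 0:
            continue
        prefix = "pos" if w > 0 else "neg"  # hypothetical token naming
        parts.append(f"{prefix}{i}{delimiter}{tf}")
    return " ".join(parts)

def query_token_boosts(query_weights):
    """Boost the 'pos' token by q and the 'neg' token by -q for each dimension."""
    boosts = {}
    for i, q in enumerate(query_weights):
        if q != 0:
            boosts[f"pos{i}"] = q
            boosts[f"neg{i}"] = -q
    return boosts

def emulate_score(field_value, boosts, scale=100, delimiter="|"):
    """Emulate a score that is linear in tf: sum of boost * tf / scale over matches."""
    score = 0.0
    for part in field_value.split():
        token, tf = part.split(delimiter)
        score += boosts.get(token, 0.0) * int(tf) / scale
    return score

field_value = encode_weights_as_tf([0.25, -0.5, 0.0])  # 'pos0|25 neg1|50'
boosts = query_token_boosts([1.0, 2.0, 3.0])
score = emulate_score(field_value, boosts)  # equals the dot product, -0.75
```

With this encoding, the emulated score reproduces the dot product of the query and document vectors up to the rounding introduced by `scale`. A real similarity class would need to score linearly in term frequency for the same to hold in Solr.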

Lucene Documentation: Lucene Delimited Term Frequency Filter

There has also been a recent update to Lucene core that is applicable here and, at the time of writing, is soon to make its way into Elasticsearch: Block-Max WAND. It produces a significant speed-up for large boolean OR queries where you don't need the exact number of results and only care about getting the top-N results as fast as possible. All of the approaches I discuss here generate relatively large OR queries, so this is very relevant. I have also read that the current implementation of minimum-should-match includes similar optimizations, so the same sort of performance gain may already be attainable with appropriate mm settings, something I was already experimenting with in my code.
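To make the shape of such a query concrete, here is a small sketch that builds request parameters for a large OR query with a minimum-should-match setting. The field name `vector_tokens`, the example tokens and the `50%` value are hypothetical. `defType`, `qf`, `q.op`, `mm`, `rows` and `fl` are standard Solr edismax and common query parameters, but whether a given Solr version applies the optimizations described above is not something this sketch can confirm.

```python
from urllib.parse import urlencode

def build_or_query_params(tokens, field="vector_tokens", mm="50%", rows=10):
    """Build Solr request parameters for a top-N OR query over vector tokens."""
    return {
        "q": " ".join(tokens),  # one optional clause per token
        "defType": "edismax",
        "qf": field,            # hypothetical field holding the indexed tokens
        "q.op": "OR",
        "mm": mm,               # require at least this share of clauses to match
        "rows": rows,           # only the top-N results are needed
        "fl": "id,score",
    }

params = build_or_query_params(["b0_1", "b1_0", "b2_1", "b3_1"])
query_string = urlencode(params)  # append to the collection's /select URL
```

Raising `mm` trades recall for speed, since documents matching only a few of the query's tokens are skipped.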

Directory Structure

  • python
    • Code for implementing the k-means tree, LSH sim hash and vector thresholding algorithms, and for indexing and searching vectors in Solr using these techniques.
  • solr_plugins
    • Java code for implementing the custom similarity classes and payloadEdismax parser described in the talk.
  • solr_configs
    • XML snippets for importing the Solr plugins from the 'solr_vectors_in_search_plugins' Java code.
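For readers unfamiliar with the k-means tree mentioned under the python directory, the following is a minimal pure-Python sketch of the general idea, assuming a hierarchical k-means where each vector is indexed under one token per level of its path through the tree. It is not taken from the repo: all function names, the `n_0_1` path token format and the simple Lloyd's iteration are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to vector."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vector, centroids[c]))

def k_means(vectors, k, iters=10, seed=0):
    """Plain Lloyd's algorithm with randomly sampled initial centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def build_tree(vectors, k, depth, seed=0):
    """Recursively cluster the vectors; returns None once depth or data runs out."""
    if depth == 0 or len(vectors) < k:
        return None
    centroids = k_means(vectors, k, seed=seed)
    groups = [[] for _ in range(k)]
    for v in vectors:
        groups[nearest(v, centroids)].append(v)
    children = [build_tree(g, k, depth - 1, seed) for g in groups]
    return {"centroids": centroids, "children": children}

def path_tokens(vector, tree):
    """Descend to the nearest centroid at each level, emitting one token per level."""
    tokens, path = [], "n"
    while tree is not None:
        c = nearest(vector, tree["centroids"])
        path += f"_{c}"  # hypothetical token format, e.g. 'n_0_1'
        tokens.append(path)
        tree = tree["children"][c]
    return tokens

vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
tree = build_tree(vectors, k=2, depth=2)
doc_tokens = path_tokens([0, 1], tree)
query_tokens = path_tokens([0, 0], tree)
```

Indexing `doc_tokens` and searching with `query_tokens` means that vectors in the same branch match on the shallow tokens and, when they are closer still, on the deeper ones as well.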

Implementation Details

  • Solr Version - 7.5
  • Python Version - 3.x+ (3.5 used)

Links to Talks

Author

Simon Hughes ( Chief Data Scientist, Dice.com )
