DiceTechJobs / Solrplugins

Licence: apache-2.0
Dice Solr Plugins from Simon Hughes Dice.com

Programming Languages

java

Projects that are alternatives of or similar to Solrplugins

RelevancyTuning
Dice.com tutorial on using black box optimization algorithms to do relevancy tuning on your Solr Search Engine Configuration from Simon Hughes Dice.com
Stars: ✭ 28 (-67.44%)
Mutual labels:  information-retrieval, solr, lucene
solr
Apache Solr open-source search software
Stars: ✭ 651 (+656.98%)
Mutual labels:  information-retrieval, solr, lucene
SolrConfigExamples
Examples of Solr configuration entries for Solr plugins and Conceptual Search\Semantic Search from Simon Hughes Dice.com
Stars: ✭ 26 (-69.77%)
Mutual labels:  information-retrieval, solr, lucene
Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (-17.44%)
Mutual labels:  information-retrieval, solr, lucene
Lucene Solr
Apache Lucene and Solr open-source search software
Stars: ✭ 4,217 (+4803.49%)
Mutual labels:  information-retrieval, solr, lucene
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (-27.91%)
Mutual labels:  information-retrieval, lucene
jease
Jease is a Java CMS framework based on Object Database
Stars: ✭ 25 (-70.93%)
Mutual labels:  solr, lucene
jstarcraft-nlp
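Focuses on several core problems in natural language processing: lexical analysis, syntactic analysis, semantic analysis, language detection, information extraction, text clustering, and text classification. Provides a complete general-purpose design and reference implementation for developers in these fields. Covers a variety of NLP algorithms and adapts to multiple NLP frameworks. Compatible with Lucene/Solr/ElasticSearch plugins.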
Stars: ✭ 92 (+6.98%)
Mutual labels:  solr, lucene
nlpir-analysis-cn-ictclas
Lucene/Solr Analyzer Plugin. Supports MacOS, Linux x86/64, and Windows x86/64. It is a Maven project, so you can change the Lucene/Solr version to match the release you need.
Stars: ✭ 71 (-17.44%)
Mutual labels:  solr, lucene
lucene
Apache Lucene open-source search software
Stars: ✭ 1,009 (+1073.26%)
Mutual labels:  information-retrieval, lucene
Ik Analyzer Solr
ik-analyzer for solr 7.x-8.x
Stars: ✭ 1,017 (+1082.56%)
Mutual labels:  solr, lucene
Hanlp Lucene Plugin
HanLP Chinese word segmentation plugin for Lucene; supports Lucene-based systems, including Solr
Stars: ✭ 272 (+216.28%)
Mutual labels:  solr, lucene
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (+320.93%)
Mutual labels:  information-retrieval, solr
Conceptualsearch
Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs
Stars: ✭ 245 (+184.88%)
Mutual labels:  information-retrieval, solr
solr-container
Ansible Container project that manages the lifecycle of Apache Solr on Docker.
Stars: ✭ 17 (-80.23%)
Mutual labels:  solr, lucene
Code4java
Repository for my java projects.
Stars: ✭ 164 (+90.7%)
Mutual labels:  solr, lucene
Fxdesktopsearch
A JavaFX based desktop search application.
Stars: ✭ 147 (+70.93%)
Mutual labels:  solr, lucene
Ik Analyzer
Supports Lucene 5/6/7/8+; maintained long-term.
Stars: ✭ 112 (+30.23%)
Mutual labels:  solr, lucene
Querqy
Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)
Stars: ✭ 122 (+41.86%)
Mutual labels:  solr, lucene
Anserini
A Lucene toolkit for replicable information retrieval research
Stars: ✭ 573 (+566.28%)
Mutual labels:  information-retrieval, lucene

DiceTechJobs - Solr Plugins

A Dice Tech Jobs repository for Dice.com's Solr plugins. Most extend or build upon the core Solr and Lucene libraries (kudos to the original contributors and the ASF) with additional functionality we've found useful for certain tasks. The code builds on the original Solr and Lucene source code (version 4.6.1), so please note the Apache license. Please see the branches for versions built against other Solr versions, including 6.3.

Documentation

Please contact [email protected] with any questions you have.

Included Functionality:

  • Plugins necessary for Conceptual Search Implementation (see Lucene Revolution 2015 talk - http://lucenerevolution.org/sessions/implementing-conceptual-search-in-solr-using-lsa-and-word2vec/)
    • Custom query parsers: VectorQParser (for handling dense vector fields), QueryBoostingQParser (weighted synonym term expansion at query time)
      • Important: these query parsers handle the Solr multi-word synonym problem by replacing spaces with commas before query analysis. Your query analysis pipeline for these fields must tokenize on commas as well as spaces.
    • Custom token filters - MeanPayloadTokenFilter (averages payloads over duplicate terms), PayloadQueryBoostTokenFilter (turns a payload in a synonym file into a term boost at query time)
    • See also https://github.com/DiceTechJobs/SolrConfigExamples for example solr xml files
    • See also https://github.com/DiceTechJobs/ConceptualSearch for python scripts to extract common keywords and phrases, train the word2vec model and cluster the resulting word vectors.
  • PayloadAwareExtendedDismaxQParserPlugin
    • Extension of the edismax query parser that includes the mean payload score over each term in addition to term frequency and document frequency when computing a relevancy score. Allows application of a per term weighting at index time so you can apply your own weightings to the same term differently depending on the document, for instance if using a 'learning to rank' approach to improve relevancy, or some implementation of probabilistic information retrieval.
    • Requires a custom similarity class implementation to be payload aware, e.g. dice's PayloadAwareDefaultSimilarity
    • Important: only utilizes payloads for fields whose field type name contains 'payload' or 'vector'.
  • Custom Similarity Classes
    • Use <similarity class="solr.SchemaSimilarityFactory"/> in schema.xml to configure per field similarity class overrides
    • Custom classes include (see https://github.com/DiceTechJobs/SolrPlugins/tree/master/src/main/java/org/dice/solrenhancements/similarity for full list):
      • PayloadAwareDefaultSimilarity - DefaultSimilarity class extended to include payloads in scoring function
      • NoLengthNormSimilarity - remove all length norms from scoring function - useful for very short fields, such as job titles
      • PayloadOnlySimilarity - only score terms on payloads. Useful when building a custom relevancy calculation where you want to disable field norms and tf and idf weightings (such as storing a vector field for conceptual search, or building a recommender system where you want to embed your own term weights from a machine learning model)
  • Custom Token Filters
    • TypeEraseFilter - erases the type field value from the tokens in an analysis chain. Useful if applying several sets of synonym filters, and you want to use only some of these filters to filter the resulting tokens with a TypeTokenFilterFactory
    • ConstantTokenFilter - emits a constant token for each token in the token stream. Useful for doing things like counting certain token types - use a synonym filter plus a TypeTokenFilter to filter to certain tokens, and then a ConstantTokenFilter to allow counting or boosting by the number of tokens using the termfreq() function at query time (or apply a negative boost using the count and the div function).
  • Dice custom MLT Handler
    • Allows top n terms per field, rather than across all fields specified
    • Fleshes out 'more like these' functionality - use multiple target documents to generate recommendations
    • boost query support - supports multiplicative boosts for matching items, for instance boost by relevancy and proximity to your user
    • Better support for content streams in place of source documents to generate recommendations from
  • Unsupervised Feedback Handler (a.k.a. blind feedback \ pseudo-relevancy feedback)
    • Implements a well-researched methodology from the field of information retrieval for improving relevancy.
    • Uses code based on the custom MLT handler to execute each query twice. The first execution uses the MLT code to grab the top terms for the result set by their tf.idf values. It then adds these terms to the original query (term expansion) and re-executes.
    • This two-phase execution happens inside Solr (one round trip), so it has a negligible impact on response time for most queries while noticeably improving relevancy.
  • DiceSuggester
    • A suggester that allows you to use one field type to source suggestions, and a separate field type to transform the matching suggestions into some other string using an analysis chain. For instance, we use it to map all variants of a skill into a canonical form (e.g. hadoop=>"Apache Hadoop") before returning the suggestions.
    • It allows a different field type to be used to process matching suggestions (suggestionAnalyzerFieldTypeName), such as applying synonyms, stemming, etc. This is necessary for applying the transformation, as the spellchecker needs to store the raw, unmodified tokens to do the auto-complete.
    • It also ensures all suggestions generated are UNIQUE.
    • Requires a comma-delimited set of files containing phrase counts (param - sourceLocation). These are phrases from the transformed field, along with their counts. See SolrConfigExamples - skillSuggest configuration.
  • DiceMultipleCaseSuggester
    • Solr suggester modification - can handle UPPER, lower and Title Case variations for type ahead.
    • Regular solr suggester functionality is case sensitive.
    • See SolrConfigExamples - titleSuggest configuration.
  • DiceSpellCheckComponent and DiceDirectSolrSpellChecker
    • The regular Solr spell check component can only search for corrections within an edit distance of 2 of each query term.
    • This component extends that functionality to allow you to embed a file of common user typos that will take precedence over the edit distance matches.
    • Allows you to data-mine common typos that go beyond an edit distance of 2 and inject them into your spellchecker, or override common bad spellchecking suggestions.
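
As an illustration of what MeanPayloadTokenFilter is described as doing (averaging payloads over duplicate terms), here is a minimal stand-alone sketch. It does not use the Lucene TokenStream API and is not the plugin's code; the class name MeanPayloadSketch, the WeightedTerm type and the method meanPayloads are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the plugin's MeanPayloadTokenFilter. */
final class MeanPayloadSketch {

    /** A term together with its payload weight, e.g. parsed from "java|0.9". */
    static final class WeightedTerm {
        final String term;
        final float payload;

        WeightedTerm(String term, float payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /**
     * Collapses duplicate terms into one entry per term that carries the
     * mean of the payloads seen for that term. First-seen order is kept.
     */
    static Map<String, Float> meanPayloads(List<WeightedTerm> tokens) {
        Map<String, float[]> sums = new LinkedHashMap<>(); // term -> {sum, count}
        for (WeightedTerm t : tokens) {
            float[] acc = sums.computeIfAbsent(t.term, k -> new float[2]);
            acc[0] += t.payload;
            acc[1] += 1f;
        }
        Map<String, Float> means = new LinkedHashMap<>();
        for (Map.Entry<String, float[]> e : sums.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

For example, the tokens java (1.0), hadoop (0.5), java (0.5) collapse to java with a mean payload of 0.75 and hadoop with 0.5.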
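
The counting pattern described for ConstantTokenFilter (filter to certain token types, then emit one constant token per surviving token so it can be counted) can be sketched in plain Java as follows. This is an assumption-laden illustration rather than the plugin's implementation: ConstantTokenSketch, Token and constantTokens are hypothetical names, and the type filtering stands in for what a TypeTokenFilter would do in a real analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the plugin's ConstantTokenFilter. */
final class ConstantTokenSketch {

    /** A token with the type assigned earlier in the analysis chain. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Keeps only tokens whose type is in keepTypes and replaces each of them
     * with the same constant string. The size of the result is the count that
     * a term-frequency lookup on the constant would return for this field.
     */
    static List<String> constantTokens(List<Token> tokens, Set<String> keepTypes, String constant) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (keepTypes.contains(t.type)) {
                out.add(constant);
            }
        }
        return out;
    }
}
```

With the tokens java (type SKILL), and (type word), hadoop (type SKILL) and keepTypes = {SKILL}, the output holds the constant twice, so the field can be boosted or penalised by its number of skill tokens.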
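
The two-phase flow of the Unsupervised Feedback Handler (run the query, take the top terms of the first result set by tf.idf, add them to the query, run it again) can be sketched over a tiny in-memory corpus. Everything here is an assumption made for illustration: the class BlindFeedbackSketch, its methods, the simplified scoring (term frequency times a Lucene-style idf) and the parameters feedbackDocCount and expansionTerms are hypothetical and do not reflect the handler's actual code or request parameters.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of pseudo-relevance feedback; not the plugin's handler. */
final class BlindFeedbackSketch {
    private final List<List<String>> docs;                  // each document is a list of terms
    private final Map<String, Integer> docFreq = new HashMap<>();

    BlindFeedbackSketch(List<List<String>> docs) {
        this.docs = docs;
        for (List<String> d : docs) {
            for (String t : new HashSet<>(d)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
    }

    /** Simplified inverse document frequency: 1 + ln(N / (df + 1)). */
    private double idf(String term) {
        return 1.0 + Math.log((double) docs.size() / (docFreq.getOrDefault(term, 0) + 1));
    }

    /** Ranks documents by the sum of tf * idf over the query terms; ties break on document id. */
    List<Integer> search(Collection<String> queryTerms, int limit) {
        Map<Integer, Double> scores = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            double s = 0;
            for (String q : queryTerms) {
                s += Collections.frequency(docs.get(i), q) * idf(q);
            }
            if (s > 0) {
                scores.put(i, s);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    /** Picks the n highest tf.idf terms from the feedback documents, skipping the query's own terms. */
    List<String> topTerms(List<Integer> feedbackDocs, Set<String> exclude, int n) {
        Map<String, Double> weights = new HashMap<>();
        for (int id : feedbackDocs) {
            for (String t : docs.get(id)) {
                if (!exclude.contains(t)) {
                    weights.merge(t, idf(t), Double::sum); // one idf per occurrence = tf * idf
                }
            }
        }
        List<String> terms = new ArrayList<>(weights.keySet());
        terms.sort((a, b) -> {
            int byWeight = Double.compare(weights.get(b), weights.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    /** Phase 1: initial search. Phase 2: search again with the expanded query. */
    List<Integer> searchWithFeedback(Set<String> query, int feedbackDocCount, int expansionTerms, int limit) {
        List<Integer> firstPass = search(query, feedbackDocCount);
        Set<String> expanded = new LinkedHashSet<>(query);
        expanded.addAll(topTerms(firstPass, query, expansionTerms));
        return search(expanded, limit);
    }
}
```

In this sketch both passes run in the same process, which mirrors the single round trip described above. A document that shares no term with the original query can still be returned if it contains one of the expansion terms.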
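
The DiceSuggester behaviour described above (match on raw tokens, transform each match into a canonical form, return only unique suggestions) might look roughly like this in plain Java. The sketch replaces the Solr analysis chain with a simple lookup map, which is an assumption made to keep it self-contained; CanonicalSuggestSketch, suggest and canonicalForms are hypothetical names, not the plugin's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the plugin's DiceSuggester. */
final class CanonicalSuggestSketch {

    /**
     * Finds raw terms starting with the prefix (case-insensitive), maps each one
     * to its canonical form when one is known, and returns the distinct results
     * in match order, up to limit entries.
     */
    static List<String> suggest(String prefix, List<String> rawTerms,
                                Map<String, String> canonicalForms, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<String> unique = new LinkedHashSet<>(); // drops duplicate canonical forms
        for (String raw : rawTerms) {
            if (unique.size() >= limit) {
                break;
            }
            if (raw.toLowerCase(Locale.ROOT).startsWith(p)) {
                unique.add(canonicalForms.getOrDefault(raw, raw));
            }
        }
        return new ArrayList<>(unique);
    }
}
```

If both "hadoop" and "hadoop cluster" map to "Apache Hadoop", the prefix "had" yields a single suggestion, "Apache Hadoop", rather than two.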
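
One plausible way to support UPPER, lower and Title Case type-ahead, as DiceMultipleCaseSuggester is described as doing, is to derive all three variants of each phrase so that any of them can match the user's input. Whether the plugin works this way internally is not stated here, so treat the following as a guess at the general idea; CaseVariantsSketch, variants and titleCase are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not the plugin's DiceMultipleCaseSuggester. */
final class CaseVariantsSketch {

    /** Returns the UPPER, lower and Title Case forms of a phrase, without duplicates. */
    static Set<String> variants(String phrase) {
        Set<String> out = new LinkedHashSet<>();
        out.add(phrase.toUpperCase(Locale.ROOT));
        out.add(phrase.toLowerCase(Locale.ROOT));
        out.add(titleCase(phrase));
        return out;
    }

    /** Upper-cases the first character of every whitespace-separated word and lower-cases the rest. */
    static String titleCase(String phrase) {
        StringBuilder sb = new StringBuilder(phrase.length());
        boolean startOfWord = true;
        for (char c : phrase.toCharArray()) {
            if (Character.isWhitespace(c)) {
                startOfWord = true;
                sb.append(c);
            } else {
                sb.append(startOfWord ? Character.toUpperCase(c) : Character.toLowerCase(c));
                startOfWord = false;
            }
        }
        return sb.toString();
    }
}
```

For "java developer" this produces "JAVA DEVELOPER", "java developer" and "Java Developer".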
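
The precedence rule described for DiceSpellCheckComponent (a file of known typos wins over edit-distance matches, and edit-distance matching stops at a distance of 2) can be sketched as follows. This is a simplified stand-in, not the component's code: it uses plain Levenshtein distance over an in-memory dictionary, and TypoOverrideSketch, correct, knownTypos and dictionary are hypothetical names.

```java
import java.util.Collection;
import java.util.Map;

/** Hypothetical illustration only; not the plugin's DiceSpellCheckComponent. */
final class TypoOverrideSketch {

    /** Plain Levenshtein distance (insert, delete, substitute), using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the mined correction if the word is a known typo; otherwise the
     * closest dictionary word within an edit distance of 2, or null if none.
     */
    static String correct(String word, Map<String, String> knownTypos, Collection<String> dictionary) {
        String override = knownTypos.get(word);
        if (override != null) {
            return override; // mined typos take precedence over edit-distance matches
        }
        String best = null;
        int bestDistance = 3; // only distances 0..2 are accepted
        for (String candidate : dictionary) {
            int d = editDistance(word, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

Here "js" can be corrected to "javascript" through the typo map even though the two strings are far more than 2 edits apart, while "jaava" still falls back to the ordinary edit-distance match "java".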

Should be compatible with solr versions 4+ and 5+ and 6+. Please contact us via the issues list in this repository with any questions, bug reports, feedback or feature requests.

See Also

Please also check out the other Solr-related DiceTechJobs repositories:

  • SolrConfigExamples - Example solr configurations for using the functionality in the plugins.
  • ConceptualSearch - Dice's implementation of Conceptual or Semantic search, for use in solr using Word2Vec.
  • RelevancyTuning - Automatic approach to relevancy tuning your Solr configuration. Uses reinforcement learning and evolutionary algorithms to evolve an optimal solr configuration (field boosts, tie parameter setting, query handler, etc).
  • RelevancyFeedback - Slightly updated (and renamed) version of the MLT handler and unsupervised feedback handler that handle personalized search scenarios.