hankcs / Hanlp Lucene Plugin

Licence: apache-2.0

HanLP Chinese word segmentation plugin for Lucene. Supports Lucene-based systems, including Solr.

Programming Languages

java


Labels

nlp solr lucene

Projects that are alternatives to, or similar to, Hanlp Lucene Plugin

RelevancyTuning

Dice.com tutorial on using black box optimization algorithms to do relevancy tuning on your Solr Search Engine Configuration from Simon Hughes Dice.com

Stars: ✭ 28 (-89.71%)

Mutual labels: solr, lucene

Fxdesktopsearch

A JavaFX based desktop search application.

Stars: ✭ 147 (-45.96%)

Mutual labels: solr, lucene

Springboot Templates

Spring Boot integration with Dubbo and Netty; NoSQL templates for Redis and MongoDB; MQ templates for Kafka, RocketMQ, and RabbitMQ; query engines Solr, SolrCloud, and Elasticsearch

Stars: ✭ 100 (-63.24%)

Mutual labels: solr, lucene

Vectorsinsearch

Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015

Stars: ✭ 71 (-73.9%)

Mutual labels: solr, lucene

solr-container

Ansible Container project that manages the lifecycle of Apache Solr on Docker.

Stars: ✭ 17 (-93.75%)

Mutual labels: solr, lucene

Solrplugins

Dice Solr Plugins from Simon Hughes Dice.com

Stars: ✭ 86 (-68.38%)

Mutual labels: solr, lucene

Querqy

Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)

Stars: ✭ 122 (-55.15%)

Mutual labels: solr, lucene

Ik Analyzer

Supports Lucene 5/6/7/8+; maintained long-term.

Stars: ✭ 112 (-58.82%)

Mutual labels: solr, lucene

jease

Jease is a Java CMS framework based on Object Database

Stars: ✭ 25 (-90.81%)

Mutual labels: solr, lucene

solr

Apache Solr open-source search software

Stars: ✭ 651 (+139.34%)

Mutual labels: solr, lucene

Ik Analyzer Solr

ik-analyzer for solr 7.x-8.x

Stars: ✭ 1,017 (+273.9%)

Mutual labels: solr, lucene

nlpir-analysis-cn-ictclas

Lucene/Solr analyzer plugin. Supports macOS, Linux x86/64, and Windows x86/64. It is a Maven project, so you can change the Lucene/Solr version to stay compatible with the release you run.

Stars: ✭ 71 (-73.9%)

Mutual labels: solr, lucene

Lucene Solr

Apache Lucene and Solr open-source search software

Stars: ✭ 4,217 (+1450.37%)

Mutual labels: solr, lucene

Jeeplatform

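A base platform for enterprise information-system development that plans to integrate the common business functions of enterprise systems such as OA (office automation) and CMS (content management system). JeePlatform is a general-purpose base platform built around Spring Boot as its core framework, combined with the MyBatis ORM framework, the Spring MVC web framework, and a number of open source components. The code has been donated to the OSChina (开源中国) community.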

Stars: ✭ 1,285 (+372.43%)

Mutual labels: solr, lucene

Code4java

Repository for my java projects.

Stars: ✭ 164 (-39.71%)

Mutual labels: solr, lucene

jstarcraft-nlp

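Focuses on several core problems in natural language processing: lexical analysis, syntactic analysis, semantic analysis, language detection, information extraction, text clustering, and text classification. Provides developers in these areas with a complete general-purpose design and reference implementation. Covers a range of NLP algorithms and adapts to multiple NLP frameworks. Compatible with Lucene/Solr/Elasticsearch plugins.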

Stars: ✭ 92 (-66.18%)

Mutual labels: solr, lucene

SolrConfigExamples

Examples of Solr configuration entries for Solr plugins and Conceptual Search\Semantic Search from Simon Hughes Dice.com

Stars: ✭ 26 (-90.44%)

Mutual labels: solr, lucene

conciliator

OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.

Stars: ✭ 95 (-65.07%)

Mutual labels: solr

lib

Perl Utility Library for my other repos

Stars: ✭ 16 (-94.12%)

Mutual labels: solr

bitnami-docker-solr

Bitnami Docker Image for Solr

Stars: ✭ 33 (-87.87%)

Mutual labels: solr


hanlp-lucene-plugin

HanLP Chinese word segmentation plugin for Lucene

Built on HanLP. Supports any system based on Lucene (7.x), including Solr (7.x).

Maven

    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>

Solr Quick Start

Place the two JARs, hanlp-portable.jar and hanlp-lucene-plugin.jar, in ${webapp}/WEB-INF/lib. (Alternatively, build the source with mvn package and copy target/hanlp-lucene-plugin-x.x.x.jar to ${webapp}/WEB-INF/lib.)
Edit the Solr core's configuration file ${core}/conf/schema.xml:

  <fieldType name="text_cn" class="solr.TextField">
      <analyzer type="index">
          <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
      </analyzer>
      <analyzer type="query">
          <!-- Never enable index mode in the query analyzer -->
          <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
      </analyzer>
  </fieldType>
  <!-- Every field in your application that needs segmentation must set its type to text_cn -->
  <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
  <field name="my_field2" type="text_cn" indexed="true" stored="true"/>

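If your application has other fields, such as location or summary, each of them must also be declared with type="text_cn". Otherwise those fields will still use Solr's default tokenizer.
Also, never enable indexMode in the query analyzer; doing so interferes with PhraseQuery. indexMode only needs to be enabled once, on the index side.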
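
A toy sketch of why the index and query analyzers are configured differently. It assumes that index mode emits the shorter dictionary words contained in a long word in addition to the word itself. The mini-dictionary and the `indexModeTerms` helper are hypothetical, and the code is plain Java rather than HanLP's actual segmentation algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model of index-mode segmentation; not HanLP's real algorithm. */
class IndexModeSketch {
    // Hypothetical mini-dictionary, for illustration only.
    static final Set<String> DICTIONARY =
            Set.of("中华", "华人", "人民", "共和", "共和国", "人民共和国");

    /** Returns the word itself, followed by every dictionary entry found inside it. */
    static List<String> indexModeTerms(String word) {
        List<String> terms = new ArrayList<>();
        terms.add(word);
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 2; end <= word.length(); end++) {
                String sub = word.substring(start, end);
                if (!sub.equals(word) && DICTIONARY.contains(sub)) {
                    terms.add(sub);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String word = "中华人民共和国";
        // Index side: the long word plus its sub-words, so shorter queries can match.
        System.out.println("index side: " + indexModeTerms(word));
        // Query side: the long word is kept as a single term.
        System.out.println("query side: " + List.of(word));
    }
}
```

Under this model, the sub-words are already in the index, so a query for a shorter word can match without index mode on the query side. Enabling it there would only add extra terms to the query.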

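Advanced Configuration

The plugin currently supports the following schema.xml-based options:

Option	Description	Default
algorithm	Segmentation algorithm	viterbi
enableIndexMode	Enable index mode (never enable it in the query analyzer)	true
enableCustomDictionary	Whether to enable the user dictionary	true
customDictionaryPath	User dictionary path (an absolute path, or a relative path the program can read; separate multiple dictionaries with spaces)	null
enableCustomDictionaryForcing	Give the user dictionary high priority	false
stopWordDictionaryPath	Stop word dictionary path	null
enableNumberQuantifierRecognize	Whether to recognize numerals and numeral-quantifier compounds	true
enableNameRecognize	Enable person name recognition	true
enableTranslatedNameRecognize	Whether to recognize transliterated person names	false
enableJapaneseNameRecognize	Whether to recognize Japanese person names	false
enableOrganizationRecognize	Enable organization name recognition	false
enablePlaceRecognize	Enable place name recognition	false
enableNormalization	Whether to normalize characters (Traditional -> Simplified, full-width -> half-width, uppercase -> lowercase)	false
enableTraditionalChineseMode	Enable precise Traditional Chinese segmentation	false
enableDebug	Enable debug mode	false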
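
These options are set as attributes on the tokenizer element, in the same way as enableIndexMode in the Solr configuration. The following minimal sketch renders such an element from a map of options. The `tokenizerElement` helper and the dictionary paths are hypothetical and exist only for illustration; they are not part of the plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Renders a schema.xml tokenizer element from option name/value pairs. */
class TokenizerConfigSketch {
    static final String FACTORY = "com.hankcs.lucene.HanLPTokenizerFactory";

    /** Escapes the characters that are not allowed inside an XML attribute value. */
    static String escape(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;").replace("<", "&lt;");
    }

    /** Builds the element; each map entry becomes one attribute, in insertion order. */
    static String tokenizerElement(Map<String, String> options) {
        StringBuilder xml = new StringBuilder("<tokenizer class=\"" + FACTORY + "\"");
        for (Map.Entry<String, String> option : options.entrySet()) {
            xml.append(' ').append(option.getKey())
               .append("=\"").append(escape(option.getValue())).append('"');
        }
        return xml.append("/>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("enableIndexMode", "true");
        options.put("enablePlaceRecognize", "true");
        // Hypothetical paths; multiple dictionaries are separated by spaces.
        options.put("customDictionaryPath", "dict/custom1.txt dict/custom2.txt");
        System.out.println(tokenizerElement(options));
    }
}
```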

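More advanced settings are configured mainly through hanlp.properties on the classpath. See the documentation of the HanLP natural language processing package for the related settings, such as:

User dictionaries
Part-of-speech tagging
Simplified/Traditional Chinese conversion
……

Stop Words and Synonyms

We recommend implementing these with the filters built into Lucene or Solr; this plugin does not try to take over that job. An example configuration: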

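    <!-- text_cn field type: uses the HanLP tokenizer with index mode enabled. Tokens are filtered by Solr's built-in stop word filter using "stopwords.txt" (empty by default).
	 At search time, Solr's built-in synonym dictionary is also supported. -->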
    <fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- Uncomment to enable the index-time synonym dictionary
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- Every field in your application that needs segmentation must set its type to text_cn -->
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>
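
To make the order of the query-side filters concrete, here is a toy model in plain Java: stop word removal, then synonym expansion, then lowercasing. It does not use Solr's filter classes. The in-memory stop word set and synonym map are hypothetical stand-ins for stopwords.txt and synonyms.txt, and the sketch flattens expanded synonyms into a list, whereas a real synonym filter places them at the same token position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of the query analyzer's filter chain; not Solr's implementation. */
class QueryFilterChainSketch {
    // Hypothetical stand-in for stopwords.txt.
    static final Set<String> STOP_WORDS = Set.of("的", "the");
    // Hypothetical stand-in for synonyms.txt with expand="true":
    // every term in a synonym group maps to the whole group.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "电脑", List.of("电脑", "计算机"),
            "计算机", List.of("电脑", "计算机"));

    static List<String> applyQueryFilters(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String key = token.toLowerCase(Locale.ROOT); // ignoreCase="true"
            if (STOP_WORDS.contains(key)) {
                continue;                                // stop filter drops the token
            }
            for (String term : SYNONYMS.getOrDefault(key, List.of(token))) {
                out.add(term.toLowerCase(Locale.ROOT));  // lowercase filter runs last
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(applyQueryFilters(List.of("The", "电脑", "的", "Price")));
    }
}
```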

Usage

When rewriting queries, you can use attributes such as the part of speech from HanLPAnalyzer's token output, for example:

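// Imports needed by the snippet below
import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

String text = "中华人民共和国很辽阔";
// Print each character followed by its index, to make the offsets below easier to read
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // Start and end offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // Position increment
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // Part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}
// Finish and release the stream
tokenStream.end();
tokenStream.close();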
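
One way such part-of-speech information might feed a query rewrite is to keep only the noun-like tokens. The sketch below is plain Java with a hand-written token list, so it runs without Lucene or HanLP. The `Token` class, the `keepNouns` helper, the sample tags, and the assumption that noun tags start with "n" are all illustrative. In a real rewrite the values would come from CharTermAttribute and TypeAttribute.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Toy query rewrite that filters tokens by part-of-speech tag. */
class PosQueryRewriteSketch {
    /** Hypothetical holder for a term and its part-of-speech tag. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /** Keeps terms whose tag starts with "n" (assumed here to mark noun-like words). */
    static List<String> keepNouns(List<Token> tokens) {
        return tokens.stream()
                .filter(token -> token.type.startsWith("n"))
                .map(token -> token.term)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hand-written sample tokens, not captured HanLP output.
        List<Token> tokens = List.of(
                new Token("中华人民共和国", "ns"),
                new Token("很", "d"),
                new Token("辽阔", "a"));
        System.out.println(keepNouns(tokens));
    }
}
```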

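In other scenarios, you can construct a HanLPTokenizer from a custom segmenter (for example, one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter):

// Arguments: the custom segmenter, a stop word filter set (null for none), and whether to enable stemming
HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
                                    .enableJapaneseNameRecognize(true)
                                    .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣？"));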

License

Apache License Version 2.0


Stars: ✭ 272
