kazuhira-r / kuromoji-with-mecab-neologd-buildscript

buildscript for Kuromoji with mecab-neologd

These scripts build Lucene Kuromoji or Atilika Kuromoji with the mecab-ipadic-NEologd dictionary bundled.

What's Lucene Kuromoji

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Kuromoji is the Japanese morphological analyzer included in Apache Lucene.

What's Atilika Kuromoji

Kuromoji is an open source Japanese morphological analyzer written in Java.

What's NEologd

mecab-ipadic-NEologd : Neologism dictionary for MeCab

Note: These build scripts support only the IPA dictionary.

Supported versions

Lucene Kuromoji: 4.x, 5.x, 6.x, 7.x, 8.x

Atilika Kuromoji: 0.9.0

Usage

Requirements

To use this script, you must install the following software.

Note: A build uses a lot of CPU and memory. About 5-6 GB of Java VM heap is needed at present.
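Because of the memory requirement, it can help to check the machine before starting a long build. The following is a minimal sketch, not part of the build scripts: the function name `enough_memory` and the 6 GB threshold are assumptions chosen for illustration, and total RAM is only a rough proxy for the heap the JVM can actually obtain.

```shell
# Return success if total memory (in kB) is at least the required number of GB.
enough_memory() {
  # $1: total memory in kB, $2: required memory in GB
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  if enough_memory "$total_kb" 6; then
    echo "memory looks sufficient for the build"
  else
    echo "warning: less than 6 GB of RAM; the build may fail" >&2
  fi
fi
```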

Build Lucene Kuromoji for mecab-ipadic-NEologd

Install

$ git clone https://github.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript

or

$ wget https://raw.githubusercontent.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript/master/build-lucene-kuromoji-with-mecab-ipadic-neologd.sh

Grant execute permission to the script.

$ chmod a+x build-lucene-kuromoji-with-mecab-ipadic-neologd.sh

Build

Run the script from any directory.

$ /path/to/build-lucene-kuromoji-with-mecab-ipadic-neologd.sh

When the script starts, it prints the build options in effect.

### [2016-12-18 17:57:02] [main] [INFO] START.

####################################################################
applied build options.

[Auto Install MeCab Version                  ]    ... mecab-0.996
[mecab-ipadic-NEologd Tag                (-N)]    ... master

*** deprecated option *** 
[install adjective ext                   (-T)]    ... 0
*** deprecated option *** 


[Max BaseForm Length                         ]    ... 15
[Lucene Version Tag                      (-L)]    ... releases/lucene-solr/6.3.0
[Kuromoji build Max Heapsize             (-M)]    ... 6g
[Kuromoji JAR File Output Directory Name (-o)]    ... .
[Kuromoji Package Name                   (-p)]    ... org.apache.lucene.analysis.ja

####################################################################

The built JAR file is created in the user-specified output directory (default: the current directory, where you run the script).

$ ls -l
total 51832
-rw-rw-r-- 1 xyz xyz 51655324 Dec 18 18:05 lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar
drwxrwxr-x 6 xyz xyz     4096 Dec 18 18:02 lucene-solr
drwxrwxr-x 8 xyz xyz     4096 Jul 23 00:32 mecab
drwxr-xr-x 8 xyz xyz     4096 Jul 23 00:31 mecab-0.996
-rw-rw-r-- 1 xyz xyz  1398663 Jul 23 00:31 mecab-0.996.tar.gz
drwxrwxr-x 9 xyz xyz     4096 Dec 18 17:59 mecab-ipadic-neologd

In this case, the built JAR file is "lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar".

JAR file naming

The built JAR file is named as follows.

naming:
lucene-analyzers-kuromoji-ipadic-neologd-[Lucene Version]-[mecab-ipadic-NEologd dictionary date].jar

example:
lucene-analyzers-kuromoji-ipadic-neologd-6.3.0-20161215.jar
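The naming rule can be expressed as a small helper, for example to predict the output file name in a deployment script. This is a sketch only: `lucene_jar_name` is a hypothetical function, and taking the version from the last path component of the Lucene tag is an assumption based on the sample output above, not something taken from the build script.

```shell
# Compose the expected JAR file name from a Lucene version and a
# mecab-ipadic-NEologd dictionary date (YYYYMMDD).
lucene_jar_name() {
  # $1: Lucene version, e.g. 6.3.0
  # $2: dictionary date, e.g. 20161215
  printf 'lucene-analyzers-kuromoji-ipadic-neologd-%s-%s.jar\n' "$1" "$2"
}

# Assumption: the version is the last path component of the Lucene tag.
tag="releases/lucene-solr/6.3.0"
lucene_jar_name "${tag##*/}" 20161215
```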

Build options

  • -N - branch or tag name of mecab-ipadic-NEologd to include in the build. default: master
  • ***deprecated*** -T - install the adjective ext. Specify 1 to enable it. default: 0
  • -L - branch or tag name of the Apache Lucene version to build. default: the latest Apache Lucene release tag.
  • -M - maximum heap size for the Kuromoji build.
  • -o - output directory for the generated Kuromoji JAR file. default: . (current directory)
  • -p - package name to use for the build. default: org.apache.lucene.analysis.ja (original package)
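As an illustration, the options can be combined in one invocation. The sketch below only uses the flags listed above. The script path, the `./jars` output directory, and the package name `org.apache.lucene.analysis.ja.neologd` are illustrative assumptions. It prints the command first and runs it only when the script is actually present.

```shell
# Path to the build script; adjust to where you saved it (assumption).
script=./build-lucene-kuromoji-with-mecab-ipadic-neologd.sh

# Create the illustrative output directory up front.
mkdir -p ./jars

# Options from the list above, with illustrative values.
set -- \
  -N master \
  -L releases/lucene-solr/6.3.0 \
  -M 6g \
  -o ./jars \
  -p org.apache.lucene.analysis.ja.neologd

# Show the full command, then run it only if the script is executable.
echo "$script $*"
if [ -x "$script" ]; then
  "$script" "$@"
fi
```

Changing the package with -p is what would let the rebuilt analyzer sit next to the stock Lucene Kuromoji classes on the same classpath, since the two would then not share class names.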

Build Atilika Kuromoji for mecab-ipadic-NEologd

Install

$ git clone https://github.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript

or

$ wget https://raw.githubusercontent.com/kazuhira-r/kuromoji-with-mecab-neologd-buildscript/master/build-atilika-kuromoji-with-mecab-ipadic-neologd.sh

Grant execute permission to the script.

$ chmod a+x build-atilika-kuromoji-with-mecab-ipadic-neologd.sh

Build

Run the script from any directory.

$ /path/to/build-atilika-kuromoji-with-mecab-ipadic-neologd.sh

When the script starts, it prints the build options in effect.

### [2016-12-18 23:10:54] [main] [INFO] START.

####################################################################
applied build options.

[Auto Install MeCab Version                  ]    ... mecab-0.996
[mecab-ipadic-NEologd Tag                (-N)]    ... master

*** deprecated option *** 
[install adjective ext                   (-T)]    ... 0
*** deprecated option *** 

[Kuromoji Version Tag                    (-K)]    ... 0.9.0
[Kuromoji build Max Heapsize             (-M)]    ... 7g
[Kuromoji JAR File Output Directory Name (-o)]    ... .
[Kuromoji Package Name                   (-p)]    ... com.atilika.kuromoji.ipadic

####################################################################

The built JAR file is created in the user-specified output directory (default: the current directory, where you run the script).

$ ls -l
total 133572
drwxrwxr-x 10 xyz xyz      4096 Dec 18 23:13 kuromoji
-rw-rw-r--  1 xyz xyz 135352388 Dec 18 23:33 kuromoji-ipadic-neologd-0.9.0-20161215.jar
drwxrwxr-x  8 xyz xyz      4096 Dec 18 22:39 mecab
drwxr-xr-x  8 xyz xyz      4096 Dec 18 22:39 mecab-0.996
-rw-rw-r--  1 xyz xyz   1398663 Jul 23 00:32 mecab-0.996.tar.gz
drwxrwxr-x  9 xyz xyz      4096 Dec 18 23:11 mecab-ipadic-neologd

In this case, the built JAR file is "kuromoji-ipadic-neologd-0.9.0-20161215.jar".

JAR file naming

The built JAR file is named as follows.

naming:
kuromoji-ipadic-neologd-[Atilika Kuromoji Version]-[mecab-ipadic-NEologd dictionary date].jar

example:
kuromoji-ipadic-neologd-0.9.0-20161215.jar
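Going the other way, the version and dictionary date can be recovered from a file name with plain shell parameter expansion. This is a sketch under the assumption that the name follows the pattern above exactly; the variable names are illustrative.

```shell
# Split a JAR file name of the form
#   kuromoji-ipadic-neologd-[version]-[date].jar
# into its version and dictionary date.
jar="kuromoji-ipadic-neologd-0.9.0-20161215.jar"

base="${jar%.jar}"                        # strip the .jar suffix
rest="${base#kuromoji-ipadic-neologd-}"   # leaves [version]-[date]
dict_date="${rest##*-}"                   # text after the last hyphen
version="${rest%-*}"                      # text before the last hyphen

echo "version=$version dict_date=$dict_date"
```

Splitting on the last hyphen keeps the parse correct even if the version part itself contains a hyphen.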

Build options

  • -N - branch or tag name of mecab-ipadic-NEologd to include in the build. default: master
  • ***deprecated*** -T - install the adjective ext. Specify 1 to enable it. default: 0
  • -K - branch or tag name of the Atilika Kuromoji version to build. default: the latest Atilika Kuromoji release tag.
  • -M - maximum heap size for the Kuromoji build.
  • -o - output directory for the generated Kuromoji JAR file. default: . (current directory)
  • -p - package name to use for the build. default: com.atilika.kuromoji.ipadic (original package)

Internal Process

The scripts perform the following steps.

  • Check whether MeCab is installed; if it is not, install MeCab in the current directory
  • Clone mecab-ipadic-NEologd
  • Generate a dictionary CSV (using libexec/make-mecab-ipadic-neologd.sh -L)
  • Clone the Apache Lucene or Atilika Kuromoji source code
  • (Lucene Kuromoji only) Edit Apache Lucene Kuromoji's build.xml
  • Rename the package, if necessary
  • Build Kuromoji and the dictionary with mecab-ipadic-NEologd
  • Copy the JAR file to the specified directory (default: current directory)
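The first step in the list, checking for MeCab, could look roughly like the sketch below. It is not taken from the build scripts: `has_command` is a hypothetical helper, and looking up `mecab` on PATH is an assumption about how such a check might be done.

```shell
# Return success if the given command is available on PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

if has_command mecab; then
  echo "MeCab found: $(command -v mecab)"
else
  echo "MeCab not found; the build script installs it in the current directory"
fi
```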

LICENSE

Copyright © 2015, 2016, 2017, 2018, 2019 kazuhira-r

Licensed under the Apache License, Version 2.0