RcppMeCab: Rcpp Interface of CJK Morpheme Analyzer MeCab

RcppMeCab

RcppMeCab is an Rcpp wrapper for MeCab, a part-of-speech and morphological analyzer. It handles UTF-8 natively in its C++ code and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries. Because the analysis runs in compiled C++ code through Rcpp, text analysis in R is faster.

Please see this for easy installation and usage examples in Korean.

Installation

Linux and Mac OSX

First, install the MeCab build for your language of choice.

Second, you can install RcppMeCab from CRAN with:

install.packages("RcppMeCab") # build from source

# or, install the development version from GitHub
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

Windows

Before installing the package, set the language you want to analyze with the environment variable MECAB_LANG. The default value is ko; to analyze Japanese or Chinese, set it to ja.

install.packages("RcppMeCab") # installs the Korean version (default)

# or, install for Japanese
Sys.setenv(MECAB_LANG = 'ja') # select the Japanese version before installing
install.packages("RcppMeCab", type="source") # build from source

# or, install the development version from GitHub
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

To run the analysis, you also need the MeCab binary and a dictionary.

For Korean:

Install the mecab-ko-msvc and mecab-ko-dic-msvc builds that match your 32-bit or 64-bit Windows version into C:\mecab. Pass the dictionary directory to the RcppMeCab functions through the sys_dic argument.

For Japanese:

Install the MeCab binary. Pass the dictionary directory to the RcppMeCab functions through the sys_dic argument. For example: pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
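Since sys_dic has to point at a directory that contains a dicrc file, it can help to check for that file before calling pos. The following is a minimal sketch, not part of RcppMeCab: find_sys_dic is a hypothetical helper, and the Korean path below assumes a mecab-ko-dic subdirectory under C:/mecab, which may differ on your machine.

```r
# Hypothetical helper (not part of RcppMeCab): return the first candidate
# directory that contains a dicrc file, or "" if none does.
find_sys_dic <- function(candidates) {
  for (dir in candidates) {
    if (file.exists(file.path(dir, "dicrc"))) {
      return(dir)
    }
  }
  ""
}

# Candidate locations. The first path is an assumption about where
# mecab-ko-dic-msvc was unpacked; the second is the Japanese example path.
candidates <- c("C:/mecab/mecab-ko-dic", "C:/PROGRA~2/mecab/dic/ipadic")
sys_dic <- find_sys_dic(candidates)

# Store the directory as the session default only when one was found.
if (nzchar(sys_dic)) {
  options(mecabSysDic = sys_dic)
}
```

If none of the candidates exists, sys_dic stays "" and no option is set, so nothing changes for the session.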

Usage

This package provides two functions, pos and posParallel.

pos(sentence) # returns a list; each element is named after its input sentence
pos(sentence, join = FALSE) # returns morphemes only (tags are given as the vector names)
pos(sentence, format = "data.frame") # returns the result as a data frame
pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory, but is much faster than looping in a single thread
  • sentence: the text to analyze
  • join: If TRUE, each token is returned as (morpheme/tag). If FALSE, each token is returned as (morpheme), with its tag stored as the element name.
  • format: The default is a list. If you set this to "data.frame", the function returns the result as a data frame.
  • sys_dic: the directory in which the dicrc file is located. The default value is ""; you can set your own default with options(mecabSysDic = "").
  • user_dic: a user dictionary file compiled by mecab-dict-index. The default value is also "".
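To illustrate the joined (morpheme/tag) form, the sketch below splits such strings back into separate morpheme and tag columns. split_joined is a hypothetical helper, not part of RcppMeCab, and the example tokens are written by hand for illustration rather than taken from real MeCab output.

```r
# Hypothetical helper (not part of RcppMeCab): split "morpheme/tag" strings
# into a data frame with one row per token.
split_joined <- function(tokens) {
  # Locate the LAST "/" so that a morpheme containing "/" stays intact.
  slash <- regexpr("/[^/]*$", tokens)
  data.frame(
    morpheme = substr(tokens, 1, slash - 1),
    tag = substr(tokens, slash + 1, nchar(tokens)),
    stringsAsFactors = FALSE
  )
}

# Hand-written tokens in the (morpheme/tag) form; the tags are illustrative.
example_tokens <- c("R/SL", "package/NNG", "a/b/SY")
parsed <- split_joined(example_tokens)
print(parsed)
```

With RcppMeCab and a dictionary installed, the same helper could be applied to each element of the list returned by pos(sentence). Calling pos(sentence, format = "data.frame") is the built-in route to tabular output.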

Compiling User Dictionary

The MeCab API provides a DictionaryCompiler, but it calls die() internally. Calling it through Rcpp therefore crashes the entire R session, so RcppMeCab does not include a function for compiling dictionaries.

For Japanese, please refer to the MeCab documentation.

Unix and Mac OSX

You need a model file (model_file) if you want MeCab to estimate costs automatically.

To compile a Korean user dictionary, you need the entire mecab-ko-dic source. The user dictionary itself must be prepared as a CSV file. The CSV structure is described in the Japanese and Korean MeCab documentation.

Compile:

$ /usr/local/libexec/mecab/mecab-dict-index -m `model_file` -d `mecab_dic_location` -u `user_dictionary_file_name` -f `CSV file charset` -t `original dictionary charset` `target_csv`

# example

$ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
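The same command can also be assembled and run from R with system2, outside of RcppMeCab. This is a minimal sketch under stated assumptions: build_dict_index_args is a hypothetical helper, and the paths are the ones from the example above, so they have to be adjusted to your installation.

```r
# Hypothetical helper (not part of RcppMeCab): build the argument vector for
# mecab-dict-index using the flags shown in the command above.
build_dict_index_args <- function(model_file, dic_dir, user_dic, csv_file,
                                  csv_charset = "utf8", dic_charset = "utf8") {
  c("-m", path.expand(model_file),
    "-d", path.expand(dic_dir),
    "-u", user_dic,
    "-f", csv_charset,
    "-t", dic_charset,
    path.expand(csv_file))
}

# Paths copied from the example command; they are assumptions about your setup.
args <- build_dict_index_args(
  model_file = "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin",
  dic_dir    = "~/mecab-ko-dic-2.0.3-20170922",
  user_dic   = "userdic.dic",
  csv_file   = "~/person.csv"
)

binary <- "/usr/local/libexec/mecab/mecab-dict-index"

# Run the compiler only when both the binary and the CSV file exist.
if (file.exists(binary) && file.exists(path.expand("~/person.csv"))) {
  status <- system2(binary, shQuote(args))
  if (status != 0) {
    warning("mecab-dict-index exited with status ", status)
  }
} else {
  message("mecab-dict-index or the CSV file was not found; command not run")
}
```

Running the compiler as a separate process keeps any failure inside mecab-dict-index from taking the R session down with it.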

Windows

  • Korean: mecab-ko-msvc has mecab-dict-index.exe.
  • Japanese: MeCab binary version has mecab-dict-index.exe.

Use it the same way as the Linux binary to compile the dictionary.

TODOs

  • Provide multilanguage manuals for international support

Author

Junhewk Kim ([email protected])

Contributor

Kato Akiru