

Lindera

License: MIT | Chat: https://gitter.im/lindera-morphology/lindera

A morphological analysis library in Rust. This project is a fork of kuromoji-rs.

Lindera aims to be an easy-to-install library that provides concise APIs for various Rust applications.

The following is required to build Lindera:

  • Rust >= 1.46.0

Usage

Make sure you have activated the "full" feature of the lindera crate in your Cargo.toml:

[dependencies]
lindera = { version = "0.12.0", features = ["full"] }
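Note that the example commands later in this README pass --features=ipadic rather than full. Depending on the Lindera version you use, dictionaries may also be enabled as individual features; the fragment below assumes an ipadic feature exists in your version, so check the crate documentation for the exact feature names:

```toml
[dependencies]
lindera = { version = "0.12.0", features = ["ipadic"] }
```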

Basic example

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // create tokenizer
    let tokenizer = Tokenizer::new()?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic --example=basic_example

You can see the result as follows:

関西国際空港
限定
トートバッグ

User dictionary example

You can provide user dictionary entries alongside the default system dictionary. The user dictionary must be a CSV file in the following format:

<surface_form>,<part_of_speech>,<reading>

For example:

% cat ./resources/simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

With a user dictionary, the Tokenizer is created as follows:

use std::path::PathBuf;

use lindera::LinderaResult;
use lindera::{
    mode::Mode,
    tokenizer::{
        DictionaryConfig, DictionaryKind, DictionarySourceType, Tokenizer, TokenizerConfig,
        UserDictionaryConfig,
    },
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: DictionaryKind::IPADIC,
        path: None,
    };

    let user_dictionary = Some(UserDictionaryConfig {
        kind: DictionaryKind::IPADIC,
        source_type: DictionarySourceType::Csv,
        path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
    });

    // create tokenizer
    let config = TokenizerConfig {
        dictionary,
        user_dictionary,
        mode: Mode::Normal,
    };
    let tokenizer = Tokenizer::with_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cargo run --features=ipadic --example=userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です

API reference

The API reference is available. Please see the following URL:
