DCjanus / cang-jie

Licence: MIT License

Chinese tokenizer for tantivy, based on jieba-rs

Projects that are alternatives of or similar to cang-jie

High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Its fully modular implementation makes it easy to embed in other programs such as MySQL, PostgreSQL, and PHP.

Stars: ✭ 313 (+552.08%)

Mutual labels: tokenizer, full-text-search

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Stars: ✭ 32 (-33.33%)

Mutual labels: tokenizer

elasticsearch-plugins

Some native scoring script plugins for elasticsearch

Stars: ✭ 30 (-37.5%)

Mutual labels: tokenizer

tokenizer

Tokenize CSS according to the CSS Syntax

Stars: ✭ 52 (+8.33%)

Mutual labels: tokenizer

berserker

Berserker - BERt chineSE woRd toKenizER

Stars: ✭ 17 (-64.58%)

Mutual labels: tokenizer

poyonga

Python Groonga Client

Stars: ✭ 19 (-60.42%)

Mutual labels: full-text-search

farasapy

A Python implementation of Farasa toolkit

Stars: ✭ 69 (+43.75%)

Mutual labels: tokenizer

text2text

Text2Text: Cross-lingual natural language processing and generation toolkit

Stars: ✭ 188 (+291.67%)

Mutual labels: tokenizer

mystem-scala

Morphological analyzer `mystem` (Russian language) wrapper for JVM languages

Stars: ✭ 21 (-56.25%)

Mutual labels: tokenizer

ilmulti

Tooling to play around with multilingual machine translation for Indian Languages.

Stars: ✭ 19 (-60.42%)

Mutual labels: tokenizer

vscode-blockman

VSCode extension to highlight nested code blocks

Stars: ✭ 233 (+385.42%)

Mutual labels: tokenizer

wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

Stars: ✭ 51 (+6.25%)

Mutual labels: tokenizer

gatsby-plugin-lunr

Gatsby plugin for full text search implementation based on lunr client-side index. Supports multilanguage search.

Stars: ✭ 69 (+43.75%)

Mutual labels: full-text-search

jargon

Tokenizers and lemmatizers for Go

Stars: ✭ 98 (+104.17%)

Mutual labels: tokenizer

bredon

A modern CSS value compiler in JavaScript

Stars: ✭ 39 (-18.75%)

Mutual labels: tokenizer

neural tokenizer

Tokenize English sentences using neural networks.

Stars: ✭ 64 (+33.33%)

Mutual labels: tokenizer

wink-bm25-text-search

Fast Full Text Search based on BM25

Stars: ✭ 44 (-8.33%)

Mutual labels: full-text-search

paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents

Stars: ✭ 4,840 (+9983.33%)

Mutual labels: full-text-search

PaddleTokenizer

A Chinese word segmentation engine based on deep neural networks, implemented with PaddlePaddle

Stars: ✭ 14 (-70.83%)

Mutual labels: tokenizer

lucilla

Fast, efficient, in-memory Full Text Search for Kotlin

Stars: ✭ 102 (+112.5%)

Mutual labels: full-text-search

cang-jie(仓颉)

A Chinese tokenizer for tantivy, based on jieba-rs.

Currently, only UTF-8 input is supported.
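Because of the UTF-8 restriction, text arriving in another encoding such as GBK has to be converted before it is indexed. The following is a minimal sketch, using only the Rust standard library, of checking raw bytes first. The helper name `as_utf8_text` is hypothetical and is not part of cang-jie or tantivy.

```rust
use std::str;

/// Returns the text if `bytes` is valid UTF-8, otherwise `None`.
/// Hypothetical helper for illustration; not part of cang-jie.
fn as_utf8_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Prints whether the given bytes could be passed on to a UTF-8-only tokenizer.
fn report(label: &str, bytes: &[u8]) {
    match as_utf8_text(bytes) {
        Some(text) => println!("{}: valid UTF-8, ok to tokenize: {}", label, text),
        None => println!("{}: not valid UTF-8, convert before indexing", label),
    }
}

fn main() {
    // "中文" encoded as UTF-8.
    let utf8_bytes = "中文".as_bytes();
    // The same two characters encoded as GBK (D6 D0 CE C4).
    let gbk_bytes: [u8; 4] = [0xD6, 0xD0, 0xCE, 0xC4];

    report("utf8", utf8_bytes);
    report("gbk", &gbk_bytes);
}
```

Values that are already a Rust `&str` or `String` are guaranteed to be UTF-8, so a check like this only matters for raw bytes read from files or the network.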

Example

    let mut schema_builder = SchemaBuilder::default();
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer(CANG_JIE) // Set custom tokenizer
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    // ... Some code
    let index = Index::create(RAMDirectory::create(), schema.clone())?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::empty()), // empty dictionary
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register(CANG_JIE, tokenizer);
    // ... Some code

Full example
