DCjanus / cang-jie

Licence: MIT License
Chinese tokenizer for tantivy, based on jieba-rs

Programming language: Rust

Projects that are alternatives of or similar to cang-jie

Friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular, so it can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Stars: ✭ 313 (+552.08%)
Mutual labels:  tokenizer, full-text-search
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (-33.33%)
Mutual labels:  tokenizer
elasticsearch-plugins
Some native scoring script plugins for elasticsearch
Stars: ✭ 30 (-37.5%)
Mutual labels:  tokenizer
tokenizer
Tokenize CSS according to the CSS Syntax
Stars: ✭ 52 (+8.33%)
Mutual labels:  tokenizer
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-64.58%)
Mutual labels:  tokenizer
poyonga
Python Groonga Client
Stars: ✭ 19 (-60.42%)
Mutual labels:  full-text-search
farasapy
A Python implementation of Farasa toolkit
Stars: ✭ 69 (+43.75%)
Mutual labels:  tokenizer
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+291.67%)
Mutual labels:  tokenizer
mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
Stars: ✭ 21 (-56.25%)
Mutual labels:  tokenizer
ilmulti
Tooling to play around with multilingual machine translation for Indian Languages.
Stars: ✭ 19 (-60.42%)
Mutual labels:  tokenizer
vscode-blockman
VSCode extension to highlight nested code blocks
Stars: ✭ 233 (+385.42%)
Mutual labels:  tokenizer
wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Stars: ✭ 51 (+6.25%)
Mutual labels:  tokenizer
gatsby-plugin-lunr
Gatsby plugin for full text search implementation based on lunr client-side index. Supports multilanguage search.
Stars: ✭ 69 (+43.75%)
Mutual labels:  full-text-search
jargon
Tokenizers and lemmatizers for Go
Stars: ✭ 98 (+104.17%)
Mutual labels:  tokenizer
bredon
A modern CSS value compiler in JavaScript
Stars: ✭ 39 (-18.75%)
Mutual labels:  tokenizer
neural tokenizer
Tokenize English sentences using neural networks.
Stars: ✭ 64 (+33.33%)
Mutual labels:  tokenizer
wink-bm25-text-search
Fast Full Text Search based on BM25
Stars: ✭ 44 (-8.33%)
Mutual labels:  full-text-search
paperless-ng
A supercharged version of paperless: scan, index and archive all your physical documents
Stars: ✭ 4,840 (+9983.33%)
Mutual labels:  full-text-search
PaddleTokenizer
A Chinese word segmentation engine based on deep neural networks, implemented with PaddlePaddle
Stars: ✭ 14 (-70.83%)
Mutual labels:  tokenizer
lucilla
Fast, efficient, in-memory Full Text Search for Kotlin
Stars: ✭ 102 (+112.5%)
Mutual labels:  full-text-search

cang-jie(仓颉)

A Chinese tokenizer for tantivy, based on jieba-rs.

Currently, only UTF-8 input is supported.

Example

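    use std::sync::Arc;

    use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
    use jieba_rs::Jieba;
    use tantivy::{
        directory::RAMDirectory,
        schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions},
        Index,
    };

    let mut schema_builder = SchemaBuilder::default();
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer(CANG_JIE) // Use the cang-jie tokenizer for this field
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    // ... Some code

    let index = Index::create(RAMDirectory::create(), schema.clone())?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::empty()), // empty dictionary
        option: TokenizerOption::Unicode,
    };
    // Register the tokenizer under the same name the schema refers to
    index.tokenizers().register(CANG_JIE, tokenizer);
    // ... Some code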
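
The example above pairs an empty dictionary with `TokenizerOption::Unicode`. As a rough illustration of what per-character tokenization over UTF-8 text can look like, here is a standard-library-only sketch. It assumes that this mode emits one token per Unicode scalar value with byte offsets into the source string; that is an assumption about the behaviour, not code taken from cang-jie. `CharToken` and `split_by_char` are hypothetical names invented for this sketch.

```rust
/// One token: byte offsets into the source text plus the matching slice.
/// (Hypothetical type for illustration; not part of cang-jie or tantivy.)
#[derive(Debug, PartialEq)]
pub struct CharToken<'a> {
    pub offset_from: usize,
    pub offset_to: usize,
    pub text: &'a str,
}

/// Split `text` into one token per Unicode scalar value.
/// Offsets are byte positions, so a CJK character encoded in three
/// UTF-8 bytes spans three offset units.
pub fn split_by_char(text: &str) -> Vec<CharToken<'_>> {
    text.char_indices()
        .map(|(start, c)| {
            let end = start + c.len_utf8();
            CharToken {
                offset_from: start,
                offset_to: end,
                text: &text[start..end],
            }
        })
        .collect()
}
```

Calling `split_by_char("仓颉ab")` yields four tokens. Because the offsets count bytes rather than characters, the input has to be valid UTF-8 for the slices to line up, which fits the UTF-8-only restriction noted earlier.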

Full example
