yishn / chinese-tokenizer

License: MIT
Tokenizes Chinese texts into words.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to chinese-tokenizer

PaddleTokenizer
A DNN-based Chinese tokenizer implemented with PaddlePaddle
Stars: ✭ 14 (-80.56%)
Mutual labels:  tokenizer, chinese
ModernSecurityProtectionGuide
Modern Security Protection Guide
Stars: ✭ 72 (+0%)
Mutual labels:  chinese
date-extractor
Extract dates from text
Stars: ✭ 58 (-19.44%)
Mutual labels:  chinese
Text-Classification-LSTMs-PyTorch
A baseline LSTM-based model for text classification, implemented in PyTorch and demonstrated on a Tweets dataset from Kaggle.
Stars: ✭ 45 (-37.5%)
Mutual labels:  tokenizer
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Stars: ✭ 26 (-63.89%)
Mutual labels:  tokenizer
eslint-config-mingelz
A shared ESLint configuration with complete Chinese comments.
Stars: ✭ 15 (-79.17%)
Mutual labels:  chinese
exhentai-tags-chinese-translation
Chinese translations of all E-Hentai/ExHentai tags
Stars: ✭ 273 (+279.17%)
Mutual labels:  chinese
dialectID siam
Dialect identification using Siamese network
Stars: ✭ 15 (-79.17%)
Mutual labels:  words
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Stars: ✭ 31 (-56.94%)
Mutual labels:  tokenizer
NLPDataAugmentation
Chinese NLP Data Augmentation, BERT Contextual Augmentation
Stars: ✭ 94 (+30.56%)
Mutual labels:  chinese
next-qrcode
React hooks for generating QRCode for your next React apps.
Stars: ✭ 87 (+20.83%)
Mutual labels:  chinese
Email-newsletter-RSS
A curated collection of email 📧 newsletters and RSS news feeds
Stars: ✭ 1,225 (+1601.39%)
Mutual labels:  chinese
embedding study
Learning character embeddings from pretrained Chinese models; evaluating BERT and ELMo on Chinese text
Stars: ✭ 94 (+30.56%)
Mutual labels:  chinese
grasp
Essential NLP & ML, short & fast pure Python code
Stars: ✭ 58 (-19.44%)
Mutual labels:  tokenizer
vocascan-frontend
A highly configurable vocabulary trainer
Stars: ✭ 26 (-63.89%)
Mutual labels:  words
tensorflow-chatbot-chinese
A web chatbot: TensorFlow implementation of a seq2seq model with Bahdanau attention and pretrained Word2Vec embeddings
Stars: ✭ 50 (-30.56%)
Mutual labels:  chinese
Tokenizer
A tokenizer for Icelandic text
Stars: ✭ 27 (-62.5%)
Mutual labels:  tokenizer
chinese-learner
A desktop web application for learning Mandarin Chinese and its character stroke order.
Stars: ✭ 22 (-69.44%)
Mutual labels:  chinese
Vanhiupun.github.io
🏖️ Vanhiupun's Awesome Site: another theme for elegant writers, with a modern flat style and a beautiful night/dark mode.
Stars: ✭ 57 (-20.83%)
Mutual labels:  chinese
ime.vim
A Vim input method engine
Stars: ✭ 74 (+2.78%)
Mutual labels:  chinese

chinese-tokenizer

A simple algorithm that tokenizes Chinese text into words using CC-CEDICT. You can try it out on the demo page; the code for the demo page lives in the gh-pages branch of this repository.

How this works

This tokenizer uses a simple greedy algorithm: starting at the beginning of the input, it repeatedly matches the longest entry in the CC-CEDICT dictionary, emits it as a token, and continues right after it, as sketched below.
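
A rough sketch of this strategy (not the library's actual code), assuming the dictionary has been loaded into a Set of words and that maxLen holds the length of its longest entry:

// Rough sketch of greedy longest-match tokenization. The real
// library additionally records positions and dictionary matches.
function greedyTokenize(text, dict, maxLen) {
  const tokens = []
  let i = 0

  while (i < text.length) {
    // Try the longest window first, then shrink it until a
    // dictionary word is found; fall back to a single character.
    let len = Math.min(maxLen, text.length - i)
    while (len > 1 && !dict.has(text.slice(i, i + len))) len--

    tokens.push(text.slice(i, i + len))
    i += len
  }

  return tokens
}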

Installation

Use npm to install:

npm install chinese-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data (the cedict_ts.u8 file).

// Build a tokenize function from a local CC-CEDICT file
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

// Both simplified and traditional input are supported
console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))

Output:

[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]

API

chineseTokenizer.loadFile(path)

Reads the CC-CEDICT file from the given path and returns a tokenize function based on the dictionary.

chineseTokenizer.load(content)

Parses the CC-CEDICT data in the content string and returns a tokenize function based on the dictionary.
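
This is useful when the dictionary data does not come from the file system (for example, when it is bundled or fetched). A minimal sketch, assuming the data has already been read into a string:

const fs = require('fs')
const chineseTokenizer = require('chinese-tokenizer')

// Read the CC-CEDICT data yourself, then hand the string to load()
const content = fs.readFileSync('./cedict_ts.u8', 'utf8')
const tokenize = chineseTokenizer.load(content)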

tokenize(text)

Tokenizes the given text string and returns an array of tokens of the following form (a small usage sketch follows):

{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
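
For example, a small helper (hypothetical, not part of the library) can use these fields to gloss a sentence with the pretty pinyin of each token's first dictionary match:

// Hypothetical helper built on the documented token shape
const gloss = text => tokenize(text)
  .map(token => token.matches.length > 0
    ? `${token.text} [${token.matches[0].pinyinPretty}]`
    : token.text)
  .join(' ')

console.log(gloss('我是中国人。'))
// => 我 [wǒ] 是 [shì] 中国人 [Zhōng guó rén] 。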