yishn / chinese-tokenizer

License: MIT
Tokenizes Chinese texts into words.

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to chinese-tokenizer

PaddleTokenizer
A DNN-based Chinese tokenizer implemented with PaddlePaddle
Stars: ✭ 14 (-80.56%)
Mutual labels:  tokenizer, chinese
ModernSecurityProtectionGuide
Modern Security Protection Guide
Stars: ✭ 72 (+0%)
Mutual labels:  chinese
date-extractor
Extract dates from text
Stars: ✭ 58 (-19.44%)
Mutual labels:  chinese
Text-Classification-LSTMs-PyTorch
A baseline LSTM-based model for text classification, implemented in PyTorch and demonstrated on a Tweets dataset from Kaggle.
Stars: ✭ 45 (-37.5%)
Mutual labels:  tokenizer
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Stars: ✭ 26 (-63.89%)
Mutual labels:  tokenizer
eslint-config-mingelz
A shared ESLint configuration with complete Chinese comments.
Stars: ✭ 15 (-79.17%)
Mutual labels:  chinese
exhentai-tags-chinese-translation
Chinese translations of all E-Hentai/ExHentai tags
Stars: ✭ 273 (+279.17%)
Mutual labels:  chinese
dialectID siam
Dialect identification using Siamese network
Stars: ✭ 15 (-79.17%)
Mutual labels:  words
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Stars: ✭ 31 (-56.94%)
Mutual labels:  tokenizer
NLPDataAugmentation
Chinese NLP Data Augmentation, BERT Contextual Augmentation
Stars: ✭ 94 (+30.56%)
Mutual labels:  chinese
next-qrcode
React hooks for generating QRCode for your next React apps.
Stars: ✭ 87 (+20.83%)
Mutual labels:  chinese
Email-newsletter-RSS
A curated collection of email 📧 newsletters and RSS news feeds
Stars: ✭ 1,225 (+1601.39%)
Mutual labels:  chinese
embedding study
Learning character embeddings from pretrained Chinese models; evaluating BERT and ELMo on Chinese text
Stars: ✭ 94 (+30.56%)
Mutual labels:  chinese
grasp
Essential NLP & ML, short & fast pure Python code
Stars: ✭ 58 (-19.44%)
Mutual labels:  tokenizer
vocascan-frontend
A highly configurable vocabulary trainer
Stars: ✭ 26 (-63.89%)
Mutual labels:  words
tensorflow-chatbot-chinese
A web chatbot: TensorFlow implementation of a seq2seq model with Bahdanau attention and pretrained Word2Vec embeddings
Stars: ✭ 50 (-30.56%)
Mutual labels:  chinese
Tokenizer
A tokenizer for Icelandic text
Stars: ✭ 27 (-62.5%)
Mutual labels:  tokenizer
chinese-learner
A desktop web application for learning Mandarin Chinese and its character stroke order.
Stars: ✭ 22 (-69.44%)
Mutual labels:  chinese
Vanhiupun.github.io
🏖️ Vanhiupun's Awesome Site: another theme for elegant writers, with a modern flat style and a beautiful night/dark mode.
Stars: ✭ 57 (-20.83%)
Mutual labels:  chinese
ime.vim
A Vim input method engine
Stars: ✭ 74 (+2.78%)
Mutual labels:  chinese

chinese-tokenizer

A simple algorithm that tokenizes Chinese text into words using CC-CEDICT. You can try it out on the demo page; the code for the demo page lives in the gh-pages branch of this repository.

How this works

This tokenizer uses a simple greedy algorithm: starting at the beginning of the input, it repeatedly matches the longest entry in the CC-CEDICT dictionary, emits it as a token, and continues right after it, as sketched below.
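
A rough sketch of this strategy (not the library's actual code), assuming the dictionary has been loaded into a Set of words and that maxLen holds the length of its longest entry:

// Rough sketch of greedy longest-match tokenization. The real
// library additionally records positions and dictionary matches.
function greedyTokenize(text, dict, maxLen) {
  const tokens = []
  let i = 0

  while (i < text.length) {
    // Try the longest window first, then shrink it until a
    // dictionary word is found; fall back to a single character.
    let len = Math.min(maxLen, text.length - i)
    while (len > 1 && !dict.has(text.slice(i, i + len))) len--

    tokens.push(text.slice(i, i + len))
    i += len
  }

  return tokens
}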

Installation

Use npm to install:

npm install chinese-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data (the cedict_ts.u8 file).

// Build a tokenize function from a local CC-CEDICT file
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

// Both simplified and traditional input are supported
console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))

Output:

[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]

API

chineseTokenizer.loadFile(path)

Reads the CC-CEDICT file from the given path and returns a tokenize function based on the dictionary.

chineseTokenizer.load(content)

Parses the CC-CEDICT data in the content string and returns a tokenize function based on the dictionary.
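
This is useful when the dictionary data does not come from the file system (for example, when it is bundled or fetched). A minimal sketch, assuming the data has already been read into a string:

const fs = require('fs')
const chineseTokenizer = require('chinese-tokenizer')

// Read the CC-CEDICT data yourself, then hand the string to load()
const content = fs.readFileSync('./cedict_ts.u8', 'utf8')
const tokenize = chineseTokenizer.load(content)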

tokenize(text)

Tokenizes the given text string and returns an array of tokens of the following form (a small usage sketch follows):

{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
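
For example, a small helper (hypothetical, not part of the library) can use these fields to gloss a sentence with the pretty pinyin of each token's first dictionary match:

// Hypothetical helper built on the documented token shape
const gloss = text => tokenize(text)
  .map(token => token.matches.length > 0
    ? `${token.text} [${token.matches[0].pinyinPretty}]`
    : token.text)
  .join(' ')

console.log(gloss('我是中国人。'))
// => 我 [wǒ] 是 [shì] 中国人 [Zhōng guó rén] 。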