
Hoiy / berserker

License: MIT
Berserker - BERt chineSE woRd toKenizER

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to berserker

Tianchi2020ChineseMedicineQuestionGeneration
2020 Alibaba Cloud Tianchi Big Data Competition: Traditional Chinese Medicine Literature Question Generation Challenge
Stars: ✭ 20 (+17.65%)
Mutual labels:  sequence-to-sequence, bert
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+39052.94%)
Mutual labels:  chinese-nlp, bert
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (+258.82%)
Mutual labels:  sequence-to-sequence, bert
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (+111.76%)
Mutual labels:  sequence-to-sequence, bert
Lac
Baidu NLP: word segmentation, POS tagging, named entity recognition, and word importance
Stars: ✭ 2,792 (+16323.53%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
HugsVision
HugsVision is an easy-to-use HuggingFace wrapper for state-of-the-art computer vision
Stars: ✭ 154 (+805.88%)
Mutual labels:  bert, state-of-the-art
bert tokenization for java
This is a Java version of the Chinese tokenization described in BERT.
Stars: ✭ 39 (+129.41%)
Mutual labels:  chinese-nlp, bert
bert quora question pairs
BERT Model Fine-tuning on Quora Questions Pairs
Stars: ✭ 28 (+64.71%)
Mutual labels:  bert, tpu
G2pc
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Stars: ✭ 155 (+811.76%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Chinesenlp
Datasets, SOTA results of every fields of Chinese NLP
Stars: ✭ 1,206 (+6994.12%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword extraction, key sentence extraction, and summary extraction implemented with the TEXTRANK algorithm. Jcseg has a built-in HTTP server and search modules for recent versions of Lucene, Solr, and Elasticsearch.
Stars: ✭ 754 (+4335.29%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+1005.88%)
Mutual labels:  tokenizer, bert
Nlp4han
Chinese NLP toolkit [sentence splitting / word segmentation / POS tagging / chunking / syntactic parsing / semantic analysis / NER / n-grams / HMM / pronoun resolution / sentiment analysis / spell checking]
Stars: ✭ 206 (+1111.76%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and developed in ANSI C. Its completely modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Stars: ✭ 313 (+1741.18%)
Mutual labels:  tokenizer, chinese-word-segmentation
Transformer-Transducer
PyTorch implementation of "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss" (ICASSP 2020)
Stars: ✭ 61 (+258.82%)
Mutual labels:  sequence-to-sequence
Chinese-automatic-speech-recognition
Chinese speech recognition
Stars: ✭ 147 (+764.71%)
Mutual labels:  chinese-nlp
efficientnet-jax
EfficientNet, MobileNetV3, MobileNetV2, MixNet, etc in JAX w/ Flax Linen and Objax
Stars: ✭ 114 (+570.59%)
Mutual labels:  tpu
HE2LaTeX
Converting handwritten equations to LaTeX
Stars: ✭ 84 (+394.12%)
Mutual labels:  sequence-to-sequence
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+341.18%)
Mutual labels:  bert
classifier multi label
Multi-label text classification with BERT and ALBERT
Stars: ✭ 127 (+647.06%)
Mutual labels:  bert

Berserker

Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's BERT model.

Installation

pip install basaka

Usage

import berserker

berserker.load_model() # A one-off download of the pretrained model
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']

Benchmark

The table below shows that Berserker achieved state-of-the-art F1 measure on the SIGHAN 2005 dataset.

The results below were obtained by training for 15 epochs on each dataset with a batch size of 64.

|                    | PKU  | CITYU | MSR  | AS   |
|--------------------|------|-------|------|------|
| Liu et al. (2016)  | 96.8 | --    | 97.3 | --   |
| Yang et al. (2017) | 96.3 | 96.9  | 97.5 | 95.7 |
| Zhou et al. (2017) | 96.0 | --    | 97.8 | --   |
| Cai et al. (2017)  | 95.8 | 95.6  | 97.1 | --   |
| Chen et al. (2017) | 94.3 | 95.6  | 96.0 | 94.6 |
| Wang and Xu (2017) | 96.5 | --    | 98.0 | --   |
| Ma et al. (2018)   | 96.1 | 97.2  | 98.1 | 96.2 |
| Berserker          | 96.6 | 97.1  | 98.4 | 96.5 |

Reference: Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs
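For context, segmentation F1 is conventionally computed by comparing the gold and predicted segmentations as sets of character spans and counting exact matches. The sketch below is a minimal illustration of that metric; the function names are our own, not part of Berserker or the official SIGHAN scorer.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for t in tokens:
        spans.add((pos, pos + len(t)))
        pos += len(t)
    return spans

def seg_f1(gold_tokens, pred_tokens):
    """Precision, recall, and F1 over exactly matching word spans."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    tp = len(gold & pred)  # spans segmented identically in both
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, `seg_f1(['我', '喜歡', '貓'], ['我', '喜', '歡', '貓'])` gives a precision of 0.5 and a recall of 2/3, since only the first and last words match exactly.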

Limitation

Since Berserker is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.

Currently the default model is trained on the SIGHAN 2005 PKU dataset. We plan to release more pretrained models in the future.

Architecture

Berserker is fine-tuned on TPU from the pretrained Chinese BERT model. A single dense layer is applied to every token to produce a sequence of outputs in [0, 1], where 1 denotes a split.
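The decoding step implied above can be sketched as follows: given a per-character split probability from the dense layer, emit a word boundary after each character whose probability crosses a threshold. The probability values and function name here are illustrative assumptions, not Berserker's actual API.

```python
def decode(text, split_probs, threshold=0.5):
    """Turn per-character split probabilities into tokens.

    A probability >= threshold after character i ends the current word.
    """
    tokens, current = [], ''
    for ch, p in zip(text, split_probs):
        current += ch
        if p >= threshold:
            tokens.append(current)
            current = ''
    if current:  # flush any trailing characters without a final split
        tokens.append(current)
    return tokens

# Hypothetical dense-layer outputs for a three-character input:
decode('姑姑想', [0.1, 0.9, 0.8])  # → ['姑姑', '想']
```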

Training

We provide the source code for training under the trainer subdirectory. Feel free to contact me if you need any help reproducing the results.

Bonus Video

Yachae!! BERSERKER!!
