
Hoiy / berserker

License: MIT
Berserker - BERt chineSE woRd toKenizER

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to berserker

Tianchi2020ChineseMedicineQuestionGeneration
2020 Alibaba Cloud Tianchi Big Data Competition: Traditional Chinese Medicine Literature Question Generation Challenge
Stars: ✭ 20 (+17.65%)
Mutual labels:  sequence-to-sequence, bert
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+39052.94%)
Mutual labels:  chinese-nlp, bert
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (+258.82%)
Mutual labels:  sequence-to-sequence, bert
text-generation-transformer
text generation based on transformer
Stars: ✭ 36 (+111.76%)
Mutual labels:  sequence-to-sequence, bert
Lac
Baidu NLP: word segmentation, POS tagging, named entity recognition, and word importance
Stars: ✭ 2,792 (+16323.53%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
HugsVision
HugsVision is an easy-to-use HuggingFace wrapper for state-of-the-art computer vision
Stars: ✭ 154 (+805.88%)
Mutual labels:  bert, state-of-the-art
bert tokenization for java
This is a Java version of the Chinese tokenization described in BERT.
Stars: ✭ 39 (+129.41%)
Mutual labels:  chinese-nlp, bert
bert quora question pairs
BERT Model Fine-tuning on Quora Questions Pairs
Stars: ✭ 28 (+64.71%)
Mutual labels:  bert, tpu
G2pc
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Stars: ✭ 155 (+811.76%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Chinesenlp
Datasets, SOTA results of every fields of Chinese NLP
Stars: ✭ 1,206 (+6994.12%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword extraction, key sentence extraction, and summary extraction implemented with the TEXTRANK algorithm. Jcseg has a built-in HTTP server and search modules for recent versions of Lucene, Solr, and Elasticsearch.
Stars: ✭ 754 (+4335.29%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+1005.88%)
Mutual labels:  tokenizer, bert
Nlp4han
Chinese NLP toolkit [sentence splitting / word segmentation / POS tagging / chunking / syntactic parsing / semantic analysis / NER / n-grams / HMM / pronoun resolution / sentiment analysis / spell checking]
Stars: ✭ 206 (+1111.76%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and developed in ANSI C. Its completely modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Stars: ✭ 313 (+1741.18%)
Mutual labels:  tokenizer, chinese-word-segmentation
Transformer-Transducer
PyTorch implementation of "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss" (ICASSP 2020)
Stars: ✭ 61 (+258.82%)
Mutual labels:  sequence-to-sequence
Chinese-automatic-speech-recognition
Chinese speech recognition
Stars: ✭ 147 (+764.71%)
Mutual labels:  chinese-nlp
efficientnet-jax
EfficientNet, MobileNetV3, MobileNetV2, MixNet, etc in JAX w/ Flax Linen and Objax
Stars: ✭ 114 (+570.59%)
Mutual labels:  tpu
HE2LaTeX
Converting handwritten equations to LaTeX
Stars: ✭ 84 (+394.12%)
Mutual labels:  sequence-to-sequence
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+341.18%)
Mutual labels:  bert
classifier multi label
Multi-label text classification with BERT and ALBERT
Stars: ✭ 127 (+647.06%)
Mutual labels:  bert

Berserker

Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's BERT model.

Installation

pip install basaka

Usage

import berserker

berserker.load_model() # A one-off download of the pretrained model
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']

Benchmark

The table below shows that Berserker achieved state-of-the-art F1 measure on the SIGHAN 2005 dataset.

The results below were obtained by training for 15 epochs on each dataset with a batch size of 64.

|                    | PKU  | CITYU | MSR  | AS   |
|--------------------|------|-------|------|------|
| Liu et al. (2016)  | 96.8 | --    | 97.3 | --   |
| Yang et al. (2017) | 96.3 | 96.9  | 97.5 | 95.7 |
| Zhou et al. (2017) | 96.0 | --    | 97.8 | --   |
| Cai et al. (2017)  | 95.8 | 95.6  | 97.1 | --   |
| Chen et al. (2017) | 94.3 | 95.6  | 96.0 | 94.6 |
| Wang and Xu (2017) | 96.5 | --    | 98.0 | --   |
| Ma et al. (2018)   | 96.1 | 97.2  | 98.1 | 96.2 |
| Berserker          | 96.6 | 97.1  | 98.4 | 96.5 |

Reference: Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs
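For context, segmentation F1 is conventionally computed by comparing the gold and predicted segmentations as sets of character spans and counting exact matches. The sketch below is a minimal illustration of that metric; the function names are our own, not part of Berserker or the official SIGHAN scorer.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for t in tokens:
        spans.add((pos, pos + len(t)))
        pos += len(t)
    return spans

def seg_f1(gold_tokens, pred_tokens):
    """Precision, recall, and F1 over exactly matching word spans."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    tp = len(gold & pred)  # spans segmented identically in both
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, `seg_f1(['我', '喜歡', '貓'], ['我', '喜', '歡', '貓'])` gives a precision of 0.5 and a recall of 2/3, since only the first and last words match exactly.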

Limitation

Since Berserker is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.

Currently the default model is trained on the SIGHAN 2005 PKU dataset. We plan to release more pretrained models in the future.

Architecture

Berserker is fine-tuned on TPU from the pretrained Chinese BERT model. A single dense layer is applied to every token to produce a sequence of outputs in [0, 1], where 1 denotes a split.
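The decoding step implied above can be sketched as follows: given a per-character split probability from the dense layer, emit a word boundary after each character whose probability crosses a threshold. The probability values and function name here are illustrative assumptions, not Berserker's actual API.

```python
def decode(text, split_probs, threshold=0.5):
    """Turn per-character split probabilities into tokens.

    A probability >= threshold after character i ends the current word.
    """
    tokens, current = [], ''
    for ch, p in zip(text, split_probs):
        current += ch
        if p >= threshold:
            tokens.append(current)
            current = ''
    if current:  # flush any trailing characters without a final split
        tokens.append(current)
    return tokens

# Hypothetical dense-layer outputs for a three-character input:
decode('姑姑想', [0.1, 0.9, 0.8])  # → ['姑姑', '想']
```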

Training

We provide the source code for training under the trainer subdirectory. Feel free to contact me if you need any help reproducing the results.

Bonus Video

Yachae!! BERSERKER!!
