
zhongbin1 / bert_tokenization_for_java

License: Apache-2.0
This is a Java version of the Chinese tokenization described in BERT.

Programming Languages

java

Projects that are alternatives to or similar to bert_tokenization_for_java

berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-56.41%)
Mutual labels:  chinese-nlp, bert
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+16966.67%)
Mutual labels:  chinese-nlp, bert
DeepNER
An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models.
Stars: ✭ 9 (-76.92%)
Mutual labels:  bert
BERT-Chinese-Couplet
BERT for automatic Chinese couplet generation
Stars: ✭ 19 (-51.28%)
Mutual labels:  bert
polycash
The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.
Stars: ✭ 24 (-38.46%)
Mutual labels:  tokenization
Self-Supervised-Embedding-Fusion-Transformer
The code for our IEEE ACCESS (2020) paper Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion.
Stars: ✭ 57 (+46.15%)
Mutual labels:  bert
ParsBigBird
Persian Bert For Long-Range Sequences
Stars: ✭ 58 (+48.72%)
Mutual labels:  bert
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (+2.56%)
Mutual labels:  bert
bern
A neural named entity recognition and multi-type normalization tool for biomedical text mining
Stars: ✭ 151 (+287.18%)
Mutual labels:  bert
spacy russian tokenizer
Custom Russian tokenizer for spaCy
Stars: ✭ 35 (-10.26%)
Mutual labels:  tokenization
golgotha
Contextualised Embeddings and Language Modelling using BERT and Friends using R
Stars: ✭ 39 (+0%)
Mutual labels:  bert
LMMS
Language Modelling Makes Sense - WSD (and more) with Contextual Embeddings
Stars: ✭ 79 (+102.56%)
Mutual labels:  bert
ganbert-pytorch
Enhancing the BERT training with Semi-supervised Generative Adversarial Networks in Pytorch/HuggingFace
Stars: ✭ 60 (+53.85%)
Mutual labels:  bert
transformer-models
Deep Learning Transformer models in MATLAB
Stars: ✭ 90 (+130.77%)
Mutual labels:  bert
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short text, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+287.18%)
Mutual labels:  bert
SQUAD2.Q-Augmented-Dataset
Augmented version of SQUAD 2.0 for Questions
Stars: ✭ 31 (-20.51%)
Mutual labels:  bert
muse-as-service
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.
Stars: ✭ 45 (+15.38%)
Mutual labels:  bert
bert-AAD
Adversarial Adaptation with Distillation for BERT Unsupervised Domain Adaptation
Stars: ✭ 27 (-30.77%)
Mutual labels:  bert
ark-nlp
A private nlp coding package, which quickly implements the SOTA solutions.
Stars: ✭ 232 (+494.87%)
Mutual labels:  bert
KAREN
KAREN: Unifying Hatespeech Detection and Benchmarking
Stars: ✭ 18 (-53.85%)
Mutual labels:  bert

This is a Java version of the Chinese tokenization described in BERT, including basic tokenization and wordpiece tokenization.

Motivation

In production, we usually deploy BERT-related models with TensorFlow Serving for high performance and flexibility. However, the client application may not be developed in Python, so the tokenization module has to be reimplemented on the client side.

Usage

Just run Preprocess.java to get the result. It now supports both single sentences and sentence pairs.
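The wordpiece step described in the BERT paper is a greedy longest-match-first lookup against a vocabulary. The class and toy vocabulary below are an illustrative sketch of that algorithm, not this project's actual API:

```java
import java.util.*;

public class WordpieceSketch {
    // Greedy longest-match-first wordpiece split, as described in the BERT paper.
    // Subword continuations are prefixed with "##"; unknown words map to [UNK].
    static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, then shrink from the right.
            while (start < end) {
                String sub = word.substring(start, end);
                if (start > 0) sub = "##" + sub;  // continuation marker
                if (vocab.contains(sub)) { match = sub; break; }
                end--;
            }
            if (match == null) return Collections.singletonList("[UNK]");
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(wordpiece("unaffable", vocab)); // [un, ##aff, ##able]
    }
}
```

In the real tokenizer the vocabulary is loaded from the model's vocab.txt, and basic tokenization (whitespace/punctuation splitting and per-character splitting of CJK text) runs before this step.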

Moreover, for Chinese natural language processing, we add full-width to half-width character conversion and uppercase-to-lowercase conversion.
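Full-width to half-width normalization maps the full-width ASCII variants (U+FF01 through U+FF5E) to their ASCII counterparts by subtracting the fixed offset 0xFEE0, and maps the ideographic space U+3000 to a plain space. A minimal sketch combining this with lowercasing (the class name is hypothetical, not the project's API):

```java
public class WidthNormalizer {
    // Convert full-width ASCII variants (U+FF01..U+FF5E) and the
    // ideographic space (U+3000) to half-width, then lowercase.
    static String toHalfWidthLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));  // fixed offset to ASCII
            } else {
                sb.append(c);
            }
        }
        return sb.toString().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(toHalfWidthLower("ＢＥＲＴ！")); // bert!
    }
}
```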

Reporting issues

Please let me know if you encounter any problems.
