benywon / ChineseBert

Licence: other
A Chinese BERT model built specifically for question answering.

Programming Languages

python
shell

ChineseBert

This is a Chinese BERT model built specifically for question answering. We provide two models: a large model, which is a 16-layer Transformer with hidden size 1024, and a small model with 8 layers and hidden size 512. Our implementation differs from the original paper (https://arxiv.org/abs/1810.04805) in that we replace the position embedding with an LSTM, which shows advantages when the text length varies a lot.
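
To make the architectural change concrete, here is a minimal PyTorch sketch of an input layer that gets word order from an LSTM instead of from position embeddings. It is an illustration under assumptions, not the repository's code: the class name `LSTMPositionEncoder`, the bidirectional LSTM, the residual connection, the layer normalization, and the 8 attention heads are all hypothetical choices. Only the vocabulary size (35,000) and the small model's hidden size (512) come from this README.

```python
import torch
import torch.nn as nn


class LSTMPositionEncoder(nn.Module):
    """Token embedding followed by an LSTM, used in place of position embeddings.

    Hypothetical sketch: the real model in main.py may wire this differently.
    """

    def __init__(self, vocab_size: int, hidden_size: int, pad_id: int = 0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_id)
        # Assumption: bidirectional LSTM with hidden_size // 2 units per
        # direction, so the concatenated output width equals hidden_size.
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)        # (batch, seq_len, hidden)
        order_aware, _ = self.lstm(x)    # (batch, seq_len, hidden)
        # No position table is added: order information comes from the
        # recurrence, so there is no fixed maximum length at this layer.
        return self.norm(x + order_aware)


# Small-model dimensions from the README: vocab 35,000, hidden size 512.
encoder = LSTMPositionEncoder(vocab_size=35000, hidden_size=512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

tokens = torch.randint(1, 35000, (2, 17))   # 2 sequences of 17 token ids
hidden = layer(encoder(tokens))
print(hidden.shape)  # torch.Size([2, 17, 512])
```

Because the recurrence has no lookup table indexed by position, the same layer accepts short and long inputs alike, which is consistent with the advantage claimed above for texts of widely varying length.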

It currently runs on Python 3 and PyTorch.


#Stats:

Data: 200M Chinese internet question-answer pairs.

Tokenizer: we use the SentencePiece tokenizer with a vocabulary size of 35,000.

We train both the large and the small model for 2M steps and did not observe overfitting.

The large model takes 12 days per epoch on 8 NVLink-connected V100 GPUs. The small model takes 2 days per epoch on the same hardware.
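
As a rough sanity check on these figures, the short calculation below converts days per epoch into throughput. It assumes that one epoch is a single pass over all 200M pairs, which the README does not state explicitly; the helper name `pairs_per_second` is ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PAIRS_PER_EPOCH = 200_000_000  # assumption: one epoch = one pass over all pairs
NUM_GPUS = 8


def pairs_per_second(days_per_epoch: float) -> float:
    """Average question-answer pairs processed per second across all GPUs."""
    return PAIRS_PER_EPOCH / (days_per_epoch * SECONDS_PER_DAY)


for name, days in (("large", 12), ("small", 2)):
    total = pairs_per_second(days)
    print(f"{name}: {total:.0f} pairs/s overall, "
          f"{total / NUM_GPUS:.0f} pairs/s per GPU")
# large: 193 pairs/s overall, 24 pairs/s per GPU
# small: 1157 pairs/s overall, 145 pairs/s per GPU
```

Under that assumption the small model is about six times faster per pair than the large one.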


#Usage:

Feed the model a Chinese question-answer pair to get the combined representation.

Refer to main.py for more detail.

The model has been tested with sequence lengths below 1024.
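
One way to respect the tested length limit is to truncate when joining the question and the answer, as in the sketch below. It is an assumption-laden illustration: the BERT-style `[CLS] question [SEP] answer [SEP]` layout, the special-token ids `cls_id` and `sep_id`, and the function `pack_pair` are hypothetical and may not match what main.py does.

```python
from typing import List

MAX_LEN = 1024  # the README reports testing only below this length


def pack_pair(question_ids: List[int], answer_ids: List[int],
              cls_id: int = 1, sep_id: int = 2,
              max_len: int = MAX_LEN) -> List[int]:
    """Join question and answer ids into one sequence shorter than max_len.

    cls_id and sep_id are placeholder ids, not the real vocabulary entries.
    """
    if max_len < 5:
        raise ValueError("max_len must leave room for 3 special tokens")
    # Three slots go to special tokens, and the total must stay strictly
    # below max_len, so at most max_len - 4 ordinary tokens fit.
    budget = max_len - 4
    question_ids = question_ids[:budget]
    # The question is kept first; the answer gets whatever room remains.
    answer_ids = answer_ids[:budget - len(question_ids)]
    return [cls_id] + question_ids + [sep_id] + answer_ids + [sep_id]


print(pack_pair([11, 12, 13], [21, 22]))
# [1, 11, 12, 13, 2, 21, 22, 2]
```

Truncating the answer before the question is a design choice made here for simplicity; a different policy may suit data where answers carry more of the signal.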


Because the PyTorch model file is very large, download it from Google Drive via get_model.sh.
