
lanwuwei / BERTOverflow

Licence: other
A Pre-trained BERT on StackOverflow Corpus

Projects that are alternatives of or similar to BERTOverflow

DeepNER
An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models.
Stars: ✭ 9 (-77.5%)
Mutual labels:  named-entity-recognition, bert
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+277.5%)
Mutual labels:  named-entity-recognition, bert
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+365%)
Mutual labels:  named-entity-recognition, bert
bern
A neural named entity recognition and multi-type normalization tool for biomedical text mining
Stars: ✭ 151 (+277.5%)
Mutual labels:  named-entity-recognition, bert
Bert Bilstm Crf Ner
Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning And private Server services
Stars: ✭ 3,838 (+9495%)
Mutual labels:  named-entity-recognition, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+6195%)
Mutual labels:  named-entity-recognition, bert
OpenUE
OpenUE is a lightweight toolkit for knowledge graph extraction (An Open Toolkit for Universal Extraction from Text, published at EMNLP 2020: https://aclanthology.org/2020.emnlp-demos.1.pdf).
Stars: ✭ 274 (+585%)
Mutual labels:  named-entity-recognition, bert
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+112.5%)
Mutual labels:  named-entity-recognition, bert
knowledge-graph-nlp-in-action
From model training to deployment, a hands-on project for knowledge graphs and natural language processing (NLP). Uses TensorFlow, BERT + Bi-LSTM + CRF, Neo4j, and more, and covers tasks such as named entity recognition, text classification, information extraction, and relation extraction.
Stars: ✭ 58 (+45%)
Mutual labels:  named-entity-recognition, bert
Mt Dnn
Multi-Task Deep Neural Networks for Natural Language Understanding
Stars: ✭ 1,871 (+4577.5%)
Mutual labels:  named-entity-recognition, bert
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+5487.5%)
Mutual labels:  named-entity-recognition, bert
soddi
StackOverflow Data Dump Importer. Forked from https://bitbucket.org/bitpusher/soddi/ after the original author passed away.
Stars: ✭ 74 (+85%)
Mutual labels:  stackoverflow
wisdomify
A BERT-based reverse dictionary of Korean proverbs
Stars: ✭ 95 (+137.5%)
Mutual labels:  bert
Answerable
Recommendation system for Stack Overflow unanswered questions
Stars: ✭ 13 (-67.5%)
Mutual labels:  stackoverflow
lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
Stars: ✭ 75 (+87.5%)
Mutual labels:  named-entity-recognition
embedding study
Learning character embeddings from Chinese pre-trained models; evaluates the Chinese-language performance of BERT and ELMo.
Stars: ✭ 94 (+135%)
Mutual labels:  bert
NLPDataAugmentation
Chinese NLP Data Augmentation, BERT Contextual Augmentation
Stars: ✭ 94 (+135%)
Mutual labels:  bert
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (+195%)
Mutual labels:  bert
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+530%)
Mutual labels:  bert
AliceMind
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab
Stars: ✭ 1,479 (+3597.5%)
Mutual labels:  bert


BERTOverflow

This repository contains a BERT model pre-trained on StackOverflow data, which achieves state-of-the-art performance (with a CRF layer) on software-domain NER. The checkpoints can be downloaded here.
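
If you just want to use the released model, a minimal sketch with Hugging Face transformers looks like the following. It assumes the checkpoint is available in (or has been converted to) Hugging Face format; the hub id "jeniya/BERTOverflow" is the commonly referenced mirror and should be verified before use:

from transformers import AutoTokenizer, AutoModel

# The hub id below is an assumption; replace it with a local checkpoint directory if needed.
tokenizer = AutoTokenizer.from_pretrained("jeniya/BERTOverflow")
model = AutoModel.from_pretrained("jeniya/BERTOverflow")

inputs = tokenizer("How do I sort a HashMap by value in Java?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings, e.g. torch.Size([1, seq_len, 768])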

For further details, see the accompanying paper: Code and Named Entity Recognition in StackOverflow

Note: this is just a reference for BERT pre-training with your own data. First, download the original BERT codebase, then apply for TPU access through TFRC, and finally follow this README for BERT pre-training.

Data

We extract 152M sentences from StackOverflow questions and answers.
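
The exact extraction pipeline is not included here; the sketch below shows one way to produce a sentence-per-line file from the StackOverflow data dump (Posts.xml). The file names, HTML stripping, and sentence splitting are illustrative assumptions, not the authors' code:

import re
import html
import xml.etree.ElementTree as ET

TAG = re.compile(r'<[^>]+>')
SENT_SPLIT = re.compile(r'(?<=[.!?])\s+')

def post_sentences(path):
    # Posts.xml is a single <posts> element containing one <row .../> per question/answer.
    for _, row in ET.iterparse(path, events=('end',)):
        if row.tag != 'row':
            continue
        body = html.unescape(row.get('Body', ''))
        text = TAG.sub(' ', body)          # drop HTML markup
        for sent in SENT_SPLIT.split(text):
            sent = ' '.join(sent.split())
            if sent:
                yield sent
        row.clear()                        # keep memory bounded on the large dump

with open('stackoverflow_sentences.txt', 'w', encoding='utf-8') as out:
    for sent in post_sentences('Posts.xml'):
        out.write(sent + '\n')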

Vocabulary

We create an 80K cased WordPiece vocabulary with 2K different UNK symbols:

import tokenizers

# Train a cased (lowercase=False) WordPiece vocabulary on the raw StackOverflow text.
bwpt = tokenizers.BertWordPieceTokenizer(
    vocab_file=None,
    add_special_tokens=True,
    unk_token='[UNK]',
    sep_token='[SEP]',
    cls_token='[CLS]',
    clean_text=True,
    lowercase=False,
    handle_chinese_chars=True,
    strip_accents=True,
    wordpieces_prefix='##'
)
bwpt.train(
    files=["all_lines_from_ques_ans_xml_excluded_ann.txt"],  # one sentence per line
    vocab_size=80000,          # 80K cased WordPiece vocabulary
    min_frequency=30,
    limit_alphabet=2000,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]']
)
bwpt.save("./", "soft-bert-vocab")  # writes the vocab file used for pre-training
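
A quick way to sanity-check the trained vocabulary is to reload it and tokenize a StackOverflow-style sentence; the vocab file name below is an assumption (older tokenizers versions write it as <name>-vocab.txt):

tok = tokenizers.BertWordPieceTokenizer('soft-bert-vocab-vocab.txt', lowercase=False)
enc = tok.encode('How do I convert a std::string to an int?')
print(enc.tokens)  # e.g. ['[CLS]', 'How', 'do', 'I', 'convert', ...]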

TF Records

We split the large file and create TF Records in parallel:

split -l 400000 ../../data/my_data.all my-data- --verbose

ls ../saved_model/softbert/raw_txt_data/ | xargs -n 1 -P 16 -I{} python create_pretraining_data.py \
    --input_file=../saved_model/softbert/raw_txt_data/{} \
    --output_file=../saved_model/softbert/tf_records_data/{}.tfrecord \
    --vocab_file=../saved_model/softbert/vocab.txt \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5 \
    --do_whole_word_mask=False \
    --do_lower_case=False
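
As an optional sanity check (not part of the original pipeline), you can count the training instances written to one of the shards using the TF 1.x record iterator; the shard name below is an example:

import tensorflow as tf

path = '../saved_model/softbert/tf_records_data/my-data-aa.tfrecord'
num_instances = sum(1 for _ in tf.python_io.tf_record_iterator(path))
print(path, num_instances)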

Pre-training

The pre-training is conducted on a TPU v2-8 with TensorFlow 1.15:

python3 run_pretraining.py \
    --input_file=gs://softbert_data/processed_data/*.tfrecord \
    --output_dir=gs://softbert_data/model_base/ \
    --do_train=True \
    --do_eval=True \
    --bert_config_file=gs://softbert_data/model_base/bert_config.json \
    --train_batch_size=512 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=1500000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --use_tpu=True \
    --tpu_name=$TPU_NAME \
    --save_checkpoints_steps 100000
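
If you want to use the resulting checkpoint with PyTorch, one option (an addition to this README, not part of the original recipe) is to convert the final TF checkpoint to Hugging Face format; the paths below are placeholders:

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, 'model.ckpt-1500000')  # TF checkpoint prefix (placeholder)
torch.save(model.state_dict(), 'pytorch_model.bin')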