
lanwuwei / BERTOverflow

Licence: other
A Pre-trained BERT on StackOverflow Corpus

Projects that are alternatives of or similar to BERTOverflow

DeepNER
An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models.
Stars: ✭ 9 (-77.5%)
Mutual labels:  named-entity-recognition, bert
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+277.5%)
Mutual labels:  named-entity-recognition, bert
banglabert
This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" accepted in Findings of the Annual Conference of the North American Chap…
Stars: ✭ 186 (+365%)
Mutual labels:  named-entity-recognition, bert
bern
A neural named entity recognition and multi-type normalization tool for biomedical text mining
Stars: ✭ 151 (+277.5%)
Mutual labels:  named-entity-recognition, bert
Bert Bilstm Crf Ner
Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning And private Server services
Stars: ✭ 3,838 (+9495%)
Mutual labels:  named-entity-recognition, bert
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+6195%)
Mutual labels:  named-entity-recognition, bert
OpenUE
OpenUE is a lightweight toolkit for knowledge graph extraction (An Open Toolkit for Universal Extraction from Text, published at EMNLP 2020: https://aclanthology.org/2020.emnlp-demos.1.pdf).
Stars: ✭ 274 (+585%)
Mutual labels:  named-entity-recognition, bert
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+112.5%)
Mutual labels:  named-entity-recognition, bert
knowledge-graph-nlp-in-action
From model training to deployment, a hands-on project for knowledge graphs and natural language processing (NLP). Uses TensorFlow, BERT + Bi-LSTM + CRF, Neo4j, and more, and covers tasks such as named entity recognition, text classification, information extraction, and relation extraction.
Stars: ✭ 58 (+45%)
Mutual labels:  named-entity-recognition, bert
Mt Dnn
Multi-Task Deep Neural Networks for Natural Language Understanding
Stars: ✭ 1,871 (+4577.5%)
Mutual labels:  named-entity-recognition, bert
Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+5487.5%)
Mutual labels:  named-entity-recognition, bert
soddi
StackOverflow Data Dump Importer. Forked from https://bitbucket.org/bitpusher/soddi/ after the original author passed away.
Stars: ✭ 74 (+85%)
Mutual labels:  stackoverflow
wisdomify
A BERT-based reverse dictionary of Korean proverbs
Stars: ✭ 95 (+137.5%)
Mutual labels:  bert
Answerable
Recommendation system for Stack Overflow unanswered questions
Stars: ✭ 13 (-67.5%)
Mutual labels:  stackoverflow
lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
Stars: ✭ 75 (+87.5%)
Mutual labels:  named-entity-recognition
embedding study
Learning character embeddings from Chinese pre-trained models; evaluates the Chinese-language performance of BERT and ELMo.
Stars: ✭ 94 (+135%)
Mutual labels:  bert
NLPDataAugmentation
Chinese NLP Data Augmentation, BERT Contextual Augmentation
Stars: ✭ 94 (+135%)
Mutual labels:  bert
cmrc2019
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension (CMRC 2019)
Stars: ✭ 118 (+195%)
Mutual labels:  bert
Kaleido-BERT
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Stars: ✭ 252 (+530%)
Mutual labels:  bert
AliceMind
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab
Stars: ✭ 1,479 (+3597.5%)
Mutual labels:  bert


BERTOverflow

This repository contains a BERT model pre-trained on StackOverflow data, which achieves state-of-the-art performance (with a CRF layer) on software-domain NER. The checkpoints can be downloaded here.
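
If you just want to use the released model, a minimal sketch with Hugging Face transformers looks like the following. It assumes the checkpoint is available in (or has been converted to) Hugging Face format; the hub id "jeniya/BERTOverflow" is the commonly referenced mirror and should be verified before use:

from transformers import AutoTokenizer, AutoModel

# The hub id below is an assumption; replace it with a local checkpoint directory if needed.
tokenizer = AutoTokenizer.from_pretrained("jeniya/BERTOverflow")
model = AutoModel.from_pretrained("jeniya/BERTOverflow")

inputs = tokenizer("How do I sort a HashMap by value in Java?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings, e.g. torch.Size([1, seq_len, 768])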

For further details, see the accompanying paper: Code and Named Entity Recognition in StackOverflow

Note: this is just a reference for BERT pre-training with your own data. First, download the original BERT codebase, then apply for TPU access through TFRC, and finally follow this README for BERT pre-training.

Data

We extract 152M sentences from StackOverflow questions and answers.
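
The exact extraction pipeline is not included here; the sketch below shows one way to produce a sentence-per-line file from the StackOverflow data dump (Posts.xml). The file names, HTML stripping, and sentence splitting are illustrative assumptions, not the authors' code:

import re
import html
import xml.etree.ElementTree as ET

TAG = re.compile(r'<[^>]+>')
SENT_SPLIT = re.compile(r'(?<=[.!?])\s+')

def post_sentences(path):
    # Posts.xml is a single <posts> element containing one <row .../> per question/answer.
    for _, row in ET.iterparse(path, events=('end',)):
        if row.tag != 'row':
            continue
        body = html.unescape(row.get('Body', ''))
        text = TAG.sub(' ', body)          # drop HTML markup
        for sent in SENT_SPLIT.split(text):
            sent = ' '.join(sent.split())
            if sent:
                yield sent
        row.clear()                        # keep memory bounded on the large dump

with open('stackoverflow_sentences.txt', 'w', encoding='utf-8') as out:
    for sent in post_sentences('Posts.xml'):
        out.write(sent + '\n')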

Vocabulary

We create an 80K cased WordPiece vocabulary with 2K different UNK symbols:

import tokenizers

# Train a cased (lowercase=False) WordPiece vocabulary on the raw StackOverflow text.
bwpt = tokenizers.BertWordPieceTokenizer(
    vocab_file=None,
    add_special_tokens=True,
    unk_token='[UNK]',
    sep_token='[SEP]',
    cls_token='[CLS]',
    clean_text=True,
    lowercase=False,
    handle_chinese_chars=True,
    strip_accents=True,
    wordpieces_prefix='##'
)
bwpt.train(
    files=["all_lines_from_ques_ans_xml_excluded_ann.txt"],  # one sentence per line
    vocab_size=80000,          # 80K cased WordPiece vocabulary
    min_frequency=30,
    limit_alphabet=2000,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]']
)
bwpt.save("./", "soft-bert-vocab")  # writes the vocab file used for pre-training
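
A quick way to sanity-check the trained vocabulary is to reload it and tokenize a StackOverflow-style sentence; the vocab file name below is an assumption (older tokenizers versions write it as <name>-vocab.txt):

tok = tokenizers.BertWordPieceTokenizer('soft-bert-vocab-vocab.txt', lowercase=False)
enc = tok.encode('How do I convert a std::string to an int?')
print(enc.tokens)  # e.g. ['[CLS]', 'How', 'do', 'I', 'convert', ...]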

TF Records

We split the large file and create TF Records in parallel:

split -l 400000 ../../data/my_data.all my-data- --verbose

ls ../saved_model/softbert/raw_txt_data/ | xargs -n 1 -P 16 -I{} python create_pretraining_data.py \
    --input_file=../saved_model/softbert/raw_txt_data/{} \
    --output_file=../saved_model/softbert/tf_records_data/{}.tfrecord \
    --vocab_file=../saved_model/softbert/vocab.txt \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5 \
    --do_whole_word_mask=False \
    --do_lower_case=False
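
As an optional sanity check (not part of the original pipeline), you can count the training instances written to one of the shards using the TF 1.x record iterator; the shard name below is an example:

import tensorflow as tf

path = '../saved_model/softbert/tf_records_data/my-data-aa.tfrecord'
num_instances = sum(1 for _ in tf.python_io.tf_record_iterator(path))
print(path, num_instances)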

Pre-training

The pre-training is conducted on a TPU v2-8 with TensorFlow 1.15:

python3 run_pretraining.py \
    --input_file=gs://softbert_data/processed_data/*.tfrecord \
    --output_dir=gs://softbert_data/model_base/ \
    --do_train=True \
    --do_eval=True \
    --bert_config_file=gs://softbert_data/model_base/bert_config.json \
    --train_batch_size=512 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=1500000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --use_tpu=True \
    --tpu_name=$TPU_NAME \
    --save_checkpoints_steps 100000
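
If you want to use the resulting checkpoint with PyTorch, one option (an addition to this README, not part of the original recipe) is to convert the final TF checkpoint to Hugging Face format; the paths below are placeholders:

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, 'model.ckpt-1500000')  # TF checkpoint prefix (placeholder)
torch.save(model.state_dict(), 'pytorch_model.bin')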