charles9n / Bert Sklearn

Licence: apache-2.0

a sklearn wrapper for Google's BERT model

Labels

jupyter-notebook pytorch nlp natural-language-processing scikit-learn named-entity-recognition transfer-learning ner language-model

Projects that are alternatives of or similar to Bert Sklearn

Pytorch Bert Crf Ner

A Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based Named Entity Recognition model for Korean)

Stars: ✭ 236 (+29.67%)

Mutual labels: jupyter-notebook, natural-language-processing, named-entity-recognition, ner

Turkish Bert Nlp Pipeline

Bert-base NLP pipeline for Turkish, Ner, Sentiment Analysis, Question Answering etc.

Stars: ✭ 85 (-53.3%)

Mutual labels: jupyter-notebook, natural-language-processing, named-entity-recognition, ner

Vietnamese Electra

Electra pre-trained model using Vietnamese corpus

Stars: ✭ 55 (-69.78%)

Mutual labels: jupyter-notebook, natural-language-processing, language-model

Phonlp

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)

Stars: ✭ 56 (-69.23%)

Mutual labels: named-entity-recognition, ner, language-model

Bond

BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

Stars: ✭ 96 (-47.25%)

Mutual labels: natural-language-processing, named-entity-recognition, ner

Spacy Transformers

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

Stars: ✭ 919 (+404.95%)

Mutual labels: natural-language-processing, transfer-learning, language-model

Nagisa Tutorial Pycon2019

Code for PyCon JP 2019 talk "Japanese Natural Language Processing with Python: Real-World Text Analysis with Sequence Labeling"

Stars: ✭ 46 (-74.73%)

Mutual labels: jupyter-notebook, natural-language-processing, named-entity-recognition

Text Analytics With Python

Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.

Stars: ✭ 1,132 (+521.98%)

Mutual labels: jupyter-notebook, natural-language-processing, scikit-learn

Bert Multitask Learning

BERT for Multitask Learning

Stars: ✭ 380 (+108.79%)

Mutual labels: jupyter-notebook, named-entity-recognition, ner

Bnlp

BNLP is a natural language processing toolkit for Bengali Language.

Stars: ✭ 127 (-30.22%)

Mutual labels: jupyter-notebook, named-entity-recognition, ner

Multilstm

keras attentional bi-LSTM-CRF for Joint NLU (slot-filling and intent detection) with ATIS

Stars: ✭ 122 (-32.97%)

Mutual labels: jupyter-notebook, named-entity-recognition, ner

Ncrfpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.

Stars: ✭ 1,767 (+870.88%)

Mutual labels: natural-language-processing, named-entity-recognition, ner

Entity Recognition Datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

Stars: ✭ 891 (+389.56%)

Mutual labels: natural-language-processing, named-entity-recognition, ner

Awesome Bert Nlp

A curated list of NLP resources focused on BERT, attention mechanism, Transformer networks, and transfer learning.

Stars: ✭ 567 (+211.54%)

Mutual labels: natural-language-processing, transfer-learning, language-model

Ner blstm Crf

LSTM-CRF for NER with CoNLL-2002 dataset

Stars: ✭ 51 (-71.98%)

Mutual labels: jupyter-notebook, named-entity-recognition, ner

Transformers Tutorials

Github repo with tutorials to fine tune transformers for diff NLP tasks

Stars: ✭ 384 (+110.99%)

Mutual labels: jupyter-notebook, natural-language-processing, named-entity-recognition

Kashgari

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Stars: ✭ 2,235 (+1128.02%)

Mutual labels: named-entity-recognition, transfer-learning, ner

Vncorenlp

A Vietnamese natural language processing toolkit (NAACL 2018)

Stars: ✭ 354 (+94.51%)

Mutual labels: natural-language-processing, named-entity-recognition, ner

Spacy Streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Stars: ✭ 360 (+97.8%)

Mutual labels: natural-language-processing, named-entity-recognition, ner

Dat8

General Assembly's 2015 Data Science course in Washington, DC

Stars: ✭ 1,516 (+732.97%)

Mutual labels: jupyter-notebook, natural-language-processing, scikit-learn

scikit-learn wrapper to finetune BERT

A scikit-learn wrapper to finetune Google's BERT model for text and token sequence tasks based on the huggingface pytorch port.

Includes configurable MLP as final classifier/regressor for text and text pair tasks
Includes token sequence classifier for NER, PoS, and chunking tasks
Includes SciBERT and BioBERT pretrained models for scientific and biomedical domains.

Try in Google Colab!

installation

requires python >= 3.5 and pytorch >= 0.4.1

git clone -b master https://github.com/charles9n/bert-sklearn
cd bert-sklearn
pip install .

basic operation

model.fit(X, y), i.e. finetune BERT

X: list, pandas dataframe, or numpy array of text, text pairs, or token lists
y: list, pandas dataframe, or numpy array of labels/targets
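As a rough illustration of the input shapes described above, the sketch below builds small X and y values for the three task types using plain Python lists. The sentences and labels are made up, and the exact layout (tuples for text pairs, one label per token for token tasks) is an assumption rather than something the text above spells out.

```python
# Illustrative inputs only; all sentences and labels are invented.

# Text classification or regression: one string and one label per example.
X_text = ["a gripping, well-acted film", "dull and far too long"]
y_text = ["positive", "negative"]

# Text pair tasks: two strings per example (assumed here to be a 2-tuple).
X_pair = [
    ("The company posted record profits.", "Profits at the company hit a record."),
    ("The company posted record profits.", "The match was postponed due to rain."),
]
y_pair = [1, 0]  # 1 = paraphrase, 0 = not a paraphrase

# Token sequence tasks (NER, PoS, chunking): one token list per example,
# with one label per token (assumed layout).
X_tokens = [["Ada", "Lovelace", "lived", "in", "London", "."]]
y_tokens = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]

# Sanity checks: X and y line up example by example, and token by token.
assert len(X_text) == len(y_text)
assert len(X_pair) == len(y_pair)
assert all(len(tokens) == len(labels) for tokens, labels in zip(X_tokens, y_tokens))
```

Checking that X and y line up before calling fit is cheap compared with a failed finetuning run.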

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

# define model
model = BertClassifier()         # text/text pair classification
# model = BertRegressor()        # text/text pair regression
# model = BertTokenClassifier()  # token sequence classification

# finetune model
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

# make probability predictions
y_proba = model.predict_proba(X_test)

# score model on test data
model.score(X_test, y_test)

# save model to disk
savefile = '/data/mymodel.bin'
model.save(savefile)

# load model from disk
new_model = load_model(savefile)

# do stuff with new model
new_model.score(X_test, y_test)

See demo notebook.

model options

# try different options...
model.bert_model = 'bert-large-uncased'
model.num_mlp_layers = 3
model.max_seq_length = 196
model.epochs = 4
model.learning_rate = 4e-5
model.gradient_accumulation_steps = 4

# finetune
model.fit(X_train, y_train)

# do stuff...
model.score(X_test, y_test)

See options
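Of the options above, gradient_accumulation_steps is probably the least self-explanatory. The sketch below is a generic, pure-Python illustration of the idea and is not taken from bert-sklearn: it assumes a toy one-parameter model and made-up data, and checks that averaging the gradients of 4 equal-sized micro-batches gives the same gradient as a single pass over the full batch. How bert-sklearn applies the option internally is not shown here.

```python
# Generic illustration of gradient accumulation (not bert-sklearn's code).
# Toy model: y_hat = w * x, loss = mean squared error over a batch.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data, 8 examples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

accumulation_steps = 4
micro_size = len(xs) // accumulation_steps  # 2 examples per micro-batch

# Gradient computed in one pass over the full batch.
full_grad = grad_mse(w, xs, ys)

# Same gradient built up from micro-batches: each micro-batch gradient is
# scaled by 1 / accumulation_steps and summed before any weight update.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start, stop = step * micro_size, (step + 1) * micro_size
    accumulated_grad += grad_mse(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(abs(full_grad - accumulated_grad) < 1e-9)
```

The practical consequence is that a larger effective batch can be trained while holding only one micro-batch in memory at a time, at the cost of more forward and backward passes per weight update.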

hyperparameter tuning

from sklearn.model_selection import GridSearchCV
from bert_sklearn import BertClassifier

params = {'epochs': [3, 4], 'learning_rate': [2e-5, 3e-5, 5e-5]}

# wrap classifier in GridSearchCV
clf = GridSearchCV(BertClassifier(validation_fraction=0),
                   params,
                   scoring='accuracy',
                   verbose=True)

# fit gridsearch
clf.fit(X_train, y_train)

See demo_tuning_hyperparameters notebook.
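Because every grid point means a full finetuning run, it can help to count the runs before starting. The sketch below enumerates the grid from the example with the standard library only. The fold count of 5 and the extra refit are assumptions based on scikit-learn's GridSearchCV defaults (cv=5 since version 0.22, refit=True), not settings stated in the example.

```python
from itertools import product

# Same grid as in the GridSearchCV example.
params = {'epochs': [3, 4], 'learning_rate': [2e-5, 3e-5, 5e-5]}

# Enumerate every combination of the listed values.
names = sorted(params)
combos = [dict(zip(names, values))
          for values in product(*(params[name] for name in names))]

n_folds = 5  # assumption: GridSearchCV default cv
n_fits = len(combos) * n_folds + 1  # +1 assumes the default final refit

for combo in combos:
    print(combo)
print(f"{len(combos)} combinations, {n_fits} finetuning runs in total")
```

With these assumptions the 2 x 3 grid costs 31 finetuning runs, so it is worth keeping the grid small or passing an explicit, smaller cv.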

GLUE datasets

The train and dev data sets from the GLUE (General Language Understanding Evaluation) benchmark were used with the bert-base-uncased model and compared against the results reported in the Google paper and on the GLUE leaderboard.

                         MNLI(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE
BERT base (leaderboard)  84.6/83.4   89.2  90.1  93.5   52.1  87.1   84.8  66.4
bert-sklearn             83.7/83.9   90.2  88.6  92.32  58.1  89.7   86.8  64.6

Individual runs can be found here.

CoNLL-2003 Named Entity Recognition (NER)

NER results for CoNLL-2003 shared task

              dev f1  test f1
BERT paper    96.4    92.4
bert-sklearn  96.04   91.97

Span level stats on test:

processed 46666 tokens with 5648 phrases; found: 5740 phrases; correct: 5173.
accuracy:  98.15%; precision:  90.12%; recall:  91.59%; FB1:  90.85
              LOC: precision:  92.24%; recall:  92.69%; FB1:  92.46  1676
             MISC: precision:  78.07%; recall:  81.62%; FB1:  79.81  734
              ORG: precision:  87.64%; recall:  90.07%; FB1:  88.84  1707
              PER: precision:  96.00%; recall:  96.35%; FB1:  96.17  1623
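The overall precision, recall, and FB1 in the report follow directly from the three phrase counts on its first line. The sketch below recomputes them, assuming the usual conlleval definitions (a predicted phrase counts as correct only if its boundaries and type both match the gold annotation). The variable names are illustrative.

```python
# Counts copied from the first line of the span-level report.
gold_phrases = 5648     # phrases in the gold annotation
found_phrases = 5740    # phrases predicted by the model
correct_phrases = 5173  # predicted phrases that match a gold phrase

precision = correct_phrases / found_phrases
recall = correct_phrases / gold_phrases
f1 = 2 * precision * recall / (precision + recall)

# The last column of the per-type rows is the number of phrases found
# for that type; the four values add up to the overall "found" count.
found_by_type = {"LOC": 1676, "MISC": 734, "ORG": 1707, "PER": 1623}
assert sum(found_by_type.values()) == found_phrases

summary = (f"precision: {100 * precision:.2f}%; "
           f"recall: {100 * recall:.2f}%; "
           f"FB1: {100 * f1:.2f}")
print(summary)
```

This reproduces the 90.12% precision, 91.59% recall, and 90.85 FB1 shown on the overall line of the report.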

See ner_english notebook for a demo using 'bert-base-cased' model.

NCBI Biomedical NER

NER results using bert-sklearn with SciBERT and BioBERT on the NCBI disease corpus name recognition task.

Previous SOTA for this task is 87.34 for f1 on the test set.

                         test f1 (bert-sklearn)  test f1 (from papers)
BERT base cased          85.09                   85.49
SciBERT basevocab cased  88.29                   86.91
SciBERT scivocab cased   87.73                   86.45
BioBERT pubmed_v1.0      87.86                   87.38
BioBERT pubmed_pmc_v1.0  88.26                   89.36
BioBERT pubmed_v1.1      87.26                   NA

See ner_NCBI_disease_BioBERT_SciBERT notebook for a demo using SciBERT and BioBERT models.

See SciBERT paper and BioBERT paper for more info on the respective models.

Other examples

See IMDb notebook for a text classification demo on the Internet Movie Database review sentiment task.
See chunking_english notebook for a demo on syntactic chunking using the CoNLL-2000 chunking task data.
See ner_chinese notebook for a demo using 'bert-base-chinese' for Chinese NER.

tests

Run tests with pytest:

python -m pytest -sv tests/

Stars: ✭ 182
