imgarylai / Bert Embedding

Licence: apache-2.0
🔡 Token level embeddings from BERT model on mxnet and gluonnlp

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bert Embedding

Aws Machine Learning University Accelerated Nlp
Machine Learning University: Accelerated Natural Language Processing Class
Stars: ✭ 1,695 (+299.76%)
Mutual labels:  natural-language-processing, mxnet
Thinc
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries
Stars: ✭ 2,422 (+471.23%)
Mutual labels:  natural-language-processing, mxnet
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+2910.14%)
Mutual labels:  natural-language-processing, word-embeddings
Flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Stars: ✭ 11,065 (+2509.67%)
Mutual labels:  natural-language-processing, word-embeddings
D2l Vn
An interactive deep learning book with source code, math, and discussions. Covers several popular frameworks (TensorFlow, PyTorch & MXNet) and is used at 175 universities.
Stars: ✭ 402 (-5.19%)
Mutual labels:  natural-language-processing, mxnet
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+306.13%)
Mutual labels:  natural-language-processing, word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim), and evaluate them with generated test sets
Stars: ✭ 189 (-55.42%)
Mutual labels:  natural-language-processing, word-embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+228.77%)
Mutual labels:  natural-language-processing, word-embeddings
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+656.84%)
Mutual labels:  natural-language-processing, word-embeddings
Wordgcn
ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
Stars: ✭ 230 (-45.75%)
Mutual labels:  natural-language-processing, word-embeddings
Danlp
DaNLP is a repository for Natural Language Processing resources for the Danish Language.
Stars: ✭ 111 (-73.82%)
Mutual labels:  natural-language-processing, word-embeddings
Biosentvec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
Stars: ✭ 308 (-27.36%)
Mutual labels:  natural-language-processing, word-embeddings
Kadot
Kadot, the unsupervised natural language processing library.
Stars: ✭ 108 (-74.53%)
Mutual labels:  natural-language-processing, word-embeddings
Nlp Pretrained Model
A collection of Natural language processing pre-trained models.
Stars: ✭ 122 (-71.23%)
Mutual labels:  natural-language-processing, mxnet
Easy Bert
A Dead Simple BERT API for Python and Java (https://github.com/google-research/bert)
Stars: ✭ 106 (-75%)
Mutual labels:  natural-language-processing, word-embeddings
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (-55.66%)
Mutual labels:  natural-language-processing, word-embeddings
Textblob Ar
Arabic support for textblob
Stars: ✭ 60 (-85.85%)
Mutual labels:  natural-language-processing, word-embeddings
D2l En
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 300 universities from 55 countries including Stanford, MIT, Harvard, and Cambridge.
Stars: ✭ 11,837 (+2691.75%)
Mutual labels:  natural-language-processing, mxnet
Gluon Nlp
NLP made easy
Stars: ✭ 2,344 (+452.83%)
Mutual labels:  natural-language-processing, mxnet
Autogluon
AutoGluon: AutoML for Text, Image, and Tabular Data
Stars: ✭ 3,920 (+824.53%)
Mutual labels:  natural-language-processing, mxnet

Bert Embeddings

[Deprecated] Thank you for checking out this project. Unfortunately, I don't have time to maintain it anymore. If you are interested in maintaining this project, please create an issue and let me know.


BERT, published by Google, is a new way to obtain pre-trained language model word representations. Many NLP tasks benefit from BERT to achieve state-of-the-art results.

The goal of this project is to obtain token embeddings from BERT's pre-trained model. This way, instead of building and fine-tuning an end-to-end NLP model, you can build your model by simply utilizing the token embeddings.

This project is implemented with @MXNet. Special thanks to the @gluon-nlp team.

Install

pip install bert-embedding
# If you want to run on a GPU machine, install `mxnet-cu92`.
pip install mxnet-cu92
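
To verify that the GPU build is being picked up, a quick sanity check (a sketch, assuming MXNet 1.3 or newer):

import mxnet as mx

# Should print a value greater than 0 if the CUDA build of MXNet can see a GPU
print(mx.context.num_gpus())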

Usage

from bert_embedding import BertEmbedding

bert_abstract = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
 Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
 As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. 
BERT is conceptually simple and empirically powerful. 
It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%."""
sentences = bert_abstract.split('\n')
bert_embedding = BertEmbedding()
result = bert_embedding(sentences)

If you want to use a GPU, import mxnet and set the context:

import mxnet as mx
from bert_embedding import BertEmbedding

...

ctx = mx.gpu(0)
bert = BertEmbedding(ctx=ctx)

The result is a list of tuples, one per sentence, each containing (tokens, token embeddings).

For example:

first_sentence = result[0]

first_sentence[0]
# ['we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'bert', ',', 'which', 'stands', 'for', 'bidirectional', 'encoder', 'representations', 'from', 'transformers']
len(first_sentence[0])
# 18


# first_sentence[1] holds one embedding vector per token
len(first_sentence[1])
# 18
first_sentence_embeddings = first_sentence[1]
first_sentence_embeddings[1]  # embedding of the second token, 'introduce'
# array([ 0.4805648 ,  0.18369392, -0.28554988, ..., -0.01961522,
#        1.0207764 , -0.67167974], dtype=float32)
first_sentence_embeddings[1].shape
# (768,)
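
Once you have per-token vectors, one simple way to use them downstream (a sketch, not part of the library, reusing result from the Usage section above) is to mean-pool each sentence's token embeddings into a fixed-size sentence vector:

import numpy as np

# Sketch: average the token embeddings of each sentence into a single 768-d vector
sentence_vectors = [np.mean(np.stack(token_embeddings), axis=0)
                    for tokens, token_embeddings in result]
sentence_vectors[0].shape
# (768,)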

OOV

There are three ways to handle out-of-vocabulary (OOV) tokens: avg (the default), sum, and last. The strategy can be specified when encoding the sentences.

...
bert_embedding = BertEmbedding()
bert_embedding(sentences, 'sum')
...
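
As a quick illustration, the three strategies can be swapped in and the output shapes compared. A minimal sketch (the example sentence is made up):

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding()
sentences = ['bert handles rare words by splitting them into wordpieces']

# Sketch: encode the same sentence with each OOV strategy and compare shapes
for oov_way in ('avg', 'sum', 'last'):
    tokens, embeddings = bert_embedding(sentences, oov_way)[0]
    print(oov_way, len(tokens), embeddings[0].shape)  # vectors stay 768-dimensional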

Available pre-trained BERT models

Each model can be paired with the dataset_name values below:

bert_12_768_12:  book_corpus_wiki_en_uncased, book_corpus_wiki_en_cased, wiki_multilingual, wiki_multilingual_cased, wiki_cn
bert_24_1024_16: book_corpus_wiki_en_uncased

Example of using the large pre-trained BERT model from Google

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding(model='bert_24_1024_16', dataset_name='book_corpus_wiki_en_cased')

Source: gluonnlp
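
Similarly, the base-size model can be paired with one of the multilingual datasets listed above (a sketch; the dataset_name values are taken from the table):

from bert_embedding import BertEmbedding

# Sketch: base-size BERT with the cased multilingual weights and vocabulary
bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='wiki_multilingual_cased')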
