
CyberZHG / keras-word-char-embd

License: MIT License
Concatenate word and character embeddings in Keras

Programming Languages: Python, Shell

Projects that are alternatives of or similar to keras-word-char-embd

sentiment-analysis-of-tweets-in-russian
Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.
Stars: ✭ 51 (+18.6%)
Mutual labels:  embeddings
reach
Load embeddings and featurize your sentences.
Stars: ✭ 17 (-60.47%)
Mutual labels:  embeddings
Persian-Sentiment-Analyzer
Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )
Stars: ✭ 30 (-30.23%)
Mutual labels:  embeddings
EmbeddedScrollView
Embedded UIScrollView for iOS.
Stars: ✭ 55 (+27.91%)
Mutual labels:  embeddings
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (-20.93%)
Mutual labels:  embeddings
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+337.21%)
Mutual labels:  embeddings
code-compass
a contextual search engine for software packages built on import2vec embeddings (https://www.code-compass.com)
Stars: ✭ 33 (-23.26%)
Mutual labels:  embeddings
cade
Compass-aligned Distributional Embeddings. Align embeddings from different corpora
Stars: ✭ 29 (-32.56%)
Mutual labels:  embeddings
Recommender-Systems-with-Collaborative-Filtering-and-Deep-Learning-Techniques
Implemented User Based and Item based Recommendation System along with state of the art Deep Learning Techniques
Stars: ✭ 41 (-4.65%)
Mutual labels:  embeddings
FaceRecognition With FaceNet Android
Face Recognition using the FaceNet model and MLKit on Android.
Stars: ✭ 113 (+162.79%)
Mutual labels:  embeddings
IR2Vec
Implementation of IR2Vec, published in ACM TACO
Stars: ✭ 28 (-34.88%)
Mutual labels:  embeddings
CODER
CODER: Knowledge infused cross-lingual medical term embedding for term normalization. [JBI, ACL-BioNLP 2022]
Stars: ✭ 24 (-44.19%)
Mutual labels:  embeddings
ruimtehol
R package to Embed All the Things! using StarSpace
Stars: ✭ 95 (+120.93%)
Mutual labels:  embeddings
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (-25.58%)
Mutual labels:  embeddings
watset-java
An implementation of the Watset clustering algorithm in Java.
Stars: ✭ 24 (-44.19%)
Mutual labels:  embeddings
whatlies
Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
Stars: ✭ 351 (+716.28%)
Mutual labels:  embeddings
embeddinghub
A vector database for machine learning embeddings.
Stars: ✭ 645 (+1400%)
Mutual labels:  embeddings
go2vec
Read and use word2vec vectors in Go
Stars: ✭ 44 (+2.33%)
Mutual labels:  embeddings
game2vec
TensorFlow implementation of word2vec applied on https://www.kaggle.com/tamber/steam-video-games dataset, using both CBOW and Skip-gram.
Stars: ✭ 62 (+44.19%)
Mutual labels:  embeddings
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (+123.26%)
Mutual labels:  embeddings

Word/Character Embeddings in Keras

Introduction


Out-of-vocabulary words are a major drawback of pure word embeddings, so word-level and character-level features are often combined. The characters in a word are first mapped to character embeddings, a bidirectional recurrent layer then encodes those character embeddings into a single vector, and the final feature of a word is the concatenation of its word embedding with this encoded character feature.
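
For clarity, here is a minimal sketch of that idea written directly with Keras layers; the dimensions (300-dimensional word embeddings, 50-dimensional character embeddings, a 150-unit bidirectional LSTM) and all variable names are illustrative only and are not part of this package:

from tensorflow import keras

word_vocab_size, char_vocab_size, max_word_len = 10000, 100, 16

word_input = keras.layers.Input(shape=(None,), name='Word_Input')
char_input = keras.layers.Input(shape=(None, max_word_len), name='Char_Input')

# Word-level embedding: (batch, sentence_len, 300)
word_embd = keras.layers.Embedding(word_vocab_size, 300)(word_input)

# Character-level embedding: (batch, sentence_len, max_word_len, 50)
char_embd = keras.layers.Embedding(char_vocab_size, 50)(char_input)

# Encode each word's characters into one vector: (batch, sentence_len, 2 * 150)
char_encoded = keras.layers.TimeDistributed(
    keras.layers.Bidirectional(keras.layers.LSTM(units=150))
)(char_embd)

# Final word feature: concatenation of both, (batch, sentence_len, 600)
word_feature = keras.layers.Concatenate()([word_embd, char_encoded])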

The repository contains functions and a wrapper class that can be used to generate these first few layers, which encode the word and character features.

Install

pip install keras-word-char-embd

Demo

There is a sentiment analysis demo in the demo directory. Run the following commands; the resulting model should reach about 70% accuracy:

cd demo
./get_data.sh
python sentiment_analysis.py

Functions

This section only introduces the basic usage of the functions. For more detailed information, please refer to the demo and the doc comments in the source code.

get_dicts_generator

The function returns a closure that builds the word and character dictionaries. Invoke the closure on every training sentence so that it records the frequency of each word and character; after that, call it with return_dict=True to retrieve the dictionaries.

from keras_wc_embd import get_dicts_generator

sentences = [
    ['All', 'work', 'and', 'no', 'play'],
    ['makes', 'Jack', 'a', 'dull', 'boy', '.'],
]
dict_generator = get_dicts_generator(
    word_min_freq=2,
    char_min_freq=2,
    word_ignore_case=False,
    char_ignore_case=False,
)
for sentence in sentences:
    dict_generator(sentence)

word_dict, char_dict, max_word_len = dict_generator(return_dict=True)

You can build the dictionaries on your own, but make sure that index 0 and the index for <UNK> are reserved.
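
For example, a hand-built pair of dictionaries might look like the sketch below; the exact keys used here for padding and the unknown token are illustrative assumptions, the important point being that index 0 and the <UNK> index are never assigned to real tokens:

word_dict = {
    '': 0,        # reserved for padding
    '<UNK>': 1,   # reserved for unknown words
    'All': 2,
    'work': 3,
}
char_dict = {
    '': 0,        # reserved for padding
    '<UNK>': 1,   # reserved for unknown characters
    'A': 2,
    'l': 3,
    'w': 4,
}
max_word_len = 5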

get_embedding_layer

Generate the first few layers that encode the words in a sentence:

from tensorflow import keras
from keras_wc_embd import get_embedding_layer

inputs, embd_layer = get_embedding_layer(
    word_dict_len=len(word_dict),
    char_dict_len=len(char_dict),
    max_word_len=max_word_len,
    word_embd_dim=300,
    char_embd_dim=50,
    char_hidden_dim=150,
    char_hidden_layer_type='lstm',
)
model = keras.models.Model(inputs=inputs, outputs=embd_layer)
model.summary()

The output shape of embd_layer should be (None, None, 600), representing the batch size, the sentence length, and the dimensionality of the encoded word feature. The 600 comes from the 300-dimensional word embedding concatenated with the bidirectional character encoding (2 × char_hidden_dim = 300).

char_hidden_layer_type can be 'lstm', 'gru', 'cnn', a Keras layer, or a list of Keras layers. Remember to add MaskedConv1D and MaskedFlatten to custom_objects if you are using 'cnn':

from tensorflow import keras
from keras_wc_embd import MaskedConv1D, MaskedFlatten

keras.models.load_model(filepath, custom_objects={
    'MaskedConv1D': MaskedConv1D,
    'MaskedFlatten': MaskedFlatten,
})
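
The snippet below is an illustrative sketch of the last option, passing a Keras layer instead of one of the string shortcuts; the keyword arguments are the same ones used above, and the choice of a GRU as the custom character encoder is an assumption for demonstration, not the package's documented behavior:

from tensorflow import keras
from keras_wc_embd import get_embedding_layer

inputs, embd_layer = get_embedding_layer(
    word_dict_len=len(word_dict),
    char_dict_len=len(char_dict),
    max_word_len=max_word_len,
    word_embd_dim=300,
    char_embd_dim=50,
    char_hidden_dim=150,
    char_hidden_layer_type=keras.layers.GRU(units=150),  # a custom character encoder
)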

get_batch_input

The function is used to generate the batch inputs for the model.

from keras_wc_embd import get_batch_input

word_embd_input, char_embd_input = get_batch_input(
    sentences,
    max_word_len=max_word_len,
    word_dict=word_dict,
    char_dict=char_dict,
)
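
As a rough sanity check (the shapes below are inferred from the model inputs described above, not quoted from the package documentation), both arrays are padded to the longest sentence in the batch:

# With the two example sentences above (the longest has 6 tokens):
#   word_embd_input.shape == (2, 6)                 word indices per sentence
#   char_embd_input.shape == (2, 6, max_word_len)   character indices per word
print(word_embd_input.shape, char_embd_input.shape)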

get_embedding_weights_from_file

A helper function that loads pre-trained embeddings for initializing the weights of the embedding layer. The file format should be similar to GloVe: each line contains a token followed by its whitespace-separated vector components.

from keras_wc_embd import get_embedding_layer, get_embedding_weights_from_file

word_embd_weights = get_embedding_weights_from_file(word_dict, 'glove.6B.100d.txt', ignore_case=True)
inputs, embd_layer = get_embedding_layer(
    word_dict_len=len(word_dict),
    char_dict_len=len(char_dict),
    max_word_len=max_word_len,
    word_embd_dim=100,  # must match the dimensionality of the pre-trained vectors
    char_embd_dim=50,
    char_hidden_dim=150,
    word_embd_weights=word_embd_weights,
    char_hidden_layer_type='lstm',
)
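
As an illustration of that format, the toy file and token vectors below are made up purely for demonstration:

from keras_wc_embd import get_embedding_weights_from_file

# One token per line, followed by its vector components.
with open('toy_embeddings.txt', 'w') as f:
    f.write('the 0.1 0.2 0.3\n')
    f.write('cat 0.4 0.5 0.6\n')

word_embd_weights = get_embedding_weights_from_file(word_dict, 'toy_embeddings.txt', ignore_case=True)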

Wrapper Class WordCharEmbd

There is a wrapper class that makes things easier.

import numpy as np
from tensorflow import keras

from keras_wc_embd import WordCharEmbd

sentences = [
    ['All', 'work', 'and', 'no', 'play'],
    ['makes', 'Jack', 'a', 'dull', 'boy', '.'],
]
wc_embd = WordCharEmbd(
    word_min_freq=0,
    char_min_freq=0,
    word_ignore_case=False,
    char_ignore_case=False,
)
for sentence in sentences:
    wc_embd.update_dicts(sentence)

inputs, embd_layer = wc_embd.get_embedding_layer()
lstm_layer = keras.layers.LSTM(units=5, name='LSTM')(embd_layer)
softmax_layer = keras.layers.Dense(units=2, activation='softmax', name='Softmax')(lstm_layer)
model = keras.models.Model(inputs=inputs, outputs=softmax_layer)
model.compile(
    optimizer='adam',
    loss=keras.losses.sparse_categorical_crossentropy,
    metrics=[keras.metrics.sparse_categorical_accuracy],
)
model.summary()


def batch_generator():
    while True:
        yield wc_embd.get_batch_input(sentences), np.asarray([0, 1])

model.fit(
    batch_generator(),
    steps_per_epoch=200,
    epochs=1,
)
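
After training, the same helper can be used to featurize new sentences for prediction (an illustrative usage, not from the original README):

predictions = model.predict(wc_embd.get_batch_input(sentences))
print(predictions.shape)  # (2, 2): one softmax distribution per sentence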

Citation

Several papers describe the same technique; cite whichever one you have seen.
