
cod3licious / conec

License: MIT
Context Encoders (ConEc) as a simple but powerful extension of the word2vec model for learning word embeddings

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to conec

compress-fasttext
Tools for shrinking fastText models (in gensim format)
Stars: ✭ 124 (+520%)
Mutual labels:  word-embeddings
QuestionClustering
Question classifier written in Python 3, implemented in the following video: https://youtu.be/qnlW1m6lPoY
Stars: ✭ 15 (-25%)
Mutual labels:  word-embeddings
NTUA-slp-nlp
💻 Speech and Natural Language Processing (SLP & NLP) Lab Assignments for ECE NTUA
Stars: ✭ 19 (-5%)
Mutual labels:  word-embeddings
JoSH
[KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Stars: ✭ 55 (+175%)
Mutual labels:  word-embeddings
SiameseCBOW
Implementation of Siamese CBOW using Keras with a TensorFlow backend.
Stars: ✭ 14 (-30%)
Mutual labels:  word-embeddings
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec, from the paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (+35%)
Mutual labels:  word-embeddings
pair2vec
pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference
Stars: ✭ 62 (+210%)
Mutual labels:  word-embeddings
Naive-Resume-Matching
Text similarity applied to resumes: compares resumes with job descriptions and creates a score to rank them, similar to an ATS (applicant tracking system).
Stars: ✭ 27 (+35%)
Mutual labels:  word-embeddings
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (+195%)
Mutual labels:  word-embeddings
sentiment-analysis-of-tweets-in-russian
Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.
Stars: ✭ 51 (+155%)
Mutual labels:  word-embeddings
robot-mind-meld
A little game powered by word vectors
Stars: ✭ 31 (+55%)
Mutual labels:  word-embeddings
SIFRank
The code of our paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model"
Stars: ✭ 96 (+380%)
Mutual labels:  word-embeddings
wikidata-corpus
Train word2vec on Wikidata for word embedding tasks
Stars: ✭ 109 (+445%)
Mutual labels:  word-embeddings
Active-Explainable-Classification
A set of tools for leveraging pre-trained embeddings, active learning and model explainability for efficient document classification
Stars: ✭ 28 (+40%)
Mutual labels:  word-embeddings
Word-recognition-EmbedNet-CAB
Code implementation for our ICPR 2020 paper titled "Improving Word Recognition using Multiple Hypotheses and Deep Embeddings"
Stars: ✭ 19 (-5%)
Mutual labels:  word-embeddings
MorphologicalPriorsForWordEmbeddings
Code for EMNLP 2016 paper: Morphological Priors for Probabilistic Word Embeddings
Stars: ✭ 53 (+165%)
Mutual labels:  word-embeddings
materials-synthesis-generative-models
Public release of data and code for materials synthesis generation
Stars: ✭ 47 (+135%)
Mutual labels:  word-embeddings
codenames
Codenames AI using Word Vectors
Stars: ✭ 41 (+105%)
Mutual labels:  word-embeddings
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (+60%)
Mutual labels:  word-embeddings
context2vec
PyTorch implementation of context2vec from Melamud et al., CoNLL 2016
Stars: ✭ 18 (-10%)
Mutual labels:  word-embeddings

Context Encoders (ConEc)

With this code you can train and evaluate Context Encoders (ConEc), an extension of word2vec. ConEcs learn word embeddings from large corpora, can create out-of-vocabulary embeddings on the spot, and can distinguish between multiple meanings of a word based on its local context. For further details on the model and experiments, please refer to the paper. And of course, if any of this code was helpful for your research, please consider citing it:

    @inproceedings{horn2017conecRepL4NLP,
      author       = {Horn, Franziska},
      title        = {Context encoders as a simple but powerful extension of word2vec},
      booktitle    = {Proceedings of the 2nd Workshop on Representation Learning for NLP},
      year         = {2017},
      organization = {Association for Computational Linguistics},
      pages        = {10--14}
    }
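
The ability to embed out-of-vocabulary words follows directly from how ConEc embeddings are built: a word's embedding is a combination of the trained word2vec embeddings of the words it occurs with, so the same combination can be formed for an unseen word from its surrounding context. Below is a minimal sketch of this idea; the oov_embedding helper and the gensim-style .wv.vocab lookup are illustrative assumptions, not part of the library API:

    import numpy as np

    def oov_embedding(context_words, w2v_model):
        """Hypothetical helper: embed an unseen word by averaging the
        (length normalized) word2vec embeddings of its context words."""
        # look up the indices of all known context words (gensim-style vocab assumed)
        idx = [w2v_model.wv.vocab[w].index for w in context_words if w in w2v_model.wv.vocab]
        # average their normalized embeddings and renormalize to unit length
        emb = w2v_model.wv.vectors_norm[idx].mean(axis=0)
        return emb / np.linalg.norm(emb)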

The code is intended for research purposes. It should run with both Python 2.7 and Python 3, but there are no guarantees; if you find a bug, please open an issue!

installation

Either download the code from here and add the conec folder to your $PYTHONPATH, or install the library components only via pip:

$ pip install conec

conec library components

dependencies: numpy, scipy

  • word2vec.py: code to train a standard word2vec model, adapted from the corresponding gensim implementation.
  • context2vec.py: code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings:
    import numpy as np
    from conec import context2vec, word2vec
    # Text8Corpus is assumed to ship with conec's word2vec module (adapted from gensim)
    from conec.word2vec import Text8Corpus

    # get the text for training
    sentences = Text8Corpus('data/text8')
    # train the word2vec model
    w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3)
    # build the global (sparse) context matrix for the text
    context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count,
                                             window=w2v_model.window, wordlist=w2v_model.wv.index2word)
    context_mat = context_model.get_context_matrix(fill_diag=False, norm='max')
    # multiply the context matrix with the (length normalized) word2vec embeddings
    # to get the context encoder (ConEc) embeddings
    conec_emb = context_mat.dot(w2v_model.wv.vectors_norm)
    # renormalize so the word embeddings have unit length again
    conec_emb = conec_emb / np.array([np.linalg.norm(conec_emb, axis=1)]).T
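
The resulting matrix can be used like any other set of word embeddings. Since all vectors have unit length, cosine similarity reduces to a dot product; a small illustrative nearest-neighbor query (the word 'king' is just an example, any in-vocabulary word works) might look like this:

    # illustrative usage: nearest neighbors of a word under the ConEc embeddings
    word_idx = w2v_model.wv.index2word.index('king')  # any in-vocabulary word
    sims = conec_emb.dot(conec_emb[word_idx])         # cosine similarities (unit vectors)
    neighbors = np.argsort(sims)[::-1][1:6]           # top 5, skipping the word itself
    print([w2v_model.wv.index2word[i] for i in neighbors])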

examples

additional dependencies: sklearn

test_analogy.py and test_ner.py contain the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper.

To run the analogy experiment, it is assumed that the text8 corpus or the 1-billion corpus, as well as the analogy questions, are in the data directory.

To run the named entity recognition experiment, it is assumed that the corresponding training and test files are located in the data/conll2003 directory.
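
With the data in place, the experiments can then presumably be run directly from the repository root (an assumption based on the scripts being standalone, not documented in the original README):

$ python test_analogy.py
$ python test_ner.py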

If you have any questions, please don't hesitate to send me an email. And of course, if you find any bugs or want to contribute improvements, pull requests are very welcome!
