
vinid / cade

Licence: MIT License
Compass-aligned Distributional Embeddings. Align embeddings from different corpora

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to cade

LSCDetection
Data Sets and Models for Evaluation of Lexical Semantic Change Detection
Stars: ✭ 17 (-41.38%)
Mutual labels:  embeddings, lexical-semantics, semantic-change
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-24.14%)
Mutual labels:  word2vec, embeddings
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+620.69%)
Mutual labels:  word2vec, embeddings
navec
Compact high quality word embeddings for Russian language
Stars: ✭ 118 (+306.9%)
Mutual labels:  word2vec, embeddings
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+303.45%)
Mutual labels:  word2vec, embeddings
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (+420.69%)
Mutual labels:  word2vec, embeddings
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+672.41%)
Mutual labels:  word2vec, embeddings
Persian-Sentiment-Analyzer
Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )
Stars: ✭ 30 (+3.45%)
Mutual labels:  word2vec, embeddings
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-6.9%)
Mutual labels:  word2vec, embeddings
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (+103.45%)
Mutual labels:  word2vec, embeddings
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+5024.14%)
Mutual labels:  word2vec, embeddings
reach
Load embeddings and featurize your sentences.
Stars: ✭ 17 (-41.38%)
Mutual labels:  word2vec, embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+4706.9%)
Mutual labels:  word2vec, embeddings
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (+448.28%)
Mutual labels:  word2vec, embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (+213.79%)
Mutual labels:  word2vec, embeddings
Deeplearning Nlp Models
A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch. Colab notebooks to run with GPUs. Models: word2vec, CNNs, transformer, gpt.
Stars: ✭ 64 (+120.69%)
Mutual labels:  word2vec, embeddings
Philo2vec
An implementation of word2vec applied to [stanford philosophy encyclopedia](http://plato.stanford.edu/)
Stars: ✭ 33 (+13.79%)
Mutual labels:  word2vec, embeddings
Finalfusion Rust
finalfusion embeddings in Rust
Stars: ✭ 35 (+20.69%)
Mutual labels:  word2vec, embeddings
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (+37.93%)
Mutual labels:  word2vec, embeddings
sentiment-analysis-of-tweets-in-russian
Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.
Stars: ✭ 51 (+75.86%)
Mutual labels:  word2vec, embeddings

Compass-aligned Distributional Embeddings

This package contains Python code to generate Compass-aligned Distributional Embeddings (CADE), also known as Temporal Word Embeddings with a Compass (TWEC). Comparing word vectors across different corpora requires alignment. We propose a method to align distributional representations based on word2vec. The method is efficient and relies on a simple heuristic: we train a general word embedding, the compass, and use it to freeze one of the layers of the CBOW architecture.
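
To make the heuristic concrete, here is a toy numpy sketch (purely illustrative, not the package's implementation, which relies on a modified gensim): one CBOW weight matrix comes from the compass and is kept frozen, while each slice trains only its own embeddings against it, so all slices end up in a shared coordinate system. Which layer is frozen in practice is described in the papers; this sketch freezes the output matrix only for illustration.

import numpy as np

def train_slice(corpus, word2id, frozen_output, dim=30, window=2,
                negative=5, lr=0.025, epochs=5, seed=0):
    """Toy CBOW with negative sampling: the output matrix (from the compass)
    is never updated; only the slice's own embeddings are trained."""
    rng = np.random.default_rng(seed)
    W = (rng.random((len(word2id), dim)) - 0.5) / dim    # slice embeddings (learned)
    C = frozen_output                                    # compass matrix (frozen)
    for _ in range(epochs):
        for sentence in corpus:
            ids = [word2id[w] for w in sentence if w in word2id]
            for pos, target in enumerate(ids):
                ctx = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not ctx:
                    continue
                h = W[ctx].mean(axis=0)                           # CBOW hidden layer
                samples = np.array([target] + list(rng.integers(0, len(word2id), negative)))
                labels = np.array([1.0] + [0.0] * negative)       # 1 = true target, 0 = noise
                err = 1 / (1 + np.exp(-C[samples] @ h)) - labels  # prediction error
                W[ctx] -= lr * (err @ C[samples]) / len(ctx)      # update W only, C stays fixed
    return W

# the frozen matrix would come from training CBOW on the concatenated corpora;
# here it is random just to keep the sketch self-contained and runnable
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
compass = np.random.default_rng(42).standard_normal((len(vocab), 30))
slice_vectors = train_slice([["the", "cat", "sat", "on", "the", "mat"]] * 50, vocab, compass)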

See the AAAI paper and the arXiv pre-print for more details.

https://raw.githubusercontent.com/vinid/cade/master/img/CADE.png

CADE is easy to use!

https://raw.githubusercontent.com/vinid/cade/master/img/render1587824614545.gif

Reference

This work is based on the following papers (AAAI and arXiv pre-print):

  • Bianchi, F., Di Carlo, V., Nicoli, P., & Palmonari, M. (2020). Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora. arXiv pre-print arXiv:2004.06519. https://arxiv.org/abs/2004.06519
  • Di Carlo, V., Bianchi, F., & Palmonari, M. (2019). Training Temporal Word Embeddings with a Compass. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6326-6334. https://doi.org/10.1609/aaai.v33i01.33016326

Jump start Tutorial

Name Link
Use CADE to align the same text twice Open In Colab

Abstract

Word2vec is one of the most used algorithms to generate word embeddings because of a good mix of efficiency, quality of the generated representations, and cognitive grounding. However, word meaning is not static and depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, where embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence about the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable, which substantially depends on the degree of cross-corpora vocabulary overlap.

What's this About?

Different words assume different meanings in different contexts. Think, for example, of how people once used the word amazon mainly to refer to the forest. Or think about the differences between American and British English. This is what we usually call meaning shift. See some examples of meaning shifts:

https://raw.githubusercontent.com/vinid/cade/master/img/shift_meaning.png

Why not use standard word embeddings? Long story short: different embeddings generated from different corpora are not comparable; they need to be aligned!
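
A tiny demonstration of the problem (a hedged sketch, assuming the gensim 3.x style API that the rest of this README uses): two word2vec models trained independently, even on the same text, live in arbitrarily rotated spaces, so comparing their vectors directly is meaningless.

from gensim.models.word2vec import Word2Vec
from scipy.spatial.distance import cosine

# two independent trainings of the same tiny corpus
sentences = [["the", "cat", "sat", "on", "the", "mat"]] * 200
m1 = Word2Vec(sentences, size=30, min_count=1, seed=1)
m2 = Word2Vec(sentences, size=30, min_count=1, seed=2)

# same word, same corpus, but the two spaces are not aligned:
# this cross-model similarity is typically nowhere near 1.0
print(1 - cosine(m1.wv["cat"], m2.wv["cat"]))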

With CADE we provide a method to align embeddings trained on different corpora (in the same language) so that they can be compared. Alignment allows us to compare word embeddings from different corpora using cosine similarity!

Here are some examples of mappings between text about Pokemon (from the Pokemon Reddit board) and text about science (again, from Reddit) that you can learn with CADE.

For example, you can take the vector of the word Arceus from the Pokemon corpus and find that it is very similar to the word god in the Science corpus. Wondering why? Arceus is the god of the Pokemon universe! See more examples of mappings like this in the table, where we show the top-5 nearest neighbors in the mapped space!

https://raw.githubusercontent.com/vinid/cade/master/img/mappings.png

Installing

We use a custom/edited implementation of gensim; this WILL clash with your existing gensim installation, so consider installing this package inside a virtual environment:

pip install -U cade

REMEMBER TO USE A VIRTUAL ENVIRONMENT

pip install git+https://github.com/vinid/gensim.git

Guide

  • Remember: when you call CADE's training methods, the class creates a "model/" folder where the trained objects are saved (an example layout is sketched after this list). The compass is trained first and saved in that folder. If you want to overwrite it, remember to set the parameter overwrite=True; otherwise the already trained compass will be reloaded.
  • What you need: the different corpora you want to compare (e.g., text from 1991 and text from 1992, or text from the New York Times and text from The Guardian) and the concatenation of those text slices (the compass).
  • The compass should be the concatenation of the slices you want to align. In the next code section you will see that we use arXiv paper text from two different years. The "compass.txt" file contains the concatenation of both slices.
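
For reference, with the arXiv example used below, after training the compass and both slices the "model/" folder will contain something like this (the slice filenames match the Load Data section further down; the compass filename here is an assumption, so check the folder after your own run):

model/
  arxiv_14.model
  arxiv_9.model
  compass.model   <- assumed name for the saved compass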

How To Use

  • Training

Suppose you have two corpora you want to compare, "arxiv_14.txt" and "arxiv_9.txt". First of all, concatenate the two files into a "compass.txt" file. Now you can train the compass.

cat arxiv_14.txt arxiv_9.txt > compass.txt

Once you have the compass, you can run the tool:

from cade.cade import CADE
from gensim.models.word2vec import Word2Vec
aligner = CADE(size=30)

# train the compass: the text should be the concatenation of the text from the slices
aligner.train_compass("examples/training/compass.txt", overwrite=False) # keep an eye on the overwrite behaviour

You can see that the class accepts the same parameters as Gensim's word2vec implementation. After this first training step you can train the slices:

# now you can train slices and they will be already aligned
# these are gensim word2vec objects
slice_one = aligner.train_slice("examples/training/arxiv_14.txt", save=True)
slice_two = aligner.train_slice("examples/training/arxiv_9.txt", save=True)

These two slices are now aligned and can be compared!

  • Load Data

You can load data as you do with gensim.

model1 = Word2Vec.load("model/arxiv_14.model")
model2 = Word2Vec.load("model/arxiv_9.model")

and you can start comparing them with standard methods:

from scipy.spatial.distance import cosine
print(1 - cosine(model1["like"], model2["sign"]))
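
Because the two slices share one coordinate system, you can also query one model with a vector taken from the other, reproducing mappings like the Arceus/god example above (a hedged sketch; most_similar accepts a raw vector in standard gensim, and the query word here is just an illustrative choice assumed to appear in both vocabularies):

# nearest neighbours in slice 2 of a vector taken from slice 1
query = model1.wv["network"]   # illustrative word, assumed to be in the arXiv vocabulary
print(model2.wv.most_similar(positive=[query], topn=5))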

People

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].