All Projects → RasaHQ → Whatlies

RasaHQ / Whatlies

Licence: apache-2.0
Toolkit to help understand "what lies" in word embeddings. Also benchmarking!

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Whatlies

Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (-38.62%)
Mutual labels:  embeddings
Parallax
Tool for interactive embeddings visualization
Stars: ✭ 192 (-21.95%)
Mutual labels:  embeddings
Magnetloss Pytorch
PyTorch implementation of a deep metric learning technique called "Magnet Loss" from Facebook AI Research (FAIR) in ICLR 2016.
Stars: ✭ 217 (-11.79%)
Mutual labels:  embeddings
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (-35.37%)
Mutual labels:  embeddings
Datastories Semeval2017 Task4
Deep-learning model presented in "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".
Stars: ✭ 184 (-25.2%)
Mutual labels:  embeddings
Vectorai
Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.
Stars: ✭ 195 (-20.73%)
Mutual labels:  embeddings
Keras Vgg16 Places365
Keras code and weights files for the VGG16-places365 and VGG16-hybrid1365 CNNs for scene classification
Stars: ✭ 138 (-43.9%)
Mutual labels:  embeddings
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (-8.94%)
Mutual labels:  embeddings
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (-23.58%)
Mutual labels:  embeddings
Cause
Code for the Recsys 2018 paper entitled Causal Embeddings for Recommandation.
Stars: ✭ 207 (-15.85%)
Mutual labels:  embeddings
Cofactor
CoFactor: Regularizing Matrix Factorization with Item Co-occurrence
Stars: ✭ 160 (-34.96%)
Mutual labels:  embeddings
Jodie
A PyTorch implementation of ACM SIGKDD 2019 paper "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks"
Stars: ✭ 172 (-30.08%)
Mutual labels:  embeddings
Food2vec
🍔
Stars: ✭ 199 (-19.11%)
Mutual labels:  embeddings
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+711.38%)
Mutual labels:  embeddings
Kprn
Reasoning Over Knowledge Graph Paths for Recommendation
Stars: ✭ 220 (-10.57%)
Mutual labels:  embeddings
Elastiknn
Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
Stars: ✭ 139 (-43.5%)
Mutual labels:  embeddings
Research2vec
Representing research papers as vectors / latent representations.
Stars: ✭ 192 (-21.95%)
Mutual labels:  embeddings
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (-8.94%)
Mutual labels:  embeddings
Deepehr
Chronic Disease Prediction Using Medical Notes
Stars: ✭ 220 (-10.57%)
Mutual labels:  embeddings
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (-15.04%)
Mutual labels:  embeddings

Downloads

whatlies

A library that tries to help you to understand (note the pun).

"What lies in word embeddings?"

This small library offers tools to make visualisation easier of both word embeddings as well as operations on them.

Feedback is welcome.

Produced

This project was initiated at Rasa as a by-product of our efforts in the developer advocacy and research teams. It's an open source project and community contributions are very welcome!

Features

This library has tools to help you understand what lies in word embeddings. This includes:

  • simple tools to create (interactive) visualisations
  • an api for vector arithmetic that you can visualise
  • support for many dimensionality reduction techniques like pca, umap and tsne
  • support for many language backends including spaCy, fasttext, tfhub, huggingface and bpemb
  • lightweight scikit-learn featurizer support for all these backends

Installation

You can install the package via pip;

pip install whatlies

This will install the base dependencies. Depending on the transformers and language backends that you'll be using you may want to install more. Here's all the possible installation settings you could go for.

pip install whatlies[tfhub]
pip install whatlies[transformers]

If you want it all you can also install via;

pip install whatlies[all]

Note that this will install dependencies but it will not install all the language models you might want to visualise. For example, you might still need to manually download spaCy models if you intend to use that backend.

Getting Started

For a quick overview, check out our introductory video on youtube. More in depth getting started guides can be found on the documentation page.

Examples

The idea is that you can load embeddings from a language backend and use mathematical operations on it.

from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage

lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "kitten", "man", "woman",
         "king", "queen", "doctor", "nurse"]

emb = EmbeddingSet(*[lang[w] for w in words])
emb.plot_interactive(x_axis=emb["man"], y_axis=emb["woman"])

You can even do fancy operations. Like projecting onto and away from vector embeddings! You can perform these on embeddings as well as sets of embeddings. In the example below we attempt to filter away gender bias using linear algebra operations.

orig_chart = emb.plot_interactive('man', 'woman')

new_ts = emb | (emb['king'] - emb['queen'])
new_chart = new_ts.plot_interactive('man', 'woman')

There's also things like pca and umap.

from whatlies.transformers import Pca, Umap

orig_chart = emb.plot_interactive('man', 'woman')
pca_plot = emb.transform(Pca(2)).plot_interactive()
umap_plot = emb.transform(Umap(2)).plot_interactive()

pca_plot | umap_plot

Scikit-Learn Support

Every language backend in this video is available as a scikit-learn featurizer as well.

import numpy as np
from whatlies.language import BytePairLanguage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 1, 1, 0, 0, 0])

pipe.fit(X, y)

Documentation

To learn more and for a getting started guide, check out the documentation.

Similar Projects

There are some similar projects out and we figured it fair to mention and compare them here.

Julia Bazińska & Piotr Migdal Web App

The original inspiration for this project came from this web app and this pydata talk. It is a web app that takes a while to load but it is really fun to play with. The goal of this project is to make it easier to make similar charts from jupyter using different language backends.

Tensorflow Projector

From google there's the tensorflow projector project. It offers highly interactive 3d visualisations as well as some transformations via tensorboard.

  • The tensorflow projector will create projections in tensorboard, which you can also load into jupyter notebook but whatlies makes visualisations directly.
  • The tensorflow projector supports interactive 3d visuals, which whatlies currently doesn't.
  • Whatlies offers lego bricks that you can chain together to get a visualisation started. This also means that you're more flexible when it comes to transforming data before visualising it.
Parallax

From Uber AI Labs there's parallax which is described in a paper here. There's a common mindset in the two tools; the goal is to use arbitrary user defined projections to understand embedding spaces better. That said, some differences that are worth to mention.

  • It relies on bokeh as a visualisation backend and offers a lot of visualisation types (like radar plots). Whatlies uses altair and tries to stick to simple scatter charts. Altair can export interactive html/svg but it will not scale as well if you've drawing many points at the same time.
  • Parallax is meant to be run as a stand-alone app from the command line while Whatlies is meant to be run from the jupyter notebook.
  • Parallax gives a full user interface while Whatlies offers lego bricks that you can chain together to get a visualisation started.
  • Whatlies relies on language backends (like spaCy, huggingface) to fetch word embeddings. Parallax allows you to instead fetch raw files on disk.
  • Parallax has been around for a while, Whatlies is more new and therefore more experimental.

Local Development

If you want to develop locally you can start by running this command.

make develop

Documentation

This is generated via

make docs

Citation

Please use the following citation when you found whatlies helpful for any of your work (find the whatlies paper here):

@inproceedings{warmerdam-etal-2020-going,
    title = "Going Beyond {T}-{SNE}: Exposing whatlies in Text Embeddings",
    author = "Warmerdam, Vincent  and
      Kober, Thomas  and
      Tatman, Rachael",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.8",
    doi = "10.18653/v1/2020.nlposs-1.8",
    pages = "52--60",
    abstract = "We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://rasahq.github.io/whatlies/.",
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].