babylonhealth / fuzzymax

Licence: Apache-2.0
Code for the paper: Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, ICLR 2019.

Note: This repository is no longer actively maintained by Babylon Health. For further assistance, reach out to the paper authors.

FuzzyMax

FuzzyMax is an evaluation framework and a collection of fuzzy set similarity measures for word vectors, described in:

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, and Nils Y. Hammerla, Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, ICLR 2019.

Similarity Measures

Word vectors alone are sufficient to achieve excellent performance on semantic textual similarity (STS) tasks when sentence representations and similarity measures are derived using ideas from fuzzy set theory.

The two important special cases described in the paper are MaxPool-Jaccard:

import numpy as np


def max_jaccard(x, y):
    """
    MaxPool-Jaccard similarity measure between two sentences
    :param x: list of word embeddings for the first sentence
    :param y: list of word embeddings for the second sentence
    :return: similarity score between the two sentences
    """
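    # Max-pool over words (rows), then clip negative components to zero
    m_x = np.maximum(np.max(x, axis=0), 0)
    m_y = np.maximum(np.max(y, axis=0), 0)
    # Fuzzy Jaccard index: sum of element-wise minima over sum of maxima
    m_inter = np.sum(np.minimum(m_x, m_y))
    m_union = np.sum(np.maximum(m_x, m_y))
    return m_inter / m_union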

and DynaMax-Jaccard

import numpy as np


def fuzzify(s, u):
    """
    Sentence fuzzifier.
    Computes membership vector for the sentence S with respect to the
    universe U
    :param s: list of word embeddings for the sentence
    :param u: the universe matrix U with shape (K, d)
    :return: membership vectors for the sentence
    """
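    # Dot product of each word with each universe vector, shape (n_words, K)
    f_s = np.dot(s, u.T)
    # Max-pool over words, then clip negative memberships to zero
    m_s = np.maximum(np.max(f_s, axis=0), 0)
    return m_s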


def dynamax_jaccard(x, y):
    """
    DynaMax-Jaccard similarity measure between two sentences
    :param x: list of word embeddings for the first sentence
    :param y: list of word embeddings for the second sentence
    :return: similarity score between the two sentences
    """
    u = np.vstack((x, y))
    m_x = fuzzify(x, u)
    m_y = fuzzify(y, u)

    m_inter = np.sum(np.minimum(m_x, m_y))
    m_union = np.sum(np.maximum(m_x, m_y))
    return m_inter / m_union
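
As a small usage sketch, the two measures can be tried on toy inputs. The 3-dimensional embeddings below are made up for illustration (real word vectors would come from the files described under Experiments), and both functions are repeated in compact form only so that the block runs on its own.

```python
import numpy as np


def max_jaccard(x, y):
    # Max-pool each sentence over its words, clip negatives to zero,
    # then take the fuzzy Jaccard index of the pooled vectors.
    m_x = np.maximum(np.max(x, axis=0), 0)
    m_y = np.maximum(np.max(y, axis=0), 0)
    return np.sum(np.minimum(m_x, m_y)) / np.sum(np.maximum(m_x, m_y))


def dynamax_jaccard(x, y):
    # The universe is built from the words of both sentences.
    u = np.vstack((x, y))
    m_x = np.maximum(np.max(np.dot(x, u.T), axis=0), 0)
    m_y = np.maximum(np.max(np.dot(y, u.T), axis=0), 0)
    return np.sum(np.minimum(m_x, m_y)) / np.sum(np.maximum(m_x, m_y))


# Hypothetical toy embeddings: one row per word, 3 dimensions each.
sent_a = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
sent_b = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

score_max = max_jaccard(sent_a, sent_b)       # pooled vectors share 1 of 3 active dims
score_dyn = dynamax_jaccard(sent_a, sent_b)   # memberships over a 4-word universe
score_same = dynamax_jaccard(sent_a, sent_a)  # identical sentences

print(score_max, score_dyn, score_same)
```

Both measures divide by the fuzzy union, so a pair whose pooled memberships are all zero would trigger a division by zero; callers may want to guard that case.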

Dependencies

This code is written in Python 3. The requirements are listed in requirements.txt.

pip3 install -r requirements.txt

Evaluation tasks

The experimental framework derived from SentEval evaluates the similarity measures on the following datasets:

| STS 2012 | STS 2013 | STS 2014 | STS 2015 | STS 2016 |
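
STS datasets pair each sentence pair with a human-annotated gold score on a 0 to 5 scale, and a similarity measure is typically judged by how well its scores correlate with the gold ones. The sketch below assumes Pearson correlation as the metric and uses made-up scores; `pearson_r` is a hypothetical helper, not a function from this repository.

```python
import numpy as np


def pearson_r(predicted, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    # corrcoef returns the 2x2 correlation matrix; the off-diagonal is r.
    return float(np.corrcoef(predicted, gold)[0, 1])


# Hypothetical scores for four sentence pairs.
gold_scores = [0.5, 2.0, 3.5, 5.0]
predicted_scores = [0.10, 0.40, 0.70, 1.00]   # exactly 0.2 * gold
reversed_scores = predicted_scores[::-1]      # ranks the pairs backwards

r_good = pearson_r(predicted_scores, gold_scores)
r_bad = pearson_r(reversed_scores, gold_scores)
print(r_good, r_bad)
```

Because the correlation is invariant to positive rescaling, a measure that outputs values in [0, 1] can be compared directly against gold scores in [0, 5].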

To get all the datasets, run (in data/downstream/):

./get_sts_data.bash

This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: macOS users may have to use p7zip instead of unzip).

Experiments

Word vector files must be in word2vec text format and placed in data/word_vectors/. The mapping from word vector model name to filename is defined in evaluation/utils.py. Word count files (if required) go in data/misc/.

WORD_VEC_MAP = {
    'glove': 'glove.840B.300d.w2vformat.txt',
    'word2vec': 'GoogleNews-vectors-negative300.txt',
    'fasttext': 'fasttext-crawl-300d-2M.txt',
    'word2vec_skipgram': 'book_corpus_skip.txt',
    'word2vec_cbow': 'book_corpus_cbow.txt',
    'word2vec_conll': 'word2vec_conll17_skip.txt',
    'psl': 'paragram_300_sl999.w2vformat.txt',
    'ppxxl': 'paragram-phrase-XXL.w2vformat.txt',
    'pnmt': 'paragram-NMT.w2vformat.txt',
    'default': 'glove.840B.300d.w2vformat.txt'
}
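
For readers unfamiliar with the word2vec text format, it consists of a header line giving the vocabulary size and the dimensionality, followed by one line per word with the word and its space-separated components. The reader below is a minimal sketch under that assumption; `load_word2vec_txt` is a hypothetical helper written for illustration and is not the loader used in evaluation/utils.py.

```python
import io

import numpy as np


def load_word2vec_txt(handle):
    """Parse word2vec text format: a 'count dim' header, then one word per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(v) for v in parts[1:]],
                                     dtype=np.float32)
    return vectors, dim


# A tiny in-memory file standing in for a real vector file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vectors, dim = load_word2vec_txt(sample)

# Stack per-word vectors into the (n_words, d) matrix a similarity measure expects.
sentence = np.vstack([vectors[w] for w in ["cat", "dog"]])
print(dim, sentence.shape)
```

With a real file, the same function could be called on an open file handle, for example `open(path, encoding='utf-8')`.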

All the experiments are located in evaluation. They include:

  1. classical.py - classical Jaccard similarity for sets and multisets.
  2. conf_intervals.py - evaluates DynaMax-Jaccard against avg.-cosine and computes 95% BCa confidence intervals for the delta in performance.
  3. fuzzy_eval - DynaMax-Jaccard and MaxPool-Jaccard on all 6 word vectors. Can optionally enable SIF weights.
  4. sif.py - SIF + PCA (Arora et al. 2017)
  5. wmd.py - WMD (Kusner et al. 2015)
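
To make the first baseline in the list concrete, here is a sketch of classical Jaccard similarity over word tokens. The multiset variant shown (minimum counts for the intersection, maximum counts for the union) is one common definition and is an assumption; classical.py may define it differently. The function names are hypothetical.

```python
from collections import Counter


def jaccard_set(tokens_a, tokens_b):
    """Jaccard index over the sets of distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_multiset(tokens_a, tokens_b):
    """Jaccard index over token multisets (repeated words count)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    inter = sum((a & b).values())   # element-wise minimum of counts
    union = sum((a | b).values())   # element-wise maximum of counts
    return inter / union if union else 0.0


s1 = "the cat sat on the mat".split()
s2 = "the cat lay on the rug".split()

set_score = jaccard_set(s1, s2)            # 3 shared of 7 distinct words
multiset_score = jaccard_multiset(s1, s2)  # 4 shared of 8 counted words
print(set_score, multiset_score)
```

The fuzzy measures above generalize this picture: minima and maxima of real-valued memberships take the place of intersections and unions of discrete token counts.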

Feedback and Contact:

If this code is useful to your research, please consider citing

@inproceedings{
zhelezniak2018dont,
title={Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors},
author={Vitalii Zhelezniak and Aleksandar Savkov and April Shen and Francesco Moramarco and Jack Flann and Nils Y. Hammerla},
booktitle={International Conference on Learning Representations},
year={2019},
url={https://openreview.net/forum?id=SkxXg2C5FX},
}

Contact: Vitalii Zhelezniak [email protected]
