vecto-ai / word-benchmarks

Licence: other
Benchmarks for intrinsic word embeddings evaluation.

Projects that are alternatives of or similar to word-benchmarks

Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+320%)
Mutual labels:  word2vec, word-embeddings, evaluation
CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+742.22%)
Mutual labels:  benchmark, evaluation
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+415.56%)
Mutual labels:  word2vec, word-embeddings
Awesome Semantic Segmentation
🤘 awesome-semantic-segmentation
Stars: ✭ 8,831 (+19524.44%)
Mutual labels:  benchmark, evaluation
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+288.89%)
Mutual labels:  word2vec, word-embeddings
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and a nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+335.56%)
Mutual labels:  word2vec, word-embeddings
DiscEval
Discourse Based Evaluation of Language Understanding
Stars: ✭ 18 (-60%)
Mutual labels:  benchmark, evaluation
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+3726.67%)
Mutual labels:  word2vec, word-embeddings
Evo
Python package for the evaluation of odometry and SLAM
Stars: ✭ 1,373 (+2951.11%)
Mutual labels:  benchmark, evaluation
Nas Benchmark
"NAS evaluation is frustratingly hard", ICLR2020
Stars: ✭ 126 (+180%)
Mutual labels:  benchmark, evaluation
Hpatches Benchmark
Python & Matlab code for local feature descriptor evaluation with the HPatches dataset.
Stars: ✭ 129 (+186.67%)
Mutual labels:  benchmark, evaluation
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+28262.22%)
Mutual labels:  word2vec, word-embeddings
word2vec-on-wikipedia
A pipeline for training word embeddings using word2vec on wikipedia corpus.
Stars: ✭ 68 (+51.11%)
Mutual labels:  word2vec, word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+348.89%)
Mutual labels:  word2vec, word-embeddings
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+182.22%)
Mutual labels:  word2vec, word-embeddings
Superpixel Benchmark
An extensive evaluation and comparison of 28 state-of-the-art superpixel algorithms on 5 datasets.
Stars: ✭ 275 (+511.11%)
Mutual labels:  benchmark, evaluation
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+2997.78%)
Mutual labels:  word2vec, word-embeddings
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+160%)
Mutual labels:  word2vec, word-embeddings
Evalne
Source code for EvalNE, a Python library for evaluating Network Embedding methods.
Stars: ✭ 67 (+48.89%)
Mutual labels:  benchmark, evaluation
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+120%)
Mutual labels:  word2vec, word-embeddings

Benchmarks for the intrinsic evaluation of word embeddings

DEVELOPMENT HAS STOPPED; THE PROJECT HAS BEEN TRANSFERRED TO VECTO

1. Word semantic similarity

This method is based on the idea that distances between words in an embedding space can be evaluated against human judgments of the semantic proximity of those words (e.g., the similarity of cup and mug, rated on a continuous scale from 0 to 1, might be about 0.8, since the words are near-synonyms but not identical). An assessor is given a set of word pairs and asked to rate the degree of similarity of each pair. The distances between the same pairs are also computed in the word embedding space, and the two sets of scores are compared: the more similar they are, the better the embeddings.
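
The comparison is usually reported as a Spearman rank correlation between the model's cosine similarities and the human ratings. Below is a minimal sketch of that protocol, assuming a plain `embeddings` dict of word → NumPy vector and a list of `(word1, word2, human_score)` tuples; the names are illustrative, not the actual vecto API.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings, pairs):
    """Spearman correlation between model similarities and human ratings."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in embeddings and w2 in embeddings:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    return spearmanr(model_scores, human_scores).correlation
```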

  1. SimVerb-3500, 3 500 pairs of verbs assessed by semantic similarity (meaning that pairs that are related but not similar receive a fairly low rating) with a scale from 0 to 4.
  2. MEN (acronym for Marco, Elia and Nam), 3 000 pairs assessed by semantic relatedness with a discrete scale from 0 to 50.
  3. RW (acronym for Rare Word), 2 034 pairs of words with low occurrences (rare words) assessed by semantic similarity with a scale from 0 to 10.
  4. SimLex-999, 999 pairs assessed with a strong respect to semantic similarity with a scale from 0 to 10.
  5. SemEval-2017, 500 pairs assessed by semantic similarity with a scale from 0 to 4, prepared for SemEval-2017 Task 2 (Multilingual and Cross-lingual Semantic Word Similarity). Notably, the dataset contains not only single words but also collocations (e.g., climate change).
  6. MTurk-771 (acronym for Mechanical Turk), 771 pairs assessed by semantic relatedness with a scale from 0 to 5.
  7. WordSim-353, 353 pairs assessed by semantic similarity (however, some researchers find the instructions for assessors ambiguous with respect to similarity and association) with a scale from 0 to 10.
  8. MTurk-287, 287 pairs assessed by semantic relatedness with a scale from 0 to 5.
  9. WordSim-353-REL, 252 pairs, a subset of WordSim-353 containing no pairs of similar concepts.
  10. WordSim-353-SIM, 203 pairs, a subset of WordSim-353 containing pairs that are either similar or unassociated (all pairs with a low rating were marked as unassociated).
  11. Verb-143, 143 pairs of verbs assessed by semantic similarity with a scale from 0 to 4.
  12. YP-130 (acronym for Yang and Powers), 130 pairs of verbs assessed by semantic similarity with a scale from 0 to 4.
  13. RG-65 (acronym for Rubenstein and Goodenough), 65 pairs assessed by semantic similarity with a scale from 0 to 4.
  14. MC-30 (acronym for Miller and Charles), 30 pairs, a subset of RG-65 which contains 10 pairs with high similarity, 10 with middle similarity and 10 with low similarity.

2. Word analogy

This evaluation method (in some works also called analogical reasoning, linguistic regularities or word semantic coherence) is the second most popular method of word embedding evaluation. It is based on the idea that arithmetic operations in a word vector space can be predicted by humans: given three words $a$, $a^*$ and $b$, the task is to identify a word $b^*$ such that the relation $b:b^*$ is the same as the relation $a:a^*$. For instance, given $a = Paris$, $a^* = France$ and $b = Moscow$, the target word is $Russia$: the relation $a:a^*$ is capital:country, so one needs to find the country whose capital is Moscow.
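
A common way to score this task is the vector-offset (3CosAdd) rule: predict $b^*$ as the vocabulary word closest to $a^* - a + b$. The sketch below assumes a dict of unit-normalised vectors and only illustrates the idea; it is not the exact scoring procedure of any particular benchmark.

```python
import numpy as np

def solve_analogy(embeddings, a, a_star, b):
    """Return the word closest to a* - a + b, excluding the query words."""
    target = embeddings[a_star] - embeddings[a] + embeddings[b]
    target /= np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, a_star, b):
            continue
        score = float(np.dot(vec, target))  # cosine, vectors assumed normalised
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. solve_analogy(embeddings, "Paris", "France", "Moscow") should yield "Russia";
# accuracy over a dataset is the fraction of questions answered correctly.
```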

  1. WordRep, 118 292 623 analogy questions (4-word tuples) divided into 26 semantic classes, a superset of Google Analogy with additional data from WordNet.
  2. BATS (acronym for Bigger Analogy Test Set), 99 200 questions divided into 4 classes (inflectional morphology, derivational morphology, lexicographic semantics and encyclopedic semantics) and 10 smaller subclasses.
  3. Google Analogy (also called Semantic-Syntactic Word Relationship Dataset), 19 544 questions divided into 2 classes (morphological relations and semantic relations) and 10 smaller subclasses (8 869 semantic questions and 10 675 morphological questions).
  4. SemEval-2012, 10 014 questions divided into 10 semantic classes and 79 subclasses, prepared for SemEval-2012 Task 2 (Measuring Degrees of Relational Similarity).
  5. MSR (acronym for Microsoft Research Syntactic Analogies), 8 000 questions divided into 16 morphological classes.
  6. SAT (acronym for Scholastic Aptitude Test), 5 610 questions divided into 374 semantic classes.
  7. JAIR (acronym for Journal of Artificial Intelligence Research), 430 questions divided into 20 semantic classes. Notably, the dataset contains not only words but also collocations (e.g., solar system).

3. Concept categorization

This method (also called word clustering) evaluates the ability of a word embedding space to be clustered. Given a set of words, the task is to split it into subsets of words belonging to different categories (for example, for the words $dog$, $elephant$, $robin$ and $crow$, the first two form one cluster, $mammals$, and the last two form another, $birds$; the cluster names themselves do not have to be produced). The number of clusters is given in advance. A possible criticism of this method concerns the choice of the most appropriate clustering algorithm and of an adequate metric for evaluating clustering quality.
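
In practice this is often run by clustering the word vectors (e.g. with k-means) and scoring the result against the gold categories with a measure such as purity. A minimal sketch, assuming scikit-learn and gold labels aligned with the word list (function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def purity(labels_true, labels_pred):
    """Fraction of words falling into the majority gold category of their cluster."""
    correct = 0
    for cluster in set(labels_pred):
        gold = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        correct += max(gold.count(g) for g in set(gold))
    return correct / len(labels_true)

def evaluate_categorization(embeddings, words, categories):
    X = np.vstack([embeddings[w] for w in words])
    pred = KMeans(n_clusters=len(set(categories)), n_init=10).fit_predict(X)
    return purity(categories, pred)
```

The choice of k-means and purity here is exactly the kind of decision the criticism above refers to; other algorithms and metrics are equally defensible.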

  1. BM (acronym for Battig and Montague), 5 321 words divided into 56 categories.
  2. AP (acronym for Almuhareb and Poesio), 402 words divided into 21 categories.
  3. BLESS (acronym for Baroni and Lenci Evaluation of Semantic Spaces), 200 words divided into 27 semantic classes. Although BLESS was designed for a different type of evaluation, it can also be used for the word categorization task.
  4. ESSLLI-2008 (acronym for the European Summer School in Logic, Language and Information), 45 words divided into 9 semantic classes (or 5 in a less detailed categorization); the dataset was used in a shared task at the ESSLLI-2008 Lexical Semantics Workshop.

4. Outlier word detection

This method evaluates the same property of word embeddings as concept categorization (it also relies on the geometry of clusters), but instead of dividing a set of words into a certain number of clusters, the task is to identify a semantically anomalous word in an already formed cluster (for example, in the set ${orange, banana, lemon, book}$, which consists mostly of fruits, the word $book$ is the outlier since it is not a fruit).
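
One simple scoring rule (an illustrative assumption, not necessarily the exact procedure used by the datasets below) is to flag the word with the lowest average cosine similarity to the rest of the set:

```python
import numpy as np

def find_outlier(embeddings, cluster):
    """Return the word least similar, on average, to the other words in the set."""
    vecs = {w: embeddings[w] / np.linalg.norm(embeddings[w]) for w in cluster}

    def avg_sim(w):
        return float(np.mean([np.dot(vecs[w], vecs[o]) for o in cluster if o != w]))

    return min(cluster, key=avg_sim)

# e.g. find_outlier(embeddings, ["orange", "banana", "lemon", "book"]) -> "book"
```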

  1. 8-8-8 Dataset, 8 clusters, each is represented by a set of 8 words with 8 outliers.
  2. WordSim-500, 500 clusters, each is represented by a set of 8 words with 5 to 7 outliers.