
tca19 / Dict2vec

Licence: gpl-3.0

Dict2vec is a framework to learn word embeddings using lexical dictionaries.

Programming Languages

python

139335 projects - #7 most used programming language

50402 projects - #5 most used programming language

Labels

word2vec embeddings word-embeddings

Projects that are alternatives of or similar to Dict2vec

word2vec-tsne

Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.

Stars: ✭ 59 (-35.16%)

Mutual labels: word2vec, word-embeddings, embeddings

Magnitude

A fast, efficient universal vector embedding utility package.

Stars: ✭ 1,394 (+1431.87%)

Mutual labels: word2vec, embeddings, word-embeddings

lda2vec

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019

Stars: ✭ 27 (-70.33%)

Mutual labels: word2vec, word-embeddings, embeddings

Dna2vec

dna2vec: Consistent vector representations of variable-length k-mers

Stars: ✭ 117 (+28.57%)

Mutual labels: word2vec, embeddings, word-embeddings

sentiment-analysis-of-tweets-in-russian

Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.

Stars: ✭ 51 (-43.96%)

Mutual labels: word2vec, word-embeddings, embeddings

game2vec

TensorFlow implementation of word2vec applied on https://www.kaggle.com/tamber/steam-video-games dataset, using both CBOW and Skip-gram.

Stars: ✭ 62 (-31.87%)

Mutual labels: word2vec, embeddings

cade

Compass-aligned Distributional Embeddings. Align embeddings from different corpora

Stars: ✭ 29 (-68.13%)

Mutual labels: word2vec, embeddings

Glove As A Tensorflow Embedding Layer

Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.

Stars: ✭ 85 (-6.59%)

Mutual labels: word2vec, word-embeddings

Deeplearning Nlp Models

A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch. Colab notebooks to run with GPUs. Models: word2vec, CNNs, transformer, gpt.

Stars: ✭ 64 (-29.67%)

Mutual labels: word2vec, embeddings

SWDM

SIGIR 2017: Embedding-based query expansion for weighted sequential dependence retrieval model

Stars: ✭ 35 (-61.54%)

Mutual labels: word2vec, word-embeddings

Vectorhub

Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)

Stars: ✭ 317 (+248.35%)

Mutual labels: word2vec, embeddings

Chinese Word Vectors

100+ Chinese Word Vectors 上百种预训练中文词向量

Stars: ✭ 9,548 (+10392.31%)

Mutual labels: embeddings, word-embeddings

Text-Analysis

Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.

Stars: ✭ 48 (-47.25%)

Mutual labels: word2vec, word-embeddings

Persian-Sentiment-Analyzer

Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )

Stars: ✭ 30 (-67.03%)

Mutual labels: word2vec, embeddings

go2vec

Read and use word2vec vectors in Go

Stars: ✭ 44 (-51.65%)

Mutual labels: word2vec, embeddings

reach

Load embeddings and featurize your sentences.

Stars: ✭ 17 (-81.32%)

Mutual labels: word2vec, embeddings

Wego

Word Embeddings (e.g. Word2Vec) in Go!

Stars: ✭ 336 (+269.23%)

Mutual labels: word2vec, word-embeddings

Text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

Stars: ✭ 715 (+685.71%)

Mutual labels: word2vec, word-embeddings

Deep learning nlp

Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP

Stars: ✭ 407 (+347.25%)

Mutual labels: word2vec, word-embeddings

Philo2vec

An implementation of word2vec applied to [stanford philosophy encyclopedia](http://plato.stanford.edu/)

Stars: ✭ 33 (-63.74%)

Mutual labels: word2vec, embeddings


     Dict2vec : Learning Word Embeddings using Lexical Dictionaries
     ==============================================================

CONTENT

1. PREAMBLE
2. ABOUT
3. REQUIREMENTS
4. USAGE
  1. Train word embeddings
  2. Evaluate word embeddings
  3. Download Dict2vec pre-trained word embeddings
  4. Download Wikipedia training corpora and dictionary definitions
5. AUTHOR
6. COPYRIGHT
                     ------------------------------

PREAMBLE

This work is one of the contributions of my PhD thesis, entitled "Improving methods to learn word representations for efficient semantic similarities computations", in which I propose new methods to learn better word embeddings. The thesis is freely available at https://github.com/tca19/phd-thesis.

ABOUT

This repository contains source code to train word embeddings with the Dict2vec model, which uses both Wikipedia and dictionary definitions during training. It also contains scripts to evaluate learned word embeddings (trained with Dict2vec or any other method), to download Wikipedia training corpora, to fetch dictionary definitions from online dictionaries and to generate strong and weak pairs from the definitions. The paper describing the Dict2vec model is available at https://www.aclweb.org/anthology/D17-1024/.

If you use this repository, please cite:

@inproceedings{tissier2017dict2vec,
  title     = {Dict2vec : Learning Word Embeddings using Lexical Dictionaries},
  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {sep},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/D17-1024},
  doi       = {10.18653/v1/D17-1024},
  pages     = {254--263},
}

REQUIREMENTS

To compile and run the Dict2vec model, you will need the programs:
- gcc (4.8.4 or newer)
- make
To evaluate the learned embeddings on the word similarity task, you will need to install on your system:
- python3
- numpy (python3 version)
- scipy (python3 version)
To fetch definitions from online dictionaries, you will need to install on your system:
- python3
To run demo scripts and download training data, you will also need a system with wget, bzip2, perl and bash installed.

USAGE

1. Train word embeddings
Before running the example script, open the file demo-train.sh and modify line 62 so that the variable THREADS is equal to the number of cores in your machine. By default, it is equal to 8, so if your machine only has 4 cores, update it to be:

THREADS=4
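
If you are unsure how many cores your machine has, a short Python snippet can report it. This is only a convenience sketch and is not part of the repository; note that os.cpu_count() reports logical cores, which may be higher than the number of physical cores.

```python
import os

# os.cpu_count() can return None when the count is undetermined,
# so fall back to a single thread in that case.
threads = os.cpu_count() or 1

# Print the line to paste into demo-train.sh.
print(f"THREADS={threads}")
```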

Run demo-train.sh to get a quick glimpse of Dict2vec performance.

./demo-train.sh

This will:
- download a training file of 50M words
- download strong and weak pairs for training
- compile Dict2vec source code into a binary executable
- train word embeddings with a dimension of 100
- evaluate the embeddings on 11 word similarity datasets
To directly compile the code and interact with the software, run:

make && ./dict2vec

Full documentation of every parameter is displayed when you run ./dict2vec without any arguments.

2. Evaluate word embeddings
Run evaluate.py to evaluate trained word embeddings. Once the evaluation is done, you get something like this:

./evaluate.py embeddings.txt

Filename        | AVG  | MIN  | MAX  | STD  | Missed words/pairs

Card-660.txt    | 0.598| 0.598| 0.598| 0.000| 33% / 50%
MC-30.txt       | 0.861| 0.861| 0.861| 0.000|  0% /  0%
MEN-TR-3k.txt   | 0.746| 0.746| 0.746| 0.000|  0% /  0%
MTurk-287.txt   | 0.648| 0.648| 0.648| 0.000|  0% /  0%
MTurk-771.txt   | 0.675| 0.675| 0.675| 0.000|  0% /  0%
RG-65.txt       | 0.860| 0.860| 0.860| 0.000|  0% /  0%
RW-STANFORD.txt | 0.505| 0.505| 0.505| 0.000|  1% /  2%
SimLex999.txt   | 0.452| 0.452| 0.452| 0.000|  0% /  0%
SimVerb-3500.txt| 0.417| 0.417| 0.417| 0.000|  0% /  0%
WS-353-ALL.txt  | 0.725| 0.725| 0.725| 0.000|  0% /  0%
WS-353-REL.txt  | 0.637| 0.637| 0.637| 0.000|  0% /  0%
WS-353-SIM.txt  | 0.741| 0.741| 0.741| 0.000|  0% /  0%
YP-130.txt      | 0.635| 0.635| 0.635| 0.000|  0% /  0%

W.Average       | 0.570

For each word similarity dataset, the script computes Spearman's rank correlation and the out-of-vocabulary (OOV) rate. It also reports an average over all datasets, weighted by the number of pairs evaluated on each dataset. We provide the following evaluation datasets in data/eval/:
- Card-660 (Pilehvar et al., 2018)
- MC-30 (Miller and Charles, 1991)
- MEN (Bruni et al., 2014)
- MTurk-287 (Radinsky et al., 2011)
- MTurk-771 (Halawi et al., 2012)
- RG-65 (Rubenstein and Goodenough, 1965)
- RW (Luong et al., 2013)
- SimLex-999 (Hill et al., 2014)
- SimVerb-3500 (Gerz et al., 2016)
- WordSim-353 (Finkelstein et al., 2001)
- YP-130 (Yang and Powers, 2006)
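
To make the reported metrics concrete, here is a minimal pure-Python sketch of this kind of evaluation. It is not the code of evaluate.py: it assumes that the predicted similarity of a pair is the cosine similarity of the two word vectors, and that a pair is counted as missed when either word has no vector. All function names are hypothetical. In practice, scipy.stats.spearmanr computes the same correlation.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cosine(u, v):
    """Cosine similarity (assumed here as the predicted pair score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def evaluate_dataset(embeddings, pairs):
    """Score one dataset of (word1, word2, human_score) triples.

    Returns (rho, number of evaluated pairs, rate of missed pairs).
    """
    predicted, gold = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(score)
    missed_pairs = 1 - len(gold) / len(pairs)
    return spearman(predicted, gold), len(gold), missed_pairs


def weighted_average(results):
    """Average rho over datasets, weighted by evaluated pair counts.

    results: list of (rho, number of evaluated pairs).
    """
    total = sum(n for _, n in results)
    return sum(rho * n for rho, n in results) / total
```

Weighting by the number of evaluated pairs means that large datasets such as MEN or SimVerb-3500 influence the final average more than small ones such as MC-30.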
This script is also able to evaluate several embeddings files at the same time, and compute the average score as well as the standard deviation. To evaluate several embeddings, simply add multiple filenames as arguments:

./evaluate.py embedding-1.txt embedding-2.txt embedding-3.txt

The evaluation script indicates:
- AVG: the average score of all embeddings for each dataset
- MIN: the minimum score of all embeddings for each dataset
- MAX: the maximum score of all embeddings for each dataset
- STD: the standard deviation score of all embeddings for each dataset
When you evaluate only one embedding, you get the same value for AVG/MIN/MAX and a standard deviation STD of 0.
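
The four columns can be reproduced with a few lines of Python. This is a sketch rather than the code of evaluate.py, and it assumes STD is the population standard deviation (the default of numpy.std), which is what yields 0 for a single embeddings file. The function name is hypothetical.

```python
import statistics


def summarize(scores):
    """Summarize the scores obtained on one dataset.

    scores: one Spearman score per embeddings file.
    """
    return {
        "AVG": statistics.fmean(scores),
        "MIN": min(scores),
        "MAX": max(scores),
        # Population standard deviation: defined even for a single value.
        "STD": statistics.pstdev(scores),
    }
```

For example, summarize([0.598]) gives the same value for AVG, MIN and MAX, and a STD of 0.0.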

3. Download Dict2vec pre-trained word embeddings
We provide word embeddings trained with the Dict2vec model on the July 2017 English version of Wikipedia. Vectors with dimension 100 (resp. 200) were trained on the first 50M (resp. 200M) words of this corpus, whereas vectors with dimension 300 were trained on the full corpus. The first line of each file gives the number of words and the dimension. Each following line contains a word followed by its space-separated vector values. If you use these word embeddings, please cite the paper as explained in section "2. ABOUT".
- dimension 100 [https://mega.nz/file/Y0RmyI5S#SlupdHC2R7wMpHYWhaN9wYEKxsxEmZO_7Z-64hHnwqM]
- dimension 200 [https://mega.nz/file/UowxyBKA#nbiP5Os6GXmk-dGFEZkuj4aS0Uewcd81Z2NWGvcc460]
- dimension 300 [https://mega.nz/file/Et53UJrB#O4TAagLBgrBRnEi2liWzhOHuAaVsxUqKRfARYgK_n4o]
You need to extract the embeddings before using them. Use the following command to do so:

tar xvjf dict2vec100.tar.bz2
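
Once extracted, the file can be read with a few lines of Python. The sketch below follows the file format described above and assumes that the two header values and the vector values are separated by whitespace. The function name is hypothetical and the code is not part of the repository.

```python
def load_embeddings(path):
    """Read a text embeddings file into a dict mapping word -> list of floats.

    Expected layout: a header line with the number of words and the
    dimension, then one line per word with the word and its vector values.
    Returns (embeddings, n_words, dimension).
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        n_words, dimension = map(int, f.readline().split())
        for line in f:
            parts = line.split()
            if len(parts) != dimension + 1:
                continue  # skip empty or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings, n_words, dimension
```

Comparing len(embeddings) with the returned n_words is a quick way to check that no line was skipped.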

4. Download Wikipedia training corpora and dictionary definitions
For Wikipedia corpora, you can generate the same 3 files (50M, 200M and full) we use for training in the paper by running ./wiki-dl.sh.

This script will download the full English Wikipedia dump of January 2021, uncompress it and directly feed it into Mahoney's parser script [1]. It also cuts the entire dump into two smaller datasets: one containing the first 50M tokens (enwiki-50M), and the other one containing the first 200M tokens (enwiki-200M). The training corpora have the following filesizes:
- enwiki-50M: 296MB
- enwiki-200M: 1.16GB
- enwiki-full: 29.5GB
[1] http://mattmahoney.net/dc/textdata#appendixa
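
To illustrate the truncation step, here is a sketch of how the first N tokens of a large corpus could be extracted in Python. It does not reproduce the contents of wiki-dl.sh; it assumes tokens are separated by whitespace, and it reads the file in fixed-size chunks because a parsed dump may be stored as one very long line. The function name and the chunk_size parameter are hypothetical.

```python
def first_tokens(src_path, dst_path, limit, chunk_size=1 << 20):
    """Write the first `limit` whitespace-separated tokens of src to dst.

    Returns the number of tokens written.
    """
    written = 0
    carry = ""  # partial token left at the end of the previous chunk
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        while written < limit:
            chunk = src.read(chunk_size)
            if not chunk:
                # End of file: whatever is carried over is a full token.
                tokens = carry.split()
                carry = ""
            else:
                text = carry + chunk
                tokens = text.split()
                # If the chunk ends mid-token, hold that token back.
                if tokens and not text[-1].isspace():
                    carry = tokens.pop()
                else:
                    carry = ""
            tokens = tokens[: limit - written]
            if tokens:
                separator = "" if written == 0 else " "
                dst.write(separator + " ".join(tokens))
                written += len(tokens)
            if not chunk:
                break
    return written
```

With this sketch, first_tokens("enwiki-full", "enwiki-50M", 50_000_000) would produce a file holding the first 50M tokens, assuming the input file is named enwiki-full.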

For dictionary definitions, we provide scripts to download online definitions and generate strong/weak pairs based on these definitions. More information and full documentation can be found in the folder dict-dl/ of this repository.
AUTHOR

Written by Julien Tissier.

COPYRIGHT

This software is licensed under the GNU GPLv3 license. See the LICENSE file for more details.


Stars: ✭ 91
