
IsaacChanghau / Word2VecfJava

License: MIT
Word2VecfJava: Java implementation of Dependency-Based Word Embeddings and extensions

Programming Languages

Java
Roff

Projects that are alternatives to or similar to Word2VecfJava

Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+91064.29%)
Mutual labels:  word-embeddings, word-similarity
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+22821.43%)
Mutual labels:  word-embeddings
Lftm
Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)
Stars: ✭ 168 (+1100%)
Mutual labels:  word-embeddings
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and a nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+1300%)
Mutual labels:  word-embeddings
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+1150%)
Mutual labels:  word-embeddings
Question Generation
Generating multiple choice questions from text using Machine Learning.
Stars: ✭ 227 (+1521.43%)
Mutual labels:  word-embeddings
Mimick
Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017)
Stars: ✭ 152 (+985.71%)
Mutual labels:  word-embeddings
overview-and-benchmark-of-traditional-and-deep-learning-models-in-text-classification
NLP tutorial
Stars: ✭ 41 (+192.86%)
Mutual labels:  word-embeddings
Spanish Word Embeddings
Spanish word embeddings computed with different methods and from different corpora
Stars: ✭ 236 (+1585.71%)
Mutual labels:  word-embeddings
Jfasttext
Java interface for fastText
Stars: ✭ 193 (+1278.57%)
Mutual labels:  word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+1250%)
Mutual labels:  word-embeddings
Texthero
Text preprocessing, representation and visualization from zero to hero.
Stars: ✭ 2,407 (+17092.86%)
Mutual labels:  word-embeddings
Wordgcn
ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
Stars: ✭ 230 (+1542.86%)
Mutual labels:  word-embeddings
Sifrank zh
Chinese keyphrase extraction based on pre-trained language models (Chinese-language implementation of the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")
Stars: ✭ 175 (+1150%)
Mutual labels:  word-embeddings
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+607.14%)
Mutual labels:  word-embeddings
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (+1242.86%)
Mutual labels:  word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+1342.86%)
Mutual labels:  word-embeddings
two-stream-cnn
A two-stream convolutional neural network for learning arbitrary similarity functions over two sets of training data
Stars: ✭ 24 (+71.43%)
Mutual labels:  word-embeddings
HiCE
Code for ACL'19 "Few-Shot Representation Learning for Out-Of-Vocabulary Words"
Stars: ✭ 56 (+300%)
Mutual labels:  word-embeddings
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+1557.14%)
Mutual labels:  word-embeddings

Word2VecfJava

Author

Word2VecfJava is a Java implementation of the paper Dependency-Based Word Embeddings (Levy and Goldberg, ACL 2014), together with several extensions.

The algorithm uses the skip-gram method trained with a shallow neural network, and the input corpus is pre-processed with the Stanford Dependency Parser. For more background on word-embedding techniques, the related literature is easy to find online. Usage is shown in the examples; a rough sketch of reading the resulting vectors follows below.
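As a companion to the examples, the sketch below reads a word2vec-style text vector file into a map. The path data/words.vec is a placeholder, and the "vocabSize dim" header assumption follows the common word2vec text layout; this is not guaranteed to match the project's exact output format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LoadVectors {
    public static void main(String[] args) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data/words.vec"))) {
            // First line is assumed to be the header: "<vocabSize> <dim>".
            String[] header = reader.readLine().trim().split("\\s+");
            int dim = Integer.parseInt(header[1]);
            String line;
            while ((line = reader.readLine()) != null) {
                // Each remaining line: "<word> <v1> <v2> ... <vdim>".
                String[] parts = line.trim().split("\\s+");
                float[] vec = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vec[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vec);
            }
        }
        System.out.println("Loaded " + vectors.size() + " vectors");
    }
}
```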

Requirements

Notes

The Word2Vecf project is a modification of the original word2vec by Mikolov et al. that allows:

  1. performing multiple iterations over the data;
  2. using arbitrary context features;
  3. dumping the context vectors at the end of the process.

Unlike the original Word2Vec project, which can be used directly, Word2Vecf requires some pre-computation, since it DOES NOT handle vocabulary construction and DOES NOT read sentences or paragraphs directly as input.

The expected files are:

  1. word_vocabulary: a file mapping words (strings) to their counts.
  2. context_vocabulary: a file mapping contexts (strings) to their counts, used to construct the sampling table for negative-sampling training.
  3. training_data: a textual file of word-context pairs, one pair per line. Each pair has the format "word context", i.e. space-delimited, where word and context are strings. To prefer some contexts over others, construct the training data so that it already contains the bias. An illustrative sample of all three files follows this list.
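To make the formats concrete, here is a made-up sample, assuming one space-delimited entry per line as described above (the words, contexts, and counts are purely illustrative):

```text
word_vocabulary:
scientist 30
discovers 17
star 11

context_vocabulary:
amod_australian 9
nsubj_scientist 14
dobj_star 11

training_data:
scientist amod_australian
discovers nsubj_scientist
discovers dobj_star
```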

To make the project more self-contained, these pre-computations are implemented inside the project as well. Since Word2Vecf builds dependency-based word embeddings, the Stanford Dependency Parser is used to produce the word-context pairs; more usage information can be found on its website. A sketch of the pair extraction is shown below.
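The sketch below shows one way such word-context pairs can be extracted with Stanford CoreNLP, following the Levy-and-Goldberg scheme in which each dependency edge yields a context "relation_modifier" for the head and an inverse context "relationI_head" for the modifier. This is an illustrative sketch under those assumptions, not the project's exact pre-processing code, and it assumes the CoreNLP models are on the classpath.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class PairExtractor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Australian scientist discovers star with telescope");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            for (SemanticGraphEdge edge : graph.edgeListSorted()) {
                String head = edge.getGovernor().word().toLowerCase();
                String mod = edge.getDependent().word().toLowerCase();
                String rel = edge.getRelation().toString();
                // Head word paired with its modifier context, e.g. "discovers nsubj_scientist".
                System.out.println(head + " " + rel + "_" + mod);
                // Modifier paired with the inverse context, e.g. "scientist nsubjI_discovers".
                System.out.println(mod + " " + rel + "I_" + head);
            }
        }
    }
}
```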

Semantic Property Task

  • WordSim353: The WordSim353 set contains 353 word pairs. It was constructed by asking human subjects to rate the degree of semantic similarity or relatedness between two words on a numerical scale. Performance is measured by the Pearson correlation between the cosine similarity of the two words' embeddings and the average score given by the participants.
  • TOEFL: The TOEFL set contains 80 multiple-choice synonym questions, each with 4 candidates. For example, the question word levied has the choices imposed (correct), believed, requested, and correlated. The nearest neighbor of the question word among the candidates, by cosine distance, is chosen, and accuracy measures the performance.
  • Analogy: The analogy task has approximately 9K semantic and 10.5K syntactic analogy questions, of the form "man is to (woman) as king is to queen" or "predict is to (predicting) as dance is to dancing". Following previous work, the nearest neighbor of "queen − king + man" in the vocabulary is taken as the answer, and accuracy again measures the performance. This dataset is much larger than the previous two, so its results are more stable. A sketch of these evaluation primitives follows the list.
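All three tasks reduce to cosine similarity over embedding vectors, plus simple vector arithmetic for the analogies. Below is a minimal, self-contained sketch of those primitives; the class and helper names are illustrative, not part of this project's API.

```java
import java.util.Map;

public final class EmbeddingEval {

    /** Cosine similarity between two equal-length vectors. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** TOEFL-style question: pick the candidate nearest to the cue word. */
    static String nearestChoice(Map<String, float[]> vecs, String cue, String... candidates) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (String c : candidates) {
            double sim = cosine(vecs.get(cue), vecs.get(c));
            if (sim > bestSim) { bestSim = sim; best = c; }
        }
        return best;
    }

    /** Analogy target, e.g. queen - king + man; the answer is its nearest vocabulary word. */
    static float[] analogyTarget(float[] queen, float[] king, float[] man) {
        float[] target = new float[queen.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = queen[i] - king[i] + man[i];
        }
        return target;
    }
}
```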

Reference

Other Information

Version Log

Word2Vecf C Code Usage
