
IsaacChanghau / Word2VecfJava

License: MIT
Word2VecfJava: Java implementation of Dependency-Based Word Embeddings and extensions

Programming Languages

Java
Roff

Projects that are alternatives to or similar to Word2VecfJava

Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+91064.29%)
Mutual labels:  word-embeddings, word-similarity
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+22821.43%)
Mutual labels:  word-embeddings
Lftm
Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)
Stars: ✭ 168 (+1100%)
Mutual labels:  word-embeddings
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and a nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+1300%)
Mutual labels:  word-embeddings
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+1150%)
Mutual labels:  word-embeddings
Question Generation
Generating multiple choice questions from text using Machine Learning.
Stars: ✭ 227 (+1521.43%)
Mutual labels:  word-embeddings
Mimick
Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017)
Stars: ✭ 152 (+985.71%)
Mutual labels:  word-embeddings
overview-and-benchmark-of-traditional-and-deep-learning-models-in-text-classification
NLP tutorial
Stars: ✭ 41 (+192.86%)
Mutual labels:  word-embeddings
Spanish Word Embeddings
Spanish word embeddings computed with different methods and from different corpora
Stars: ✭ 236 (+1585.71%)
Mutual labels:  word-embeddings
Jfasttext
Java interface for fastText
Stars: ✭ 193 (+1278.57%)
Mutual labels:  word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+1250%)
Mutual labels:  word-embeddings
Texthero
Text preprocessing, representation and visualization from zero to hero.
Stars: ✭ 2,407 (+17092.86%)
Mutual labels:  word-embeddings
Wordgcn
ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks
Stars: ✭ 230 (+1542.86%)
Mutual labels:  word-embeddings
Sifrank zh
Chinese keyphrase extraction based on pre-trained language models (Chinese-language implementation of the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model")
Stars: ✭ 175 (+1150%)
Mutual labels:  word-embeddings
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+607.14%)
Mutual labels:  word-embeddings
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (+1242.86%)
Mutual labels:  word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+1342.86%)
Mutual labels:  word-embeddings
two-stream-cnn
A two-stream convolutional neural network for learning arbitrary similarity functions over two sets of training data
Stars: ✭ 24 (+71.43%)
Mutual labels:  word-embeddings
HiCE
Code for ACL'19 "Few-Shot Representation Learning for Out-Of-Vocabulary Words"
Stars: ✭ 56 (+300%)
Mutual labels:  word-embeddings
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+1557.14%)
Mutual labels:  word-embeddings

Word2VecfJava

Author

Word2VecfJava is a Java implementation of the paper Dependency-Based Word Embeddings (Levy and Goldberg, ACL 2014), together with several extensions.

The algorithm uses the skip-gram method trained with a shallow neural network, and the input corpus is pre-processed with the Stanford Dependency Parser. For more background on word-embedding techniques, the related literature is easy to find online. Usage is shown in the examples; a rough sketch of reading the resulting vectors follows below.
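As a companion to the examples, the sketch below reads a word2vec-style text vector file into a map. The path data/words.vec is a placeholder, and the "vocabSize dim" header assumption follows the common word2vec text layout; this is not guaranteed to match the project's exact output format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LoadVectors {
    public static void main(String[] args) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data/words.vec"))) {
            // First line is assumed to be the header: "<vocabSize> <dim>".
            String[] header = reader.readLine().trim().split("\\s+");
            int dim = Integer.parseInt(header[1]);
            String line;
            while ((line = reader.readLine()) != null) {
                // Each remaining line: "<word> <v1> <v2> ... <vdim>".
                String[] parts = line.trim().split("\\s+");
                float[] vec = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vec[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vec);
            }
        }
        System.out.println("Loaded " + vectors.size() + " vectors");
    }
}
```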

Requirements

Notes

The Word2Vecf project is a modification of the original word2vec by Mikolov et al. that allows:

  1. performing multiple iterations over the data;
  2. using arbitrary context features;
  3. dumping the context vectors at the end of the process.

Unlike the original Word2Vec project, which can be used directly, Word2Vecf requires some pre-computation, since it DOES NOT handle vocabulary construction and DOES NOT read sentences or paragraphs directly as input.

The expected files are:

  1. word_vocabulary: a file mapping words (strings) to their counts.
  2. context_vocabulary: a file mapping contexts (strings) to their counts, used to construct the sampling table for negative-sampling training.
  3. training_data: a textual file of word-context pairs, one pair per line. Each pair has the format "word context", i.e. space-delimited, where word and context are strings. To prefer some contexts over others, construct the training data so that it already contains the bias. An illustrative sample of all three files follows this list.
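To make the formats concrete, here is a made-up sample, assuming one space-delimited entry per line as described above (the words, contexts, and counts are purely illustrative):

```text
word_vocabulary:
scientist 30
discovers 17
star 11

context_vocabulary:
amod_australian 9
nsubj_scientist 14
dobj_star 11

training_data:
scientist amod_australian
discovers nsubj_scientist
discovers dobj_star
```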

To make the project more self-contained, these pre-computations are implemented inside the project as well. Since Word2Vecf builds dependency-based word embeddings, the Stanford Dependency Parser is used to produce the word-context pairs; more usage information can be found on its website. A sketch of the pair extraction is shown below.
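The sketch below shows one way such word-context pairs can be extracted with Stanford CoreNLP, following the Levy-and-Goldberg scheme in which each dependency edge yields a context "relation_modifier" for the head and an inverse context "relationI_head" for the modifier. This is an illustrative sketch under those assumptions, not the project's exact pre-processing code, and it assumes the CoreNLP models are on the classpath.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class PairExtractor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Australian scientist discovers star with telescope");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            for (SemanticGraphEdge edge : graph.edgeListSorted()) {
                String head = edge.getGovernor().word().toLowerCase();
                String mod = edge.getDependent().word().toLowerCase();
                String rel = edge.getRelation().toString();
                // Head word paired with its modifier context, e.g. "discovers nsubj_scientist".
                System.out.println(head + " " + rel + "_" + mod);
                // Modifier paired with the inverse context, e.g. "scientist nsubjI_discovers".
                System.out.println(mod + " " + rel + "I_" + head);
            }
        }
    }
}
```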

Semantic Property Task

  • WordSim353: The WordSim353 set contains 353 word pairs. It was constructed by asking human subjects to rate the degree of semantic similarity or relatedness between two words on a numerical scale. Performance is measured by the Pearson correlation between the cosine similarity of the two words' embeddings and the average score given by the participants.
  • TOEFL: The TOEFL set contains 80 multiple-choice synonym questions, each with 4 candidates. For example, the question word levied has the choices imposed (correct), believed, requested, and correlated. The nearest neighbor of the question word among the candidates, by cosine distance, is chosen, and accuracy measures the performance.
  • Analogy: The analogy task has approximately 9K semantic and 10.5K syntactic analogy questions, of the form "man is to (woman) as king is to queen" or "predict is to (predicting) as dance is to dancing". Following previous work, the nearest neighbor of "queen − king + man" in the vocabulary is taken as the answer, and accuracy again measures the performance. This dataset is much larger than the previous two, so its results are more stable. A sketch of these evaluation primitives follows the list.
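All three tasks reduce to cosine similarity over embedding vectors, plus simple vector arithmetic for the analogies. Below is a minimal, self-contained sketch of those primitives; the class and helper names are illustrative, not part of this project's API.

```java
import java.util.Map;

public final class EmbeddingEval {

    /** Cosine similarity between two equal-length vectors. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** TOEFL-style question: pick the candidate nearest to the cue word. */
    static String nearestChoice(Map<String, float[]> vecs, String cue, String... candidates) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (String c : candidates) {
            double sim = cosine(vecs.get(cue), vecs.get(c));
            if (sim > bestSim) { bestSim = sim; best = c; }
        }
        return best;
    }

    /** Analogy target, e.g. queen - king + man; the answer is its nearest vocabulary word. */
    static float[] analogyTarget(float[] queen, float[] king, float[] man) {
        float[] target = new float[queen.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = queen[i] - king[i] + man[i];
        }
        return target;
    }
}
```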

Reference

Other Information

Version Log

Word2Vecf C Code Usage
