
clips / Dutchembeddings

Licence: gpl-2.0
Repository for the word embeddings experiments described in "Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource", presented at LREC 2016.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dutchembeddings

Research2vec
Representing research papers as vectors / latent representations.
Stars: ✭ 192 (+170.42%)
Mutual labels:  vector, embeddings
Ml Ai Experiments
All my experiments with AI and ML
Stars: ✭ 107 (+50.7%)
Mutual labels:  word, embeddings
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+469.01%)
Mutual labels:  word, embeddings
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+194.37%)
Mutual labels:  word, embeddings
Vectorai
Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.
Stars: ✭ 195 (+174.65%)
Mutual labels:  vector, embeddings
Vectorhub
Vector Hub - Library for easy discovery and consumption of state-of-the-art models to turn data into vectors (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc.)
Stars: ✭ 317 (+346.48%)
Mutual labels:  vector, embeddings
Finalfrontier
Context-sensitive word embeddings with subwords. In Rust.
Stars: ✭ 61 (-14.08%)
Mutual labels:  word, embeddings
Anki.vector.sdk
Anki Vector .NET SDK
Stars: ✭ 47 (-33.8%)
Mutual labels:  vector
Animatedpencil
Animated Pencil Action view for Android
Stars: ✭ 61 (-14.08%)
Mutual labels:  vector
Libgenerics
libgenerics is a minimalistic and generic library for C basic data structures.
Stars: ✭ 42 (-40.85%)
Mutual labels:  vector
Desktopeditors
An office suite that combines text, spreadsheet and presentation editors, allowing you to create, view and edit local documents
Stars: ✭ 1,008 (+1319.72%)
Mutual labels:  word
Word Checker
🇨🇳🇬🇧 Chinese and English word spelling corrector. (Chinese typo detection and spelling correction; English word spelling checking tool)
Stars: ✭ 48 (-32.39%)
Mutual labels:  word
Documentbuilder
ONLYOFFICE Document Builder is a powerful text, spreadsheet, presentation and PDF generating tool
Stars: ✭ 61 (-14.08%)
Mutual labels:  word
Word2html
a quick and dirty script to convert a Word (docx) document to html.
Stars: ✭ 44 (-38.03%)
Mutual labels:  word
Entity embeddings categorical
Discover relevant information about categorical data with entity embeddings using Neural Networks (powered by Keras)
Stars: ✭ 67 (-5.63%)
Mutual labels:  embeddings
Matrix Puppet Hangouts
Matrix bridge for Google Hangouts
Stars: ✭ 42 (-40.85%)
Mutual labels:  vector
Ttpassgen
Password generation: a flexible and scriptable password dictionary generator which can support brute-force, combination, complex rule modes, etc.
Stars: ✭ 68 (-4.23%)
Mutual labels:  word
Umesimd
UME::SIMD A library for explicit simd vectorization.
Stars: ✭ 66 (-7.04%)
Mutual labels:  vector
Leaflet Geoman
🍂🗺️ The most powerful leaflet plugin for drawing and editing geometry layers
Stars: ✭ 1,088 (+1432.39%)
Mutual labels:  vector
Short Words
visualise lengthy words
Stars: ✭ 56 (-21.13%)
Mutual labels:  word

dutchembeddings

Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.

All embeddings are released under the CC-BY-SA-4.0 license.

The software is released under the GNU GPL 2.0.

These embeddings have been created with the support of Textgain®.

Embeddings

To download the embeddings, please click any of the links in the following table. In almost all cases, the 320-dimensional embeddings outperform the 160-dimensional embeddings.

Corpus      160              320
Roularta    link (mirror)    link (mirror)
Wikipedia   link (mirror)    link (mirror)
Sonar500    link (mirror)    link (mirror)
Combined    link (mirror)    link (mirror)
COW         small (mirror)   big (mirror)

See below for a usage explanation.

Citing

If you use any of these resources, please cite our paper as follows:

@InProceedings{tulkens2016evaluating,
  author = {Stephan Tulkens and Chris Emmery and Walter Daelemans},
  title = {Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }

Please also consider citing the corpora on which the embeddings you use were trained. Without the people who made those corpora, the embeddings could never have been created.

Usage

The embeddings are currently provided as .txt files containing vectors in the word2vec text format, which is structured as follows:

The first line contains the size of the vectors and the vocabulary size, separated by a space.

Ex: 320 50000

Each subsequent line contains the data for a single word as a space-delimited string. The first item is the word itself; the n items that follow are the components of its n-dimensional vector. Because the items are stored as strings, they should be converted to floating-point numbers before use.

Ex: hond 0.2 -0.542 0.253 etc.
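
For illustration, a minimal parser for this layout could look like the sketch below. This is an assumption-laden example, not part of the repository: the helper name load_embeddings and the file name are placeholders, and numpy is assumed to be available.

# Minimal sketch of a parser for the word2vec text format described above.
import numpy as np

def load_embeddings(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        f.readline()  # skip the header line (vector size and vocabulary size)
        for line in f:
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = np.array(values, dtype=np.float32)  # strings -> floats
    return vectors

# Hypothetical usage:
# vectors = load_embeddings("wikipedia-320.txt")
# print(vectors["hond"][:5])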

If you use Python, these files can be loaded with gensim or reach, as follows.

# Gensim
from gensim.models import Word2Vec

# binary=False because the vectors are distributed as plain-text .txt files.
# (In gensim >= 4.0, use KeyedVectors.load_word2vec_format instead.)
model = Word2Vec.load_word2vec_format("path/to/vector", binary=False)
katvec = model['kat']        # vector for 'kat'
model.most_similar('kat')    # nearest neighbours of 'kat'

# Reach
from reach import Reach

r = Reach.load("path/to/vector")   # load the word2vec-format text file
katvec = r['kat']                  # vector for 'kat'
r.most_similar('kat')              # nearest neighbours of 'kat'
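
If you prefer to work with the raw vectors, similarity is easy to compute directly. The snippet below is a small illustration using numpy; the words 'kat' and 'hond' are only examples and are assumed to be in the vocabulary.

# Cosine similarity between two word vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Assuming `model` was loaded with gensim as shown above:
# print(cosine(model['kat'], model['hond']))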

Relationship dataset

If you want to test the quality of your embeddings, you can use the relation.py script. This script takes a .txt file of predicates and creates a dataset that is used for evaluation.

This currently only works with gensim word2vec models or the SPPMI model, as defined above.

Example:

from gensim.models import Word2Vec
from relation import Relation

# Load the predicates.
rel = Relation("data/question-words.txt")

# Load a word2vec model in text format.
model = Word2Vec.load_word2vec_format("path/to/model")

# Test the model on the relationship dataset.
rel.test_model(model)