
rguthrie3 / MorphologicalPriorsForWordEmbeddings

Licence: other
Code for EMNLP 2016 paper: Morphological Priors for Probabilistic Word Embeddings

Programming Languages

python

Projects that are alternatives to or similar to MorphologicalPriorsForWordEmbeddings

neuralnets-semantics
Word semantics Deep Learning with Vanilla Python, Keras, Theano, TensorFlow, PyTorch
Stars: ✭ 15 (-71.7%)
Mutual labels:  theano, word-embeddings
entrepot
A list of free GitHub.com hosted WordPress plugins, themes & blocks
Stars: ✭ 29 (-45.28%)
Mutual labels:  blocks
Final-year-project-deep-learning-models
Deep learning for freehand sketch object recognition
Stars: ✭ 22 (-58.49%)
Mutual labels:  theano
dasem
Danish semantic analysis
Stars: ✭ 17 (-67.92%)
Mutual labels:  word-embeddings
mcthings
A Python framework for creating 3D scenes in Minecraft and Minetest
Stars: ✭ 44 (-16.98%)
Mutual labels:  blocks
aino-blocks
Aino blocks are a collection of Gutenberg editor blocks for page building in WordPress.
Stars: ✭ 57 (+7.55%)
Mutual labels:  blocks
VNMT
Code for "Variational Neural Machine Translation" (EMNLP2016)
Stars: ✭ 54 (+1.89%)
Mutual labels:  theano
sortboard
A small ES6 library for easy sorting and filtering of elements.
Stars: ✭ 29 (-45.28%)
Mutual labels:  blocks
DeepLearning-IDS
Network Intrusion Detection System using Deep Learning Techniques
Stars: ✭ 76 (+43.4%)
Mutual labels:  theano
word2vec-on-wikipedia
A pipeline for training word embeddings using word2vec on wikipedia corpus.
Stars: ✭ 68 (+28.3%)
Mutual labels:  word-embeddings
bihm
Bidirectional Helmholtz Machines
Stars: ✭ 40 (-24.53%)
Mutual labels:  theano
fetch-all-the-things
A list of *nix fetch utilities
Stars: ✭ 43 (-18.87%)
Mutual labels:  blocks
poet
Configuration-based post type, taxonomy, block category, and block registration for Sage 10.
Stars: ✭ 124 (+133.96%)
Mutual labels:  blocks
PersianNER
Named-Entity Recognition in Persian Language
Stars: ✭ 48 (-9.43%)
Mutual labels:  word-embeddings
Arabic-Word-Embeddings-Word2vec
Arabic Word Embeddings Word2vec
Stars: ✭ 26 (-50.94%)
Mutual labels:  word-embeddings
cudnn rnn theano benchmarks
No description or website provided.
Stars: ✭ 22 (-58.49%)
Mutual labels:  theano
pgsqlblocks
pgSqlBlocks is a standalone application written in Java that makes it easy to browse processes and get information about locks and waiting queries in a PostgreSQL DBMS. It displays the status of the database connection as well as information about processes in the database.
Stars: ✭ 23 (-56.6%)
Mutual labels:  blocks
contextualLSTM
Contextual LSTM for NLP tasks like word prediction and word embedding creation for Deep Learning
Stars: ✭ 28 (-47.17%)
Mutual labels:  word-embeddings
chainDB
A noSQL database based on blockchain technology
Stars: ✭ 13 (-75.47%)
Mutual labels:  blocks
pair2vec
pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference
Stars: ✭ 62 (+16.98%)
Mutual labels:  word-embeddings

Morphological Priors for Probabilistic Neural Word Embeddings

Implementation of Morphological Priors for Probabilistic Neural Word Embeddings.

[Figure: model demo]

VarEmbed in Blocks

This is the implementation for the following paper, to appear at EMNLP 2016: Morphological Priors for Probabilistic Neural Word Embeddings. Parminder Bhatia, Robert Guthrie, Jacob Eisenstein.

Uses LSTMs to build word embeddings that incorporate both word-level and morpheme-level information, implemented with Blocks and Fuel. The LSTM code is modified from https://github.com/johnarevalo/blocks-char-rnn.git.
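As a conceptual sketch only (this is not the repository's Blocks code, and the paper's actual model places a probabilistic morphological prior on latent embeddings rather than summing deterministically), one simple way to combine word-level and morpheme-level information is:

# Conceptual sketch: combine a word-level vector with the sum of the word's
# morpheme vectors. The paper's model is probabilistic; this deterministic
# sum only illustrates the two information sources being combined.
import numpy as np

def embed(word, word_vecs, morpheme_vecs, segment):
    # segment(word) returns the word's morphemes, e.g. from a Morfessor model
    morph_sum = np.sum([morpheme_vecs[m] for m in segment(word)], axis=0)
    return word_vecs[word] + morph_sum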

Requirements

  • Install Blocks. Please see the documentation for more information.

  • Install Fuel. Please see the documentation for more information.

  • Install the Morfessor Python package.

Results

[Figure: results histogram]

Usage

The input can be any raw, pre-tokenized text. The steps below walk through generating the Morfessor model, preprocessing and packaging the data as NDArrays, and training the model.

You will need to train a Morfessor model on your data. A script for this has been provided. It will output a serialized Morfessor model for later use.

python train_morfessor.py --training-data <input.txt> --output <output.bin>
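For orientation, here is a minimal sketch of what such a script might do with the Morfessor 2.0 Python API; the file names are illustrative, and train_morfessor.py in the repository is the authoritative version.

# Sketch of Morfessor training using the Morfessor 2.0 Python API.
# Illustrative only; see train_morfessor.py for the repository's script.
import morfessor

io = morfessor.MorfessorIO()

# read_corpus_file yields (count, compound) pairs from raw tokenized text
train_data = list(io.read_corpus_file("input.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # batch training over the full corpus

io.write_binary_model_file("output.bin", model)  # serialize for later use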

The data set needs to be preprocessed and formatted using preprocess_data.py and make_dataset.py. The -h flag lists the arguments each script needs. Preprocessing downcases the text so that capitalization doesn't affect Morfessor.

python preprocess_data.py <textfile>.trn -o <output_file> -n <unks all but top N words>

python make_dataset.py <textfile>.trn -mf <morfessor_model.bin>
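To make the two steps concrete, here is a hedged sketch of what they amount to. The file names and cutoff below are illustrative; preprocess_data.py and make_dataset.py are the authoritative implementations.

# Illustrative sketch of preprocessing (downcase + UNK rare words) and
# morpheme segmentation; the repository's scripts are authoritative.
from collections import Counter
import morfessor

UNK = "<UNK>"
TOP_N = 50000  # hypothetical cutoff, set via -n in the real script

# Downcase so capitalization doesn't affect Morfessor
tokens = open("textfile.trn").read().lower().split()

# Replace all but the top-N most frequent words with UNK
vocab = {w for w, _ in Counter(tokens).most_common(TOP_N)}
tokens = [w if w in vocab else UNK for w in tokens]

# Segment each vocabulary word into morphemes with the trained model
io = morfessor.MorfessorIO()
model = io.read_binary_model_file("morfessor_model.bin")
segmentations = {w: model.viterbi_segment(w)[0] for w in vocab}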

Next, run train.py to train the model. It will print statistics after each mini-batch.

python train.py <filename>.hdf5

Parameters like batch size, embedding dimension, and the number of epochs can be changed in the config.py file.
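The exact parameter names in config.py may differ; a plausible sketch of such a configuration module:

# Hypothetical sketch of config.py; the repository's actual names may differ.
batch_size = 64        # examples per mini-batch
embedding_dim = 128    # dimensionality of the word embeddings
num_epochs = 10        # passes over the training data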

Finally, word vectors can be output in the format word dim1 dim2 ..., one word per line, via the output_word_vectors.py script. Provide it a vocabulary of words whose vectors should be output, as well as a serialized network from training.
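For illustration, writing vectors in that plain-text format might look like the following; the vocab and vectors variables are assumed inputs, not the script's actual internals.

# Illustrative: write vectors as "word dim1 dim2 ...", one word per line.
# `vocab` and `vectors` are assumed to come from the deserialized network.
with open("vectors.txt", "w") as f:
    for word, vec in zip(vocab, vectors):
        f.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")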
