
akutuzov / Webvectors

Licence: gpl-3.0
Web-ify your word2vec: framework to serve distributional semantic models online

Programming Languages

python

Projects that are alternatives of or similar to Webvectors

Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (fastText, Word2Vec)
Stars: ✭ 146 (-5.19%)
Mutual labels:  word2vec, gensim
Sense2vec
🦆 Contextually-keyed word vectors
Stars: ✭ 1,184 (+668.83%)
Mutual labels:  word2vec, gensim
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-72.08%)
Mutual labels:  word2vec, gensim
Word2vec Tutorial
Chinese word vector training tutorial
Stars: ✭ 426 (+176.62%)
Mutual labels:  word2vec, gensim
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-17.53%)
Mutual labels:  word2vec, gensim
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+412.99%)
Mutual labels:  word2vec, gensim
Critiquebrainz
Repository for Creative Commons licensed reviews
Stars: ✭ 59 (-61.69%)
Mutual labels:  flask, web-app
word2vec-pt-br
Implementation and a model generated by training (trigram) on the Portuguese (pt-br) Wikipedia
Stars: ✭ 34 (-77.92%)
Mutual labels:  word2vec, gensim
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+805.19%)
Mutual labels:  word2vec, gensim
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+737.66%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+162.34%)
Mutual labels:  word2vec, gensim
Word2vec Spam Filter
Using word vectors to classify spam messages
Stars: ✭ 149 (-3.25%)
Mutual labels:  flask, word2vec
wordfish-python
extract relationships from standardized terms from corpus of interest with deep learning 🐟
Stars: ✭ 19 (-87.66%)
Mutual labels:  word2vec, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-84.42%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-80.52%)
Mutual labels:  word2vec, gensim
Word2vec
Training Chinese word vectors with Word2vec. Word2vec was created by a team of researchers led by Tomas Mikolov at Google.
Stars: ✭ 48 (-68.83%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk, Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (-38.96%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (-66.23%)
Mutual labels:  word2vec, gensim
Musae
The reference implementation of "Multi-scale Attributed Node Embedding".
Stars: ✭ 75 (-51.3%)
Mutual labels:  word2vec, gensim
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (-12.99%)
Mutual labels:  word2vec, gensim

webvectors

WebVectors is a toolkit to serve vector semantic models (in particular, prediction-based word embeddings, as in word2vec or ELMo) over the web, making it easy to demonstrate their abilities to the general public. It requires Python >= 3.6 and uses Flask, Gensim and simple_elmo under the hood.

Working demos:

  • https://rusvectores.org (for Russian)
  • http://vectors.nlpl.eu/explore/embeddings/ (for English and Norwegian)

The service can either be integrated into the Apache web server as a WSGI application or run as a standalone server using Gunicorn (we recommend the latter option).


Brief installation instructions

  1. Clone the WebVectors git repository (git clone https://github.com/akutuzov/webvectors.git) into a directory accessible by your web server.
  2. Install Apache for the Apache integration variant, or Gunicorn for the standalone server.
  3. Install all the Python requirements (pip3 install -r requirements.txt).
  4. If you want to use PoS tagging for user queries, install UDPipe, Stanford CoreNLP, Freeling or another PoS tagger of your choice.
  5. Configure the files as described below.

For Apache installation variant

Add the following line to the Apache configuration file:

WSGIScriptAlias /WEBNAME "PATH/syn.wsgi"

where WEBNAME is the alias for your service relative to the server root (webvectors for http://example.com/webvectors), and PATH is the filesystem path to your WebVectors directory.
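For reference, a fuller mod_wsgi configuration could look like the sketch below; the alias, process name and all paths are illustrative assumptions, not values shipped with WebVectors:

```apache
# Hypothetical example; adjust the names and paths to your installation.
WSGIDaemonProcess webvectors python-path=/var/www/webvectors
WSGIProcessGroup webvectors
WSGIScriptAlias /webvectors "/var/www/webvectors/syn.wsgi"

<Directory /var/www/webvectors>
    Require all granted
</Directory>
```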

For all installation variants

In all *.wsgi and *.py files in your WebVectors directory, replace webvectors.cfg in the string config.read('webvectors.cfg') with the absolute path to the webvectors.cfg file.
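For instance, after the change the call might look like this (the path below is an illustrative assumption; use your own installation directory):

```python
import configparser

config = configparser.ConfigParser()
# Use the absolute path to your own webvectors.cfg here:
config.read('/var/www/webvectors/webvectors.cfg')
```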

Set up your service using the configuration file webvectors.cfg. The most important settings are (an illustrative fragment follows this list):

  • `root` - absolute path to your WebVectors directory (NB: end it with a slash!)
  • `temp` - absolute path to your temporary files directory
  • `font` - absolute path to a TTF font you want to use for plots (otherwise, the default system font will be used)
  • `detect_tag` - whether to use automatic PoS tagging
  • `default_search` - URL of the search engine to use on individual word pages (for example, https://duckduckgo.com/?q=)
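A minimal sketch of these settings (all values below are examples, not shipped defaults; the overall layout should follow your webvectors.cfg):

```ini
root = /var/www/webvectors/
temp = /var/www/webvectors/tmp/
font = /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf
detect_tag = True
default_search = https://duckduckgo.com/?q=
```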

Tags

Models can use arbitrary tags assigned to words (for example, part-of-speech tags, as in boot_NOUN). If your models are trained on words with tags, you should switch this on in webvectors.cfg (the use_tags variable). WebVectors will then allow users to filter their queries by tags. You should also specify the list of allowed tags (the tags_list variable in webvectors.cfg) and the list of tags which will be shown to the user (the tags.tsv file).
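Purely as an illustration, such a file might map internal tags to user-facing labels like this (the rows are invented; the exact format should follow the tags.tsv example shipped with WebVectors):

```
NOUN	noun
VERB	verb
ADJ	adjective
```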

Models daemon

WebVectors uses a daemon which runs in the background and actually processes all embedding-related tasks. It can also run on a different machine if you want. Thus, in webvectors.cfg you should specify the host and port that this daemon will listen on. After that, start the actual daemon script word2vec_server.py. It will load the models and open a listening socket. This daemon must remain active permanently, so you may want to launch it using screen or a similar tool.
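For example, one way to launch the daemon in a detached screen session (the session name is arbitrary):

```
screen -dmS webvectors_daemon python3 word2vec_server.py
```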

Models

The list of models you want to use is defined in the file models.tsv. It consists of the following tab-separated fields (an example entry is shown after this list):

  • model identifier
  • model description
  • path to model
  • identifier of localized model name
  • is the model default or not
  • does the model contain PoS tags
  • training algorithm of the model (word2vec/fastText/etc)
  • size of the training corpus in words

The model identifier will be used as the name for checkboxes on the web pages; it is also important that the same identifier is used in the strings.csv file when denoting model names.
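For illustration, a single models.tsv line with the eight fields above might look like this (all values are hypothetical; fields are separated by tabs):

```
ruscorpora	Model trained on the Russian National Corpus	/var/www/webvectors/models/ruscorpora.model	ruscorpora_name	True	True	word2vec	250000000
```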

Models can currently be in 4 formats:

  • plain text word2vec models (ending in `.vec`);
  • binary word2vec models (ending in `.bin`);
  • Gensim format word2vec models (ending in `.model`);
  • Gensim format fastText models (ending in `.model`).

WebVectors will automatically detect the format of each model and load all of them into memory. Users will be able to choose among the loaded models.
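For context, this is roughly how Gensim itself loads each of these formats (a sketch assuming Gensim 4.x APIs; the file names are placeholders, and this is not the project's actual loading code):

```python
from gensim.models import KeyedVectors, Word2Vec, FastText

# Plain text word2vec format (.vec):
vec_model = KeyedVectors.load_word2vec_format('model.vec', binary=False)

# Binary word2vec format (.bin):
bin_model = KeyedVectors.load_word2vec_format('model.bin', binary=True)

# Native Gensim word2vec and fastText formats (.model):
w2v_model = Word2Vec.load('word2vec.model')
ft_model = FastText.load('fasttext.model')
```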

Localization

WebVectors uses the strings.csv file as the source of localized strings. It is a comma-separated file with 3 fields:

  • identifier
  • string in language 1
  • string in language 2

By default, language 1 is English and language 2 is Russian. This can be changed in webvectors.cfg.
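A hypothetical fragment of strings.csv (the identifiers and translations here are invented for illustration):

```
similar_words,Similar words,Похожие слова
models_header,Models,Модели
```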

Templates

The actual web pages shown to the user are defined in the files templates/*.html. Tune them as you wish. The main menu is defined in base.html.

Static files

If your application does not find the static files (Bootstrap and JS scripts), edit the static_url_path variable in run_syn.py: set it to the absolute path of the data folder.

Query hints

If you want query hints to work, do not forget to compile your own list of hints (in JSON format). An example of such a list is given in data/example_vocab.json. The actual URL of this list should be set in data/hint.js.
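A minimal sketch of what such a hint list could look like (the words are invented; copy the exact structure from data/example_vocab.json):

```json
["boot", "shoe", "sandal", "slipper"]
```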

Running WebVectors

Once you have modified all the settings according to your workflow, made sure the templates are OK for you, and launched the models daemon, you are ready to actually start the service. If you use Apache integration, simply restart/reload Apache. If you prefer the standalone option, execute the following command in the root directory of the project:

gunicorn run_syn:app_syn -b address:port

where address is the address on which the service should listen (it can be localhost), and port is the port to listen on (for example, 9999).
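For example, to serve on localhost port 9999 (both values are illustrative):

```
gunicorn run_syn:app_syn -b 127.0.0.1:9999
```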

Support for contextualized embeddings

You can turn on support for contextualized embedding models (currently ELMo is supported). In order to do that:

  1. Install the simple_elmo package.

  2. Download an ELMo model of your choice (for example, from the NLPL vector repository).

  3. Create a type-based projection in the word2vec format for a limited set of words (for example, 10 000), given the ELMo model and a reference corpus. For this, use the extract_elmo.py script we provide:

python3 extract_elmo.py --input CORPUS --elmo PATH_TO_ELMO --outfile TYPE_EMBEDDING_FILE --vocab WORD_SET_FILE

It will run the ELMo model over the provided corpus and generate static averaged type embeddings for each word in the word set. They will be used as lexical substitutes.

  4. Prepare a frequency dictionary to use with the contextualized visualizations, as a plain-text tab-separated file where the first column contains words and the second column contains their frequencies in a reference corpus of your choice. The first line of this file should contain one integer matching the size of the corpus in word tokens (see the sketch after this list).

  5. In the [Token] section of the webvectors.cfg configuration file, switch use_contextualized to True and state the paths to your token_model (pre-trained ELMo), type_model (the type-based projection you created with our script) and freq_file (your frequency dictionary).

  6. In the ref_static_model field, specify any of your static word embedding models (just its name) to be used as the target of hyperlinks from words on the contextualized visualization pages.

  7. The page with ELMo lexical substitutes will be available at http://YOUR_ROOT_URL/contextual/.
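A sketch of the frequency dictionary format from step 4 (the corpus size, words and counts below are invented):

```
1000000000
the	69971543
of	36411234
boot	120567
```

And a hypothetical [Token] section of webvectors.cfg corresponding to steps 5 and 6 (the option names follow the description above; the paths and the model name are examples):

```ini
[Token]
use_contextualized = True
token_model = /var/www/webvectors/models/elmo/
type_model = /var/www/webvectors/models/elmo_type_projection.vec
freq_file = /var/www/webvectors/models/freq.tsv
ref_static_model = ruscorpora
```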

Contacts

In case of any problems, please feel free to contact us.

