wikipedia2vec / Wikipedia2vec

Licence: Apache-2.0
A tool for learning vector representations of words and entities from Wikipedia

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to Wikipedia2vec

Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (-54.5%)
Mutual labels:  natural-language-processing, text-classification
Multi Class Text Classification Cnn
Classify Kaggle Consumer Finance Complaints into 11 classes. Build the model with a CNN (Convolutional Neural Network) and word embeddings in TensorFlow.
Stars: ✭ 410 (-37.4%)
Mutual labels:  text-classification, embeddings
Adam qas
ADAM - A Question Answering System. Inspired by IBM Watson
Stars: ✭ 330 (-49.62%)
Mutual labels:  wikipedia, natural-language-processing
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (-65.8%)
Mutual labels:  natural-language-processing, embeddings
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing
Stars: ✭ 24,626 (+3659.69%)
Mutual labels:  natural-language-processing, text-classification
Pytorch Transformers Classification
Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Stars: ✭ 229 (-65.04%)
Mutual labels:  natural-language-processing, text-classification
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-45.04%)
Mutual labels:  natural-language-processing, text-classification
Parallax
Tool for interactive embeddings visualization
Stars: ✭ 192 (-70.69%)
Mutual labels:  natural-language-processing, embeddings
Ner Lstm
Named Entity Recognition using multilayered bidirectional LSTM
Stars: ✭ 532 (-18.78%)
Mutual labels:  natural-language-processing, embeddings
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (-29.77%)
Mutual labels:  natural-language-processing, embeddings
Speedtorch
Library for faster pinned CPU <-> GPU transfer in PyTorch
Stars: ✭ 615 (-6.11%)
Mutual labels:  natural-language-processing, embeddings
Multi Class Text Classification Cnn Rnn
Classify Kaggle San Francisco Crime Descriptions into 39 classes. Build the model with a CNN, RNNs (GRU and LSTM), and word embeddings in TensorFlow.
Stars: ✭ 570 (-12.98%)
Mutual labels:  text-classification, embeddings
Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+328.09%)
Mutual labels:  natural-language-processing, text-classification
Text and Audio classification with Bert
Text Classification of Turkish Texts with BERT
Stars: ✭ 34 (-94.81%)
Mutual labels:  text-classification, embeddings
Bert4doc Classification
Code and source for the paper "How to Fine-Tune BERT for Text Classification?"
Stars: ✭ 220 (-66.41%)
Mutual labels:  natural-language-processing, text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-45.34%)
Mutual labels:  natural-language-processing, text-classification
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (-71.3%)
Mutual labels:  natural-language-processing, embeddings
Pyss3
A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI
Stars: ✭ 191 (-70.84%)
Mutual labels:  natural-language-processing, text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+3255.42%)
Mutual labels:  natural-language-processing, text-classification
Pythoncode Tutorials
The Python Code Tutorials
Stars: ✭ 544 (-16.95%)
Mutual labels:  natural-language-processing, text-classification

Wikipedia2Vec

Wikipedia2Vec is a tool for obtaining embeddings (i.e., vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, placing similar words and entities close to one another in a continuous vector space. Embeddings can be trained with a single command, using a publicly available Wikipedia dump as input.

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
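
As a rough sketch (not the tool's exact implementation, which in practice relies on approximations such as negative sampling), the standard skip-gram objective maximizes the log-probability of context words given each target word; the entity extension of Yamada et al. (2016) adds analogous terms for Wikipedia's anchor contexts and link graph:

\mathcal{L} = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}

where T is the corpus length, c the context window size, W the vocabulary size, and v and v' the input and output vector representations.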

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running the train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from it:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
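
Once training finishes, the learned model can be queried from Python. The following is a minimal sketch using the library's API (Wikipedia2Vec.load, get_word_vector, get_entity_vector, get_entity, and most_similar); the model path and the example word and entity are placeholders:

from wikipedia2vec import Wikipedia2Vec

# Load the model written by the train command (path is a placeholder).
wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

# Vectors for a word and an entity in the same continuous space.
word_vec = wiki2vec.get_word_vector("tokyo")
entity_vec = wiki2vec.get_entity_vector("Tokyo")

# Words and entities most similar to the entity "Tokyo".
for item, score in wiki2vec.most_similar(wiki2vec.get_entity("Tokyo"), 5):
    print(item, score)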

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.
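
The pretrained files are distributed in a binary format (loadable with Wikipedia2Vec.load) and a word2vec-compatible text format. As a hedged sketch, the text format can be read with gensim; the file name below is a placeholder, and the ENTITY/ prefix convention for entity titles (with underscores in place of spaces) is an assumption to verify against the download page:

from gensim.models import KeyedVectors

# Load a pretrained text-format file (file name is a placeholder).
kv = KeyedVectors.load_word2vec_format("enwiki_20180420_100d.txt.bz2", binary=False)

# Word vectors are keyed by (lowercased) tokens.
print(kv["tokyo"][:5])

# Entity vectors: assumed to be keyed as ENTITY/<title>, e.g. ENTITY/Tokyo.
print(kv["ENTITY/Tokyo"][:5])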

Use Cases

Wikipedia2Vec has been applied to the following tasks:

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title={{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia},
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2020},
  publisher={Association for Computational Linguistics},
  pages={23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of the 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages={563--573}
}

License

Apache License 2.0

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].