wikipedia2vec / Wikipedia2vec

Licence: Apache-2.0
A tool for learning vector representations of words and entities from Wikipedia

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to Wikipedia2vec

Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (-54.5%)
Mutual labels:  natural-language-processing, text-classification
Multi Class Text Classification Cnn
Classify Kaggle Consumer Finance Complaints into 11 classes. Build the model with a CNN (Convolutional Neural Network) and word embeddings in TensorFlow.
Stars: ✭ 410 (-37.4%)
Mutual labels:  text-classification, embeddings
Adam qas
ADAM - A Question Answering System. Inspired by IBM Watson
Stars: ✭ 330 (-49.62%)
Mutual labels:  wikipedia, natural-language-processing
Catalyst
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Stars: ✭ 224 (-65.8%)
Mutual labels:  natural-language-processing, embeddings
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing
Stars: ✭ 24,626 (+3659.69%)
Mutual labels:  natural-language-processing, text-classification
Pytorch Transformers Classification
Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Stars: ✭ 229 (-65.04%)
Mutual labels:  natural-language-processing, text-classification
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-45.04%)
Mutual labels:  natural-language-processing, text-classification
Parallax
Tool for interactive embeddings visualization
Stars: ✭ 192 (-70.69%)
Mutual labels:  natural-language-processing, embeddings
Ner Lstm
Named Entity Recognition using multilayered bidirectional LSTM
Stars: ✭ 532 (-18.78%)
Mutual labels:  natural-language-processing, embeddings
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (-29.77%)
Mutual labels:  natural-language-processing, embeddings
Speedtorch
Library for faster pinned CPU <-> GPU transfer in PyTorch
Stars: ✭ 615 (-6.11%)
Mutual labels:  natural-language-processing, embeddings
Multi Class Text Classification Cnn Rnn
Classify Kaggle San Francisco Crime Descriptions into 39 classes. Build the model with a CNN, RNNs (GRU and LSTM), and word embeddings in TensorFlow.
Stars: ✭ 570 (-12.98%)
Mutual labels:  text-classification, embeddings
Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+328.09%)
Mutual labels:  natural-language-processing, text-classification
Text and Audio classification with Bert
Text Classification of Turkish Texts with BERT
Stars: ✭ 34 (-94.81%)
Mutual labels:  text-classification, embeddings
Bert4doc Classification
Code and source for the paper "How to Fine-Tune BERT for Text Classification?"
Stars: ✭ 220 (-66.41%)
Mutual labels:  natural-language-processing, text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-45.34%)
Mutual labels:  natural-language-processing, text-classification
Vec4ir
Word Embeddings for Information Retrieval
Stars: ✭ 188 (-71.3%)
Mutual labels:  natural-language-processing, embeddings
Pyss3
A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI
Stars: ✭ 191 (-70.84%)
Mutual labels:  natural-language-processing, text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+3255.42%)
Mutual labels:  natural-language-processing, text-classification
Pythoncode Tutorials
The Python Code Tutorials
Stars: ✭ 544 (-16.95%)
Mutual labels:  natural-language-processing, text-classification

Wikipedia2Vec

Wikipedia2Vec is a tool for obtaining embeddings (i.e., vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, placing similar words and entities close to one another in a continuous vector space. Embeddings can be trained with a single command, using a publicly available Wikipedia dump as input.

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
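
As a rough sketch (not the tool's exact implementation, which in practice relies on approximations such as negative sampling), the standard skip-gram objective maximizes the log-probability of context words given each target word; the entity extension of Yamada et al. (2016) adds analogous terms for Wikipedia's anchor contexts and link graph:

\mathcal{L} = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}

where T is the corpus length, c the context window size, W the vocabulary size, and v and v' the input and output vector representations.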

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running the train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from it:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
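
Once training finishes, the learned model can be queried from Python. The following is a minimal sketch using the library's API (Wikipedia2Vec.load, get_word_vector, get_entity_vector, get_entity, and most_similar); the model path and the example word and entity are placeholders:

from wikipedia2vec import Wikipedia2Vec

# Load the model written by the train command (path is a placeholder).
wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

# Vectors for a word and an entity in the same continuous space.
word_vec = wiki2vec.get_word_vector("tokyo")
entity_vec = wiki2vec.get_entity_vector("Tokyo")

# Words and entities most similar to the entity "Tokyo".
for item, score in wiki2vec.most_similar(wiki2vec.get_entity("Tokyo"), 5):
    print(item, score)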

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.
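
The pretrained files are distributed in a binary format (loadable with Wikipedia2Vec.load) and a word2vec-compatible text format. As a hedged sketch, the text format can be read with gensim; the file name below is a placeholder, and the ENTITY/ prefix convention for entity titles (with underscores in place of spaces) is an assumption to verify against the download page:

from gensim.models import KeyedVectors

# Load a pretrained text-format file (file name is a placeholder).
kv = KeyedVectors.load_word2vec_format("enwiki_20180420_100d.txt.bz2", binary=False)

# Word vectors are keyed by (lowercased) tokens.
print(kv["tokyo"][:5])

# Entity vectors: assumed to be keyed as ENTITY/<title>, e.g. ENTITY/Tokyo.
print(kv["ENTITY/Tokyo"][:5])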

Use Cases

Wikipedia2Vec has been applied to the following tasks:

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title={{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia},
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year={2020},
  publisher={Association for Computational Linguistics},
  pages={23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of the 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages={563--573}
}

License

Apache License 2.0

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].