
zhezhaoa / Ngram2vec

Four word embedding models implemented in Python, supporting arbitrary context features.


Projects that are alternatives of or similar to Ngram2vec

Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (-42.53%)
Mutual labels:  word, word2vec, glove
Lightnlp
A deep learning framework for natural language processing based on PyTorch and torchtext.
Stars: ✭ 739 (+5.12%)
Mutual labels:  chinese, word2vec
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+846.8%)
Mutual labels:  chinese, word2vec
word2vec-movies
Bag of Words Meets Bags of Popcorn in Python 3 (Chinese tutorial)
Stars: ✭ 54 (-92.32%)
Mutual labels:  word2vec, chinese
Embedding As Service
One-stop solution to encode sentences into fixed-length vectors using various embedding techniques
Stars: ✭ 151 (-78.52%)
Mutual labels:  word2vec, glove
Word2vec
Go library for performing computations in word2vec binary models
Stars: ✭ 143 (-79.66%)
Mutual labels:  word, word2vec
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (-85.92%)
Mutual labels:  word2vec, glove
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+98.29%)
Mutual labels:  word2vec, glove
sarcasm-detection-for-sentiment-analysis
Sarcasm Detection for Sentiment Analysis
Stars: ✭ 21 (-97.01%)
Mutual labels:  word2vec, glove
navec
Compact, high-quality word embeddings for the Russian language
Stars: ✭ 118 (-83.21%)
Mutual labels:  word2vec, glove
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (-52.2%)
Mutual labels:  word2vec, glove
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (fastText, Word2Vec)
Stars: ✭ 146 (-79.23%)
Mutual labels:  word2vec, glove
Hierarchical Attention Network
Implementation of Hierarchical Attention Networks in PyTorch
Stars: ✭ 120 (-82.93%)
Mutual labels:  word2vec, glove
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (-70.27%)
Mutual labels:  word, word2vec
Textclf
TextClf: a text classification framework based on PyTorch/scikit-learn, including logistic regression, SVM, TextCNN, TextRNN, TextRCNN, DRNN, DPCNN, BERT, and other models; data processing, model training, and testing are completed through simple configuration.
Stars: ✭ 105 (-85.06%)
Mutual labels:  word2vec, glove
Cn sort
Chinese sorting: quickly sort Simplified Chinese words by pinyin or stroke order (millions of entries, supporting mixed Chinese/English text and polyphonic characters). If this helps you, a star is appreciated.
Stars: ✭ 102 (-85.49%)
Mutual labels:  chinese, word
Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (-89.9%)
Mutual labels:  word2vec, glove
Glove As A Tensorflow Embedding Layer
Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
Stars: ✭ 85 (-87.91%)
Mutual labels:  word2vec, glove
NLP-paper
NLP (natural language processing) tutorial: https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-96.73%)
Mutual labels:  word2vec, glove
Pycadl
Python package with source code from the course "Creative Applications of Deep Learning w/ TensorFlow"
Stars: ✭ 356 (-49.36%)
Mutual labels:  word2vec, glove

Ngram2vec

The ngram2vec toolkit was originally developed to reproduce the results of the paper Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics, aiming at learning high-quality word embeddings and ngram embeddings.

Thanks to its well-designed architecture (discussed below), the ngram2vec toolkit provides a general and powerful framework that covers the approaches of many papers and popular toolkits such as word2vec, and allows researchers to easily learn representations from co-occurrence statistics. Ngram2vec can generate embeddings at different granularities (beyond word embeddings); for example, it can be used for learning text embeddings. Text embeddings trained by ngram2vec are very competitive: they outperform many deep and complex neural networks and achieve state-of-the-art results on a range of datasets. More details will be released later.
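
To make "ngram co-occurrence statistics" concrete, here is a minimal illustrative Python sketch, not code from the toolkit, that extracts (center word, context ngram) pairs from a tokenized sentence with a symmetric window; the function name and the window/order parameters are hypothetical:

# Illustrative only; the toolkit implements pair extraction in its own scripts.
def word_ngram_pairs(tokens, window=2, max_n=2):
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for n in range(1, max_n + 1):  # ngram orders 1..max_n
            for j in range(lo, hi - n + 1):
                if j <= i < j + n:  # skip ngrams that cover the center word
                    continue
                pairs.append((word, "_".join(tokens[j:j + n])))
    return pairs

print(word_ngram_pairs("we like natural language processing".split()))

Counting how often each such pair occurs over a large corpus yields the co-occurrence statistics that the embedding models are trained on.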

Ngram2vec has been successfully applied in many projects. For example, Chinese-Word-Vectors provides over 100 Chinese word embeddings with different properties, all of which are trained by the ngram2vec toolkit.

The original version (v0.0.0) of ngram2vec can be downloaded from the GitHub releases page; Python 2 is recommended for it. One can download ngram2vec v0.0.0 to reproduce the paper's results.

Features

Ngram2vec features a decoupled architecture: the process from raw corpus to final embeddings is split into multiple modules. This brings many advantages over other toolkits.

  • Well-organized: the ngram2vec toolkit is easy to read and understand.
  • Extensible: one can add new co-occurrence statistics and embedding models with little effort.
  • Intermediate results reuse: intermediate results are written to disk and reused later, which largely improves efficiency in both speed and space (see the sketch after this list).
  • Comprehensive: ngram2vec covers a large body of work related to word embeddings.
  • Embeddings of different linguistic units: ngram2vec can learn embeddings of different linguistic units; for example, it can produce high-quality text embeddings that achieve state-of-the-art results on a range of datasets.
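
As a concrete illustration of the decoupled design, the sketch below is hypothetical code, not the toolkit's actual scripts (the real stages live in files invoked by word_example.sh): it writes co-occurrence counts to disk once, then trains a PPMI model purely from the saved counts.

import math
from collections import Counter

def pairs2counts(pairs, path):
    # Stage 1: aggregate (word, context) pairs into counts and save them,
    # so every downstream model can reuse the same file.
    with open(path, "w") as f:
        for (w, c), n in sorted(Counter(pairs).items()):
            f.write("%s %s %d\n" % (w, c, n))

def counts2ppmi(path):
    # Stage 2: compute PPMI from the saved counts without touching the corpus.
    counts, w_tot, c_tot, total = {}, Counter(), Counter(), 0
    for line in open(path):
        w, c, n = line.split()
        counts[(w, c)] = int(n)
        w_tot[w] += int(n)
        c_tot[c] += int(n)
        total += int(n)
    return {(w, c): max(0.0, math.log(n * total / float(w_tot[w] * c_tot[c])))
            for (w, c), n in counts.items()}

pairs2counts([("cats", "purr"), ("cats", "fish"), ("dogs", "fish")], "counts.txt")
print(counts2ppmi("counts.txt"))

Other models (e.g. SVD over the PPMI matrix, or SGNS) would read the same counts file; this reuse is the efficiency gain referred to in the list above.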

Requirements

  • Python (both Python 2 and Python 3 are supported)
  • numpy
  • scipy
  • sparsesvd

Example use cases

First, run the following commands to make some files executable.
chmod +x *.sh
chmod +x scripts/clean_corpus.sh
python scripts/compile_c.py

Also, a corpus should be prepared. We recommend the wiki corpus (with XML tags removed) at
http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2. scripts/clean_corpus.sh is used for cleaning an English corpus, for example:
scripts/clean_corpus.sh WestburyLab.wikicorp.201004.txt > wiki2010.clean
A pre-processed (segmented) Chinese wiki corpus is available at https://pan.baidu.com/s/1kURV0rl and can be used directly as input to this toolkit.

Run ./word_example.sh to see the baselines.
Run ./ngram_example.sh to introduce ngrams into recent word representation methods, inspired by the traditional language modeling problem.

Workflow
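
The repository's README presents the workflow as a diagram; as a rough textual sketch (stage names paraphrased, not the exact script names), the data flows as:

corpus -> (word, context) pairs + vocabularies -> co-occurrence counts
counts -> {SGNS | GloVe | PPMI | SVD} -> embeddings -> evaluation (similarity, analogy)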

Testsets

Besides English word analogy and similarity datasets, we provide several Chinese analogy datasets containing comprehensive analogy questions. Some of them are constructed by directly translating English analogy datasets; others are unique to Chinese. We hope they become useful resources for evaluating Chinese word embeddings.
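
Analogy testsets of this kind typically follow the four-words-per-line format popularized by word2vec's questions-words.txt, e.g. "Athens Greece Beijing China", read as "Athens is to Greece as Beijing is to China" (check the files in the repository for the exact format used here). A minimal 3CosAdd-style evaluation sketch, assuming a hypothetical dict vecs that maps words to L2-normalized numpy vectors:

import numpy as np

def analogy(vecs, a, b, c):
    # Return the word d (excluding a, b, c) whose vector is most similar
    # to b - a + c, i.e. the standard 3CosAdd analogy answer.
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    scored = ((np.dot(v, target), w) for w, v in vecs.items() if w not in (a, b, c))
    return max(scored)[1]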

References

@inproceedings{DBLP:conf/emnlp/ZhaoLLLD17,
  author    = {Zhe Zhao and Tao Liu and Shen Li and Bofang Li and Xiaoyong Du},
  title     = {Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September 9-11, 2017},
  year      = {2017}
}

Acknowledgments

This toolkit is inspired by Omer Levy's hyperwords (http://bitbucket.org/omerlevy/hyperwords).
We reuse part of his code in this toolkit and thank him for his kind suggestions.
I also received help from Bofang Li, Prof. Ju Fan, and Jianwei Cui at Xiaomi.
My advisors are Tao Liu and Xiaoyong Du.

Contact us

We look forward to your questions and advice about this toolkit and will reply as soon as possible. We will continue to improve it.

Zhe Zhao, [email protected], from DBIIR lab
Shen Li, [email protected]
Renfen Hu, [email protected]
