
yumeng5 / Spherical Text Embedding

License: Apache-2.0
[NeurIPS 2019] Spherical Text Embedding

Programming Languages

C
50402 projects - #5 most used programming language

Projects that are alternatives to or similar to Spherical Text Embedding

Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (-32.87%)
Mutual labels:  unsupervised-learning, word-embeddings
Awesome Sentence Embedding
A curated list of pretrained sentence and word embedding models
Stars: ✭ 1,973 (+1279.72%)
Mutual labels:  unsupervised-learning, word-embeddings
3dpose gan
The authors' implementation of Unsupervised Adversarial Learning of 3D Human Pose from 2D Joint Locations
Stars: ✭ 124 (-13.29%)
Mutual labels:  unsupervised-learning
Isolation Forest
A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.
Stars: ✭ 139 (-2.8%)
Mutual labels:  unsupervised-learning
Awesome Community Detection
A curated list of community detection research papers with implementations.
Stars: ✭ 1,874 (+1210.49%)
Mutual labels:  unsupervised-learning
Hash Embeddings
PyTorch implementation of Hash Embeddings (NIPS 2017). Submission to the NIPS Implementation Challenge.
Stars: ✭ 126 (-11.89%)
Mutual labels:  word-embeddings
Arflow
The official PyTorch implementation of the paper "Learning by Analogy: Reliable Supervision from Transformations for Unsupervised Optical Flow Estimation".
Stars: ✭ 134 (-6.29%)
Mutual labels:  unsupervised-learning
Sfmlearner
An unsupervised learning framework for depth and ego-motion estimation from monocular videos
Stars: ✭ 1,661 (+1061.54%)
Mutual labels:  unsupervised-learning
Deepmapping
code/webpage for the DeepMapping project
Stars: ✭ 140 (-2.1%)
Mutual labels:  unsupervised-learning
E3d lstm
e3d-lstm; Eidetic 3D LSTM: A Model for Video Prediction and Beyond
Stars: ✭ 129 (-9.79%)
Mutual labels:  unsupervised-learning
Splitbrainauto
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.
Stars: ✭ 137 (-4.2%)
Mutual labels:  unsupervised-learning
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-11.19%)
Mutual labels:  word-embeddings
Tybalt
Training and evaluating a variational autoencoder for pan-cancer gene expression data
Stars: ✭ 126 (-11.89%)
Mutual labels:  unsupervised-learning
Oneshottranslation
Pytorch implementation of "One-Shot Unsupervised Cross Domain Translation" NIPS 2018
Stars: ✭ 135 (-5.59%)
Mutual labels:  unsupervised-learning
Gon
Gradient Origin Networks - a new type of generative model that is able to quickly learn a latent representation without an encoder
Stars: ✭ 126 (-11.89%)
Mutual labels:  unsupervised-learning
Complete Life Cycle Of A Data Science Project
Complete-Life-Cycle-of-a-Data-Science-Project
Stars: ✭ 140 (-2.1%)
Mutual labels:  unsupervised-learning
Cleanlab
The standard package for machine learning with noisy labels, finding mislabeled data, and uncertainty quantification. Works with most datasets and models.
Stars: ✭ 2,526 (+1666.43%)
Mutual labels:  unsupervised-learning
Deepco3
[CVPR19] DeepCO3: Deep Instance Co-segmentation by Co-peak Search and Co-saliency (Oral paper)
Stars: ✭ 127 (-11.19%)
Mutual labels:  unsupervised-learning
And
Official Pytorch Implementation for ICML'19 paper: Unsupervised Deep Learning by Neighbourhood Discovery
Stars: ✭ 133 (-6.99%)
Mutual labels:  unsupervised-learning
Flappy Es
Flappy Bird AI using Evolution Strategies
Stars: ✭ 140 (-2.1%)
Mutual labels:  unsupervised-learning

Spherical Text Embedding

This repository contains the source code for Spherical Text Embedding, published in NeurIPS 2019. The code structure (especially the file reading and saving functions) is adapted from the Word2Vec implementation.

Requirements

The source is written in C, so a C compiler (e.g., GCC) is needed; the provided shell scripts compile the source before running it.

Pre-trained Embeddings

We provide pre-trained JoSE embeddings trained on the Wikipedia dump.

Unlike Euclidean embeddings such as Word2Vec and GloVe, spherical embeddings do not necessarily benefit from higher-dimensional space, so it might be a good idea to start with lower-dimensional ones first.
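
The exact file format of the released vectors is not described here; assuming they follow the plain-text Word2Vec convention (a header line with vocabulary size and dimension, then one word and its vector per line), a minimal Python loader could look like the sketch below. The file name jose.txt is a placeholder; check the downloaded file if the format differs.

import numpy as np

def load_embeddings(path):
    # Assumed format: header "vocab_size dim", then one word followed by its
    # whitespace-separated vector per line (Word2Vec text convention).
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

word_vecs = load_embeddings("jose.txt")  # placeholder path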

Run the Code

We provide a shell script run.sh for compiling the source file and training the embeddings.

Note: When preparing the training text corpus, make sure each line in the file is one document/paragraph.
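
As a minimal illustration of that corpus format, the following Python sketch writes a toy corpus with one whitespace-tokenized document per line; the documents and the output path text.txt are placeholders.

# Toy corpus: one whitespace-tokenized document/paragraph per line.
documents = [
    "spherical text embedding trains word and document vectors jointly",
    "each line of the training file is treated as one document or paragraph",
]
with open("text.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc.strip().replace("\n", " ") + "\n")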

Hyperparameters

Note: It is recommended to use the default hyperparameters, especially the number of negative samples (-negative) and loss function margin (-margin).

Invoke the command without arguments for a list of hyperparameters and their meanings:

$ ./src/jose
Parameters:
        -train <file> (mandatory argument)
                Use text data from <file> to train the model
        -word-output <file>
                Use <file> to save the resulting word vectors
        -context-output <file>
                Use <file> to save the resulting word context vectors
        -doc-output <file>
                Use <file> to save the resulting document vectors
        -size <int>
                Set size of word vectors; default is 100
        -window <int>
                Set max skip length between words; default is 5
        -sample <float>
                Set threshold for occurrence of words. Those that appear with higher frequency in the
                training data will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-3)
        -negative <int>
                Number of negative examples; default is 2
        -threads <int>
                Use <int> threads; default is 20
        -margin <float>
                Margin used in loss function to separate positive samples from negative samples; default is 0.15
        -iter <int>
                Run more training iterations; default is 10
        -min-count <int>
                This will discard words that appear less than <int> times; default is 5
        -alpha <float>
                Set the starting learning rate; default is 0.04
        -debug <int>
                Set the debug mode (default = 2 = more info during training)
        -save-vocab <file>
                The vocabulary will be saved to <file>
        -read-vocab <file>
                The vocabulary will be read from <file>, not constructed from the training data
        -load-emb <file>
                The pretrained embeddings will be read from <file>

Examples:
./jose -train text.txt -word-output jose.txt -size 100 -margin 0.15 -window 5 -sample 1e-3 -negative 2 -iter 10

Word Similarity Evaluation

We provide a shell script eval_sim.sh for word similarity evaluation of trained spherical word embeddings on the Wikipedia dump. The script first downloads a zipped file of the pre-processed Wikipedia dump (retrieved 2019.05; ~4 GB zipped, ~13 GB unzipped; see its README file for a detailed description of the dataset) and then runs JoSE on it. Finally, the trained embeddings are evaluated on three benchmark word similarity datasets: WordSim-353, MEN, and SimLex-999.
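
The script is the authoritative reference; as an illustration of the standard word similarity protocol it follows (cosine similarity of word vectors compared against human judgments via Spearman correlation), a rough Python sketch might look like this, where vectors and pairs are supplied by the caller:

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(vectors, pairs):
    # vectors: dict word -> vector; pairs: iterable of (w1, w2, human_score).
    # Spearman correlation over pairs whose words are both in the vocabulary.
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(score)
    return spearmanr(model, human).correlation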

Document Clustering Evaluation

We provide a shell script eval_cluster.sh for document clustering evaluation of trained spherical document embeddings on the 20 Newsgroup dataset. The script will perform K-Means and Spherical K-Means clustering on the trained document embeddings.
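
As a rough Python sketch of this kind of evaluation (not the repo's actual script), standard K-Means can be run with scikit-learn; L2-normalizing the document vectors first is a common stand-in for Spherical K-Means, though the script may use a dedicated spherical K-Means implementation instead. Clustering quality is scored here with normalized mutual information against the gold newsgroup labels.

from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import normalize

def cluster_nmi(doc_vectors, gold_labels, n_clusters=20, spherical=False):
    # doc_vectors: (n_docs, dim) array; gold_labels: newsgroup class ids.
    # spherical=True L2-normalizes the vectors before K-Means, a common
    # stand-in for spherical K-Means on unit-norm embeddings.
    X = normalize(doc_vectors) if spherical else doc_vectors
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return normalized_mutual_info_score(gold_labels, pred)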

Document Classification Evaluation

We provide a shell script eval_classify.sh for document classification evaluation of trained spherical document embeddings on the 20 Newsgroup dataset. The script will perform KNN classification following the original 20 Newsgroup train/test split with the trained document embeddings as features.
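
A minimal Python sketch of such a KNN evaluation with scikit-learn is shown below; the value of k and the use of cosine distance are illustrative assumptions, not necessarily the script's settings.

from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_X, train_y, test_X, test_y, k=3):
    # Cosine distance suits unit-norm spherical embeddings; k is illustrative.
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_X, train_y)
    return clf.score(test_X, test_y)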

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2019spherical,
  title={Spherical Text Embedding},
  author={Meng, Yu and Huang, Jiaxin and Wang, Guangyuan and Zhang, Chao and Zhuang, Honglei and Kaplan, Lance and Han, Jiawei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}