
inejc / Paragraph Vectors

Licence: MIT
📄 A PyTorch implementation of Paragraph Vectors (doc2vec).

Programming Languages

Python

Projects that are alternatives of or similar to Paragraph Vectors

Minisom
🔴 MiniSom is a minimalistic implementation of the Self Organizing Maps
Stars: ✭ 801 (+137.69%)
Mutual labels:  neural-networks, unsupervised-learning
Dgi
Deep Graph Infomax (https://arxiv.org/abs/1809.10341)
Stars: ✭ 326 (-3.26%)
Mutual labels:  neural-networks, unsupervised-learning
Gon
Gradient Origin Networks - a new type of generative model that is able to quickly learn a latent representation without an encoder
Stars: ✭ 126 (-62.61%)
Mutual labels:  neural-networks, unsupervised-learning
Pyod
A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
Stars: ✭ 5,083 (+1408.31%)
Mutual labels:  neural-networks, unsupervised-learning
Sealion
The first machine learning framework that encourages learning ML concepts instead of memorizing class functions.
Stars: ✭ 278 (-17.51%)
Mutual labels:  neural-networks, unsupervised-learning
Gdrl
Grokking Deep Reinforcement Learning
Stars: ✭ 304 (-9.79%)
Mutual labels:  neural-networks
Probability
Probabilistic reasoning and statistical analysis in TensorFlow
Stars: ✭ 3,550 (+953.41%)
Mutual labels:  neural-networks
Pytorch exercises
Stars: ✭ 304 (-9.79%)
Mutual labels:  neural-networks
Segmentation models.pytorch
Segmentation models with pretrained backbones. PyTorch.
Stars: ✭ 4,584 (+1260.24%)
Mutual labels:  neural-networks
Machine learning basics
Plain python implementations of basic machine learning algorithms
Stars: ✭ 3,557 (+955.49%)
Mutual labels:  neural-networks
Artificio
Deep Learning Computer Vision Algorithms for Real-World Use
Stars: ✭ 326 (-3.26%)
Mutual labels:  neural-networks
Pywick
High-level batteries-included neural network training library for Pytorch
Stars: ✭ 320 (-5.04%)
Mutual labels:  neural-networks
Kraken
OCR engine for all the languages
Stars: ✭ 304 (-9.79%)
Mutual labels:  neural-networks
Chinese Ufldl Tutorial
[UNMAINTAINED] A Chinese tutorial on unsupervised feature learning and deep learning, translated from the new UFLDL Tutorial. Newcomers are advised to take Stanford's CS231n course instead; a version with Chinese subtitles is available on NetEase Cloud Classroom.
Stars: ✭ 303 (-10.09%)
Mutual labels:  unsupervised-learning
Mace Models
Mobile AI Compute Engine Model Zoo
Stars: ✭ 329 (-2.37%)
Mutual labels:  neural-networks
Soft Dtw
Python implementation of soft-DTW.
Stars: ✭ 300 (-10.98%)
Mutual labels:  neural-networks
Selflow
SelFlow: Self-Supervised Learning of Optical Flow
Stars: ✭ 319 (-5.34%)
Mutual labels:  unsupervised-learning
Deepspeech
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Stars: ✭ 18,680 (+5443.03%)
Mutual labels:  neural-networks
Cs231
Complete Assignments for CS231n: Convolutional Neural Networks for Visual Recognition
Stars: ✭ 317 (-5.93%)
Mutual labels:  neural-networks
Neural Pipeline
Neural networks training pipeline based on PyTorch
Stars: ✭ 315 (-6.53%)
Mutual labels:  neural-networks

Paragraph Vectors


A PyTorch implementation of Paragraph Vectors (doc2vec).

All models minimize the Negative Sampling objective proposed by T. Mikolov et al. [1]. This allows sparse updates (i.e. only the vectors of sampled noise words are used in the forward and backward passes). In addition, batches of training data (with noise sampling) are generated in parallel on the CPU while the model is trained on the GPU.
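
For concreteness, here is a minimal sketch of that objective in PyTorch; the layout of the scores tensor (true word score in column 0, noise word scores after it) is an assumption made for illustration, not necessarily the library's internal API:

import torch
import torch.nn.functional as F

def negative_sampling_loss(scores):
    # scores: (batch_size, 1 + num_noise_words); column 0 is assumed to
    # hold the score of the true target word, the remaining columns the
    # scores of the sampled noise words.
    positive = F.logsigmoid(scores[:, 0])
    negative = F.logsigmoid(-scores[:, 1:]).sum(dim=1)
    # Minimizing the negative mean maximizes the objective; gradients
    # reach only the true word and the sampled noise words.
    return -(positive + negative).mean()

loss = negative_sampling_loss(torch.randn(32, 1 + 2))  # batch of 32, 2 noise words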

Caveat emptor! Be warned that paragraph-vectors is in an early stage of development. Feedback, comments, suggestions, contributions, etc. are more than welcome.

Installation

  1. Install PyTorch (follow the link for instructions).
  2. Install the paragraph-vectors library.
git clone https://github.com/inejc/paragraph-vectors.git
cd paragraph-vectors
pip install -e .

Note that installation in a virtual environment is recommended.
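
For example, assuming Python 3 with the standard venv module:
python3 -m venv .venv
source .venv/bin/activate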

Usage

  1. Put a csv file in the data directory. Each row represents a single document and the first column should always contain the text. Note that a header line is mandatory; a sketch of producing such a file follows the example below.
data/example.csv
----------------
text,...
"In the week before their departure to Arrakis, when all the final scurrying about had reached a nearly unbearable frenzy, an old crone came to visit the mother of the boy, Paul.",...
"It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreides family as home for twenty-six generations bore that cooled-sweat feeling it acquired before a change in the weather.",...
...
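
A compliant file can be produced with Python's standard csv module. This is only an illustrative sketch; the second column and the file contents are arbitrary, the only requirements being the header line and text in the first column:

import csv

documents = [
    'First document text goes here.',
    'Second document text goes here.',
]
with open('data/example.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text', 'label'])  # header line is mandatory
    for text in documents:
        writer.writerow([text, 'misc'])  # text must be the first column
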
  2. Run train.py with selected parameters (models are saved in the models directory).
python train.py start --data_file_name 'example.csv' --num_epochs 100 --batch_size 32 --num_noise_words 2 --vec_dim 100 --lr 1e-3

Parameters

  • data_file_name: str
    Name of a file in the data directory.
  • model_ver: str, one of ('dm', 'dbow'), default='dbow'
    Version of the model as proposed by Q. V. Le et al. [5], Distributed Representations of Sentences and Documents. 'dbow' stands for Distributed Bag Of Words, 'dm' stands for Distributed Memory.
  • vec_combine_method: str, one of ('sum', 'concat'), default='sum'
    Method for combining paragraph and word vectors when model_ver='dm'. Currently only the 'sum' operation is implemented.
  • context_size: int, default=0
    Half the size of the neighbourhood of a target word when model_ver='dm' (i.e. how many words to the left and right are regarded as context). When model_ver='dm', context_size has to be greater than 0; when model_ver='dbow', context_size has to be 0 (see the example command after this parameter list).
  • num_noise_words: int
    Number of noise words to sample from the noise distribution.
  • vec_dim: int
    Dimensionality of vectors to be learned (for paragraphs and words).
  • num_epochs: int
    Number of iterations to train the model (i.e. number of times every example is seen during training).
  • batch_size: int
    Number of examples per single gradient update.
  • lr: float
    Learning rate of the Adam optimizer.
  • save_all: bool, default=False
    Indicates whether a checkpoint is saved after each epoch. If false, only the best performing model is saved.
  • generate_plot: bool, default=True
    Indicates whether a diagnostic plot displaying loss value over epochs is generated after each epoch.
  • max_generated_batches: int, default=5
    Maximum number of pre-generated batches.
  • num_workers: int, default=1
    Number of batch generator jobs to run in parallel. If set to -1, the total number of machine CPUs is used. Note that the order of batches is not guaranteed when num_workers > 1.
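
For example, to train the Distributed Memory version, context_size has to be positive (the value 4 below is an arbitrary illustration, not a recommended setting):
python train.py start --data_file_name 'example.csv' --model_ver 'dm' --context_size 4 --num_noise_words 2 --vec_dim 100 --num_epochs 100 --batch_size 32 --lr 1e-3
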
  3. Export trained paragraph vectors to a csv file (vectors are saved in the data directory); a sketch of reading the exported file follows the parameters below.
python export_vectors.py start --data_file_name 'example.csv' --model_file_name 'example_model.dbow_numnoisewords.2_vecdim.100_batchsize.32_lr.0.001000_epoch.25_loss.0.981524.pth.tar'

Parameters

  • data_file_name: str
    Name of a file in the data directory that was used during training.
  • model_file_name: str
    Name of a file in the models directory (a model trained on the data_file_name dataset).
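
Once exported, the vectors can be read back with ordinary tooling. A minimal sketch, assuming the exported csv holds one document per row with the vector components in its columns (the file name below is hypothetical):

import numpy as np
import pandas as pd

vectors = pd.read_csv('data/example_vectors.csv').to_numpy(dtype=float)

# Cosine similarity between the first two document vectors.
a, b = vectors[0], vectors[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))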

Example of trained vectors

First two principal components (1% cumulative variance explained) of 300-dimensional document vectors trained on arXiv abstracts. Shown are two subcategories from Computer Science. The dataset comprised 74219 documents and 91417 unique words.
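
A projection like this can be reproduced from exported vectors with scikit-learn; a sketch under the same assumptions about the csv layout as above (file name hypothetical):

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

vectors = pd.read_csv('data/example_vectors.csv').to_numpy(dtype=float)
components = PCA(n_components=2).fit_transform(vectors)

plt.scatter(components[:, 0], components[:, 1], s=2)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.show()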

Resources

[1] Distributed Representations of Words and Phrases and their Compositionality, T. Mikolov et al.
[5] Distributed Representations of Sentences and Documents, Q. V. Le, T. Mikolov.