sdimi / Average Word2vec

🔤 Calculate average word embeddings (word2vec) from documents for transfer learning

Programming Languages

python

Projects that are alternatives of or similar to Average Word2vec

Modelsgenesis
Official Keras & PyTorch Implementation and Pre-trained Models for Models Genesis - MICCAI 2019
Stars: ✭ 416 (+700%)
Mutual labels:  jupyter-notebook, transfer-learning
Getting Things Done With Pytorch
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoencoders, Object Detection with YOLO v5, Build your first Neural Network, Time Series forecasting for Coronavirus daily cases, Sentiment Analysis with BERT.
Stars: ✭ 738 (+1319.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Nlp Notebooks
A collection of notebooks for Natural Language Processing from NLP Town
Stars: ✭ 513 (+886.54%)
Mutual labels:  jupyter-notebook, word-embeddings
Trainyourownyolo
Train a state-of-the-art yolov3 object detector from scratch!
Stars: ✭ 399 (+667.31%)
Mutual labels:  jupyter-notebook, transfer-learning
Seismic Transfer Learning
Deep-learning seismic facies on state-of-the-art CNN architectures
Stars: ✭ 32 (-38.46%)
Mutual labels:  jupyter-notebook, transfer-learning
Xlearn
Transfer Learning Library
Stars: ✭ 406 (+680.77%)
Mutual labels:  jupyter-notebook, transfer-learning
Tensorflow 101
TensorFlow 101: Introduction to Deep Learning for Python Within TensorFlow
Stars: ✭ 642 (+1134.62%)
Mutual labels:  jupyter-notebook, transfer-learning
Biosentvec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
Stars: ✭ 308 (+492.31%)
Mutual labels:  jupyter-notebook, word-embeddings
Syntree2vec
An algorithm to augment syntactic hierarchy into word embeddings
Stars: ✭ 9 (-82.69%)
Mutual labels:  jupyter-notebook, word-embeddings
Concise Ipython Notebooks For Deep Learning
Ipython Notebooks for solving problems like classification, segmentation, generation using latest Deep learning algorithms on different publicly available text and image data-sets.
Stars: ✭ 23 (-55.77%)
Mutual labels:  jupyter-notebook, word-embeddings
Amazon Forest Computer Vision
Amazon Forest Computer Vision: Satellite Image tagging code using PyTorch / Keras with lots of PyTorch tricks
Stars: ✭ 346 (+565.38%)
Mutual labels:  jupyter-notebook, transfer-learning
Teacher Student Training
This repository stores the files used for my summer internship's work on "teacher-student learning", an experimental method for training deep neural networks using a trained teacher model.
Stars: ✭ 34 (-34.62%)
Mutual labels:  jupyter-notebook, transfer-learning
Fast Pytorch
Pytorch Tutorial, Pytorch with Google Colab, Pytorch Implementations: CNN, RNN, DCGAN, Transfer Learning, Chatbot, Pytorch Sample Codes
Stars: ✭ 346 (+565.38%)
Mutual labels:  jupyter-notebook, transfer-learning
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+682.69%)
Mutual labels:  jupyter-notebook, word-embeddings
Ner Bert
BERT-NER (nert-bert) with google bert https://github.com/google-research.
Stars: ✭ 339 (+551.92%)
Mutual labels:  jupyter-notebook, transfer-learning
Video Classification
Tutorial for video classification/ action recognition using 3D CNN/ CNN+RNN on UCF101
Stars: ✭ 543 (+944.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Hands On Deep Learning Algorithms With Python
Master Deep Learning Algorithms with Extensive Math by Implementing them using TensorFlow
Stars: ✭ 272 (+423.08%)
Mutual labels:  jupyter-notebook, word-embeddings
Pytorch Nlp Notebooks
Learn how to use PyTorch to solve some common NLP problems with deep learning.
Stars: ✭ 293 (+463.46%)
Mutual labels:  jupyter-notebook, transfer-learning
Skin Cancer Image Classification
Skin cancer classification using Inceptionv3
Stars: ✭ 16 (-69.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Densedepth
High Quality Monocular Depth Estimation via Transfer Learning
Stars: ✭ 963 (+1751.92%)
Mutual labels:  jupyter-notebook, transfer-learning

Average words to represent documents with word2vec

A quick Python script I wrote to process the 20 Newsgroups dataset with word embeddings. It is best run in a Jupyter Notebook. Most pre-trained word2vec models give you numerical representations of individual words, but not of entire documents. While more sophisticated methods such as doc2vec exist, this script simply averages the embeddings of all words in a document, so that the resulting document vector is the centroid of its words in feature space.

How can I use it?

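The pipeline is: load a pre-trained word2vec model, tokenize each document, and average the embeddings of its words. A minimal sketch of the core step (the helper name `document_vector` is illustrative, not necessarily what the notebook uses; `model` is a gensim `KeyedVectors` instance):

```python
import numpy as np

def document_vector(model, tokens):
    """Return the centroid of the word2vec embeddings of `tokens`.

    Words missing from the model's vocabulary are skipped, so the
    caller should ensure at least one token is in-vocabulary.
    """
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0)  # 300-dimensional document vector
```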

Dependencies

gensim (to load the word2vec model)

numpy (for averaging and array manipulation)

Optional

nltk (for text pre-processing)

sklearn (for dataset loading)

matplotlib (for plotting)

Motivation and background

There are many options for representing text as numbers. The simplest is a word frequency matrix that counts the occurrences of each word. A variant of this method estimates the log-scaled frequency of each word while accounting for its occurrence across all documents (tf-idf). Another popular option is to take the context around each word into account (n-grams), so that, for example, "New York" is evaluated as a bigram and not as two separate words. However, these methods capture only frequencies, not the high-level semantics of the text.

A more recent advance in the field of Natural Language Processing is the use of word embeddings: dense representations of text learned by a feed-forward neural network. Each word is represented as a point embedded in a high-dimensional space, and with careful training, words that can be used interchangeably end up with similar embeddings. A popular word-embedding model is word2vec: a simple neural network with one hidden layer that sums word embeddings and, instead of minimizing a multi-class logistic loss (softmax), minimizes a binary logistic loss on positive and negative samples, which lets it handle huge vocabularies efficiently.
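For contrast with the frequency-based baselines mentioned above, here is a quick scikit-learn sketch (not part of this repo's pipeline) showing how an n-gram tf-idf matrix treats "New York" as a single feature:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love New York", "New York is a big city"]

# Unigrams and bigrams: "new york" becomes a feature of its own.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # includes 'new york'
print(X.shape)                             # (2, n_features) sparse tf-idf matrix
```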

To represent the 20 Newsgroups documents, I use a pre-trained word2vec model provided by Google. This model was trained on 100 billion words of Google News and contains 300-dimensional vectors for 3 million words and phrases. As pre-processing, the 20 Newsgroups dataset was tokenized and the English stop words were removed. Empty documents were removed (555 documents deleted), as were documents with no words at all in the word2vec vocabulary (9 documents deleted). The resulting dataset consists of 18,282 documents. For each document, the mean of the embeddings of its words was computed, so that each document is represented by a 300-dimensional vector.
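A sketch of this pre-processing and averaging, assuming `documents` holds the raw texts and `model` is the loaded gensim `KeyedVectors` (the notebook's exact steps may differ slightly):

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# First run only: nltk.download('punkt'); nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, tokenize, keep alphabetic tokens, drop English stop words.
    return [t for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stop_words]

# Drop documents with no in-vocabulary words, then average word vectors.
doc_vectors = []
for doc in documents:
    tokens = [t for t in preprocess(doc) if t in model]
    if tokens:  # skips empty and all-out-of-vocabulary documents
        doc_vectors.append(np.mean([model[t] for t in tokens], axis=0))
doc_vectors = np.array(doc_vectors)  # shape: (n_documents, 300)
```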

The newsgroup dataset was retrieved via its helper function in the Python library scikit-learn. The pre-trained word2vec model is available here. The gensim library was used to load and process the model.
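The loading steps, as a sketch (the model filename is the usual one for Google's release, a multi-gigabyte file; adjust the path to wherever you saved it):

```python
from sklearn.datasets import fetch_20newsgroups
from gensim.models import KeyedVectors

# Fetch the raw 20 Newsgroups texts via scikit-learn's helper function.
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data  # list of raw document strings

# Load Google's pre-trained 300-dimensional vectors.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
```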

⚠️ Progress in NLP (2021 update): Word2vec was a very popular method a few years ago, but the field is moving fast. You might be better off using more recent frameworks such as BERT, Transformers, and spaCy. Word2vec is still a good choice, though, for context-independent language modeling (see differences).
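For comparison, spaCy's pipelines with static vectors expose the same averaging idea directly: `Doc.vector` defaults to the average of the token vectors. A sketch, assuming the `en_core_web_md` model is installed:

```python
import spacy

# en_core_web_md ships static word vectors; Doc.vector averages them.
nlp = spacy.load("en_core_web_md")
doc = nlp("Average the word vectors of a document.")
print(doc.vector.shape)  # (300,) document centroid, same idea as this repo
```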

How to cite our papers

This code was developed as part of the data pre-processing section for our papers on interactive dimensionality reduction. Please consider citing our papers if you use code or ideas from this project:

[1] Spathis, Dimitris, Nikolaos Passalis, and Anastasios Tefas. "Interactive dimensionality reduction using similarity projections." Knowledge-Based Systems 165 (2019): 77-91.

[2] Spathis, Dimitris, Nikolaos Passalis, and Anastasios Tefas. "Fast, Visual and Interactive Semi-supervised Dimensionality Reduction." ECCV Efficient Feature Representation Learning workshop (2018), Munich, Germany.
