sdimi / Average Word2vec

🔤 Calculate average word embeddings (word2vec) from documents for transfer learning

Programming Languages

python

Projects that are alternatives of or similar to Average Word2vec

Modelsgenesis
Official Keras & PyTorch Implementation and Pre-trained Models for Models Genesis - MICCAI 2019
Stars: ✭ 416 (+700%)
Mutual labels:  jupyter-notebook, transfer-learning
Getting Things Done With Pytorch
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoencoders, Object Detection with YOLO v5, Build your first Neural Network, Time Series forecasting for Coronavirus daily cases, Sentiment Analysis with BERT.
Stars: ✭ 738 (+1319.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Nlp Notebooks
A collection of notebooks for Natural Language Processing from NLP Town
Stars: ✭ 513 (+886.54%)
Mutual labels:  jupyter-notebook, word-embeddings
Trainyourownyolo
Train a state-of-the-art yolov3 object detector from scratch!
Stars: ✭ 399 (+667.31%)
Mutual labels:  jupyter-notebook, transfer-learning
Seismic Transfer Learning
Deep-learning seismic facies on state-of-the-art CNN architectures
Stars: ✭ 32 (-38.46%)
Mutual labels:  jupyter-notebook, transfer-learning
Xlearn
Transfer Learning Library
Stars: ✭ 406 (+680.77%)
Mutual labels:  jupyter-notebook, transfer-learning
Tensorflow 101
TensorFlow 101: Introduction to Deep Learning for Python Within TensorFlow
Stars: ✭ 642 (+1134.62%)
Mutual labels:  jupyter-notebook, transfer-learning
Biosentvec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
Stars: ✭ 308 (+492.31%)
Mutual labels:  jupyter-notebook, word-embeddings
Syntree2vec
An algorithm to augment syntactic hierarchy into word embeddings
Stars: ✭ 9 (-82.69%)
Mutual labels:  jupyter-notebook, word-embeddings
Concise Ipython Notebooks For Deep Learning
Ipython Notebooks for solving problems like classification, segmentation, generation using latest Deep learning algorithms on different publicly available text and image data-sets.
Stars: ✭ 23 (-55.77%)
Mutual labels:  jupyter-notebook, word-embeddings
Amazon Forest Computer Vision
Amazon Forest Computer Vision: Satellite Image tagging code using PyTorch / Keras with lots of PyTorch tricks
Stars: ✭ 346 (+565.38%)
Mutual labels:  jupyter-notebook, transfer-learning
Teacher Student Training
This repository stores the files used for my summer internship's work on "teacher-student learning", an experimental method for training deep neural networks using a trained teacher model.
Stars: ✭ 34 (-34.62%)
Mutual labels:  jupyter-notebook, transfer-learning
Fast Pytorch
Pytorch Tutorial, Pytorch with Google Colab, Pytorch Implementations: CNN, RNN, DCGAN, Transfer Learning, Chatbot, Pytorch Sample Codes
Stars: ✭ 346 (+565.38%)
Mutual labels:  jupyter-notebook, transfer-learning
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+682.69%)
Mutual labels:  jupyter-notebook, word-embeddings
Ner Bert
BERT-NER (nert-bert) with google bert https://github.com/google-research.
Stars: ✭ 339 (+551.92%)
Mutual labels:  jupyter-notebook, transfer-learning
Video Classification
Tutorial for video classification/ action recognition using 3D CNN/ CNN+RNN on UCF101
Stars: ✭ 543 (+944.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Hands On Deep Learning Algorithms With Python
Master Deep Learning Algorithms with Extensive Math by Implementing them using TensorFlow
Stars: ✭ 272 (+423.08%)
Mutual labels:  jupyter-notebook, word-embeddings
Pytorch Nlp Notebooks
Learn how to use PyTorch to solve some common NLP problems with deep learning.
Stars: ✭ 293 (+463.46%)
Mutual labels:  jupyter-notebook, transfer-learning
Skin Cancer Image Classification
Skin cancer classification using Inceptionv3
Stars: ✭ 16 (-69.23%)
Mutual labels:  jupyter-notebook, transfer-learning
Densedepth
High Quality Monocular Depth Estimation via Transfer Learning
Stars: ✭ 963 (+1751.92%)
Mutual labels:  jupyter-notebook, transfer-learning

Average words to represent documents with word2vec

A quick Python script I wrote to process the 20 Newsgroups dataset with word embeddings. It is best run in a Jupyter Notebook. Most pre-trained word2vec models give you numerical representations of individual words, but not of entire documents. While more sophisticated methods such as doc2vec exist, this script simply averages the embeddings of all words in a document, so that the resulting document vector is the centroid of its words in feature space.

How can I use it?

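The pipeline is: load a pre-trained word2vec model, tokenize each document, and average the embeddings of its words. A minimal sketch of the core step (the helper name `document_vector` is illustrative, not necessarily what the notebook uses; `model` is a gensim `KeyedVectors` instance):

```python
import numpy as np

def document_vector(model, tokens):
    """Return the centroid of the word2vec embeddings of `tokens`.

    Words missing from the model's vocabulary are skipped, so the
    caller should ensure at least one token is in-vocabulary.
    """
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0)  # 300-dimensional document vector
```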

Dependencies

gensim (to load the word2vec model)

numpy (for averaging and array manipulation)

Optional

nltk (for text pre-processing)

sklearn (for dataset loading)

matplotlib (for plotting)

Motivation and background

There are many options for representing text as numbers. The simplest is a word frequency matrix that counts the occurrences of each word. A variant of this method estimates the log-scaled frequency of each word while accounting for its occurrence across all documents (tf-idf). Another popular option is to take the context around each word into account (n-grams), so that, for example, "New York" is evaluated as a bigram and not as two separate words. However, these methods capture only frequencies, not the high-level semantics of the text.

A more recent advance in the field of Natural Language Processing is the use of word embeddings: dense representations of text learned by a feed-forward neural network. Each word is represented as a point embedded in a high-dimensional space, and with careful training, words that can be used interchangeably end up with similar embeddings. A popular word-embedding model is word2vec: a simple neural network with one hidden layer that sums word embeddings and, instead of minimizing a multi-class logistic loss (softmax), minimizes a binary logistic loss on positive and negative samples, which lets it handle huge vocabularies efficiently.
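For contrast with the frequency-based baselines mentioned above, here is a quick scikit-learn sketch (not part of this repo's pipeline) showing how an n-gram tf-idf matrix treats "New York" as a single feature:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love New York", "New York is a big city"]

# Unigrams and bigrams: "new york" becomes a feature of its own.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # includes 'new york'
print(X.shape)                             # (2, n_features) sparse tf-idf matrix
```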

To represent the 20 Newsgroups documents, I use a pre-trained word2vec model provided by Google. This model was trained on 100 billion words of Google News and contains 300-dimensional vectors for 3 million words and phrases. As pre-processing, the 20 Newsgroups dataset was tokenized and the English stop words were removed. Empty documents were removed (555 documents deleted), as were documents with no words at all in the word2vec vocabulary (9 documents deleted). The resulting dataset consists of 18,282 documents. For each document, the mean of the embeddings of its words was computed, so that each document is represented by a 300-dimensional vector.
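A sketch of this pre-processing and averaging, assuming `documents` holds the raw texts and `model` is the loaded gensim `KeyedVectors` (the notebook's exact steps may differ slightly):

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# First run only: nltk.download('punkt'); nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, tokenize, keep alphabetic tokens, drop English stop words.
    return [t for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stop_words]

# Drop documents with no in-vocabulary words, then average word vectors.
doc_vectors = []
for doc in documents:
    tokens = [t for t in preprocess(doc) if t in model]
    if tokens:  # skips empty and all-out-of-vocabulary documents
        doc_vectors.append(np.mean([model[t] for t in tokens], axis=0))
doc_vectors = np.array(doc_vectors)  # shape: (n_documents, 300)
```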

The newsgroup dataset was retrieved via its helper function in the Python library scikit-learn. The pre-trained word2vec model is available here. The gensim library was used to load and process the model.
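The loading steps, as a sketch (the model filename is the usual one for Google's release, a multi-gigabyte file; adjust the path to wherever you saved it):

```python
from sklearn.datasets import fetch_20newsgroups
from gensim.models import KeyedVectors

# Fetch the raw 20 Newsgroups texts via scikit-learn's helper function.
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data  # list of raw document strings

# Load Google's pre-trained 300-dimensional vectors.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
```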

⚠️ Progress in NLP (2021 update): Word2vec was a very popular method a few years ago, but the field is moving fast. You might be better off using more recent frameworks such as BERT, Transformers, and spaCy. Word2vec is still a good choice, though, for context-independent language modeling (see differences).
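For comparison, spaCy's pipelines with static vectors expose the same averaging idea directly: `Doc.vector` defaults to the average of the token vectors. A sketch, assuming the `en_core_web_md` model is installed:

```python
import spacy

# en_core_web_md ships static word vectors; Doc.vector averages them.
nlp = spacy.load("en_core_web_md")
doc = nlp("Average the word vectors of a document.")
print(doc.vector.shape)  # (300,) document centroid, same idea as this repo
```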

How to cite our papers

This code was developed as part of the data pre-processing section for our papers on interactive dimensionality reduction. Please consider citing our papers if you use code or ideas from this project:

[1] Spathis, Dimitris, Nikolaos Passalis, and Anastasios Tefas. "Interactive dimensionality reduction using similarity projections." Knowledge-Based Systems 165 (2019): 77-91.

[2] Spathis, Dimitris, Nikolaos Passalis, and Anastasios Tefas. "Fast, Visual and Interactive Semi-supervised Dimensionality Reduction." ECCV Efficient Feature Representation Learning workshop (2018), Munich, Germany.
