
oxford-cs-deepnlp-2017 / Practical 1

Oxford Deep NLP 2017 course - Practical 1: word2vec

Projects that are alternatives to or similar to Practical 1

Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+259.09%)
Mutual labels:  jupyter-notebook, natural-language-processing, word2vec
Deep Math Machine Learning.ai
A blog which talks about machine learning, deep learning algorithms and the Math. and Machine learning algorithms written from scratch.
Stars: ✭ 173 (-21.36%)
Mutual labels:  jupyter-notebook, natural-language-processing, word2vec
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (-14.09%)
Mutual labels:  jupyter-notebook, natural-language-processing, word2vec
Awesome Embedding Models
A curated list of awesome embedding model tutorials, projects and communities.
Stars: ✭ 1,486 (+575.45%)
Mutual labels:  jupyter-notebook, natural-language-processing, word2vec
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal event logs
Stars: ✭ 169 (-23.18%)
Mutual labels:  jupyter-notebook, word2vec
Fixy
Our goal is to create an open-source spelling assistant/checker that solves many different problems in the Turkish NLP literature at once, proposes unique approaches, and addresses the shortcomings of existing work in the literature. It corrects spelling mistakes in users' texts with a deep learning approach, while also performing semantic analysis of the text to detect and correct errors that arise in that context.
Stars: ✭ 165 (-25%)
Mutual labels:  jupyter-notebook, natural-language-processing
Dive Into Dl Pytorch
This project converts the original MXNet implementation in the book Dive into Deep Learning (《动手学深度学习》) into a PyTorch implementation.
Stars: ✭ 14,234 (+6370%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nel
Entity linking framework
Stars: ✭ 176 (-20%)
Mutual labels:  jupyter-notebook, natural-language-processing
Rnn lstm from scratch
How to build RNNs and LSTMs from scratch with NumPy.
Stars: ✭ 156 (-29.09%)
Mutual labels:  jupyter-notebook, natural-language-processing
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (-20.45%)
Mutual labels:  jupyter-notebook, natural-language-processing
Nlp profiler
A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler returns either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (-17.73%)
Mutual labels:  jupyter-notebook, natural-language-processing
Newsrecommender
A news recommendation system tailored for user communities
Stars: ✭ 164 (-25.45%)
Mutual labels:  jupyter-notebook, natural-language-processing
Mixtext
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
Stars: ✭ 159 (-27.73%)
Mutual labels:  jupyter-notebook, natural-language-processing
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+5701.36%)
Mutual labels:  natural-language-processing, word2vec
Bert Sklearn
A sklearn wrapper for Google's BERT model
Stars: ✭ 182 (-17.27%)
Mutual labels:  jupyter-notebook, natural-language-processing
Notebooks
Jupyter Notebooks with Deep Learning Tutorials
Stars: ✭ 188 (-14.55%)
Mutual labels:  jupyter-notebook, natural-language-processing
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (-20.45%)
Mutual labels:  jupyter-notebook, word2vec
Aind Nlp
Coding exercises for the Natural Language Processing concentration, part of Udacity's AIND program.
Stars: ✭ 202 (-8.18%)
Mutual labels:  jupyter-notebook, natural-language-processing
Natural Language Processing Specialization
This repo contains my coursework, assignments, and slides for the Natural Language Processing Specialization by deeplearning.ai on Coursera
Stars: ✭ 151 (-31.36%)
Mutual labels:  jupyter-notebook, natural-language-processing
Pytorch Question Answering
Important paper implementations for Question Answering using PyTorch
Stars: ✭ 154 (-30%)
Mutual labels:  jupyter-notebook, natural-language-processing

Practical 1: word2vec

[Brendan Shillingford, Yannis Assael, Chris Dyer]

For this practical, you'll be provided with a partially complete IPython notebook. IPython notebooks are an interactive, web-based Python computing environment that lets us mix text, code, and interactive plots.

We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package gensim. After training the models, we will analyze and visualize the learned embeddings.

Setup and installation

On a lab workstation, clone the practical repository and run the install-python.sh shell script in a terminal to install Anaconda with Python 3 and the packages required for this practical.

Run ipython notebook in the repository directory and open the practical.ipynb notebook in your browser.

Preliminaries

Preprocessing

The code for downloading the dataset and preprocessing it is prewritten to save time. However, you will be expected to perform this kind of task yourself in future practicals, given raw data, so read the code and make sure you understand it. Often one would use a library like nltk to simplify this task, but here we have instead opted to use regular expressions via Python's re module.
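
As a rough, hypothetical sketch only (not the notebook's actual code), a regex-based preprocessing pass might look like this:

```python
import re

def preprocess(raw_text):
    # Lowercase, then keep only letters, apostrophes, spaces,
    # and sentence-ending punctuation.
    text = re.sub(r"[^a-z.!?' ]+", " ", raw_text.lower())
    # Split into sentences on ., ! or ?, then tokenize each
    # sentence on whitespace, dropping empty sentences.
    return [s.split() for s in re.split(r"[.!?]", text) if s.split()]
```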

Word frequencies

Make a list of the most common words and their occurrence counts. Take a look at the top 40 words. You may want to use the sklearn.feature_extraction.text module's CountVectorizer class or the collections module's Counter class.
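
For example, a minimal counting sketch with collections.Counter (sentences_ted is a stand-in name for the preprocessed list of token lists):

```python
from collections import Counter

# Flatten the sentences and count every token occurrence.
counts = Counter(w for sentence in sentences_ted for w in sentence)

# The 40 most common words with their counts.
print(counts.most_common(40))
```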

Take the top 1000 words, and plot a histogram of their counts. The plotting code for an interactive histogram is already given in the notebook.

Handin: show the frequency distribution histogram.

Training Word2Vec

Now that we have a processed list of sentences, let's run the word2vec training. Begin by reading the gensim documentation for word2vec at https://radimrehurek.com/gensim/models/word2vec.html to figure out how to use the Word2Vec class. Learn embeddings in $\mathbb R^{100}$ using CBOW (which is the default). Leave the other options at their defaults, except set min_count=10 so that infrequent words are ignored. Training should take under half a minute.
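
Concretely, the training call might look like the sketch below, with sentences_ted as before; note that gensim 4.0+ renamed the size argument to vector_size:

```python
from gensim.models import Word2Vec

# CBOW (sg=0, the default) embeddings in R^100, ignoring words
# that occur fewer than 10 times. In gensim >= 4.0, pass
# vector_size=100 instead of size=100.
model_ted = Word2Vec(sentences_ted, size=100, sg=0, min_count=10)
```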

If your trained Word2Vec instance is called model_ted, you should be able to check the vocabulary size using len(model_ted.vocab), which should be around 14427. Try using the most_similar() method to return a list of the most similar words to "man" and "computer".
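
For instance, using the gensim API of the course era (in gensim 4.0+ the vocabulary lives in model_ted.wv.key_to_index instead):

```python
# Vocabulary size after min_count filtering.
print(len(model_ted.wv.vocab))

# Nearest neighbours by cosine similarity in the embedding space.
print(model_ted.wv.most_similar("man"))
print(model_ted.wv.most_similar("computer"))
```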

Handin: find a few more words with interesting and/or surprising nearest neighbours.

Handin: find an interesting cluster in the t-SNE plot.

Optional, for enthusiastic students: try manually retrieving two word vectors using the indexing operator as described in gensim's documentation, then compute their cosine similarity (recall it is defined as $\mathrm{sim}(x,y) = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}$). You may be interested in np.dot() and np.linalg.norm(); see the numpy documentation for details. Compare this to the similarity computed by gensim's functions.
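
A minimal sketch of the manual computation ("woman" is just an arbitrary second word chosen for illustration):

```python
import numpy as np

# Retrieve two word vectors; the practical-era gensim docs also
# allow indexing the model directly, e.g. model_ted["man"].
x = model_ted.wv["man"]
y = model_ted.wv["woman"]

# Cosine similarity from the definition above.
cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Compare against gensim's built-in computation.
print(cos, model_ted.wv.similarity("man", "woman"))
```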

Comparison to vectors trained on WikiText-2 data

We have provided downloading/preprocessing code (similar to the previous code) for the WikiText-2 dataset. The code uses a random subsample of the data so it is comparable in size to the TED Talk data.

Repeat the same analysis as above but on this dataset.

Handin: find a few words with similar nearest neighbours.

Handin: find an interesting cluster in the t-SNE plot.

Handin: Are there any notable differences between the embeddings learned on the WikiText-2 data compared to those learned on the TED Talk data?

(Optional, for enthusiastic students) Clustering

If you have extra time, try performing a k-means clustering (e.g. using sklearn.cluster.KMeans) on the embeddings, tuning the number of clusters until you get interesting or meaningful clusters.
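
One possible sketch, assuming a gensim 3.x model (older versions expose the embedding matrix as model.wv.syn0, and in gensim 4.0+ the word list is model.wv.index_to_key):

```python
from sklearn.cluster import KMeans

# Embedding matrix of shape (vocab_size, 100), plus the word
# corresponding to each row.
X = model_ted.wv.vectors
words = model_ted.wv.index2word

# Cluster the embeddings; tune n_clusters until the clusters
# look meaningful.
kmeans = KMeans(n_clusters=30, random_state=0).fit(X)

# Inspect a few members of the first few clusters.
for k in range(5):
    members = [w for w, c in zip(words, kmeans.labels_) if c == k]
    print(k, members[:10])
```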

Handin

See the bolded "Handin:" parts above. On paper or verbally, show a practical demonstrator your responses to these to get signed off.
