
dperezrada / Keywords2vec

License: Apache-2.0

Projects that are alternatives of or similar to Keywords2vec

Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (+79.34%)
Mutual labels:  jupyter-notebook, text-mining
Nlpython
This repository contains the code related to Natural Language Processing using python scripting language. All the codes are related to my book entitled "Python Natural Language Processing"
Stars: ✭ 265 (+119.01%)
Mutual labels:  jupyter-notebook, text-mining
Nlp profiler
A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (+49.59%)
Mutual labels:  jupyter-notebook, text-mining
2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (+131.4%)
Mutual labels:  jupyter-notebook, text-mining
Autophrase
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Stars: ✭ 835 (+590.08%)
Mutual labels:  multi-language, text-mining
Applied Text Mining In Python
Repo for Applied Text Mining in Python (coursera) by University of Michigan
Stars: ✭ 59 (-51.24%)
Mutual labels:  jupyter-notebook, text-mining
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (+97.52%)
Mutual labels:  jupyter-notebook, text-mining
Nlp Notebooks
A collection of notebooks for Natural Language Processing from NLP Town
Stars: ✭ 513 (+323.97%)
Mutual labels:  jupyter-notebook, text-mining
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+552.89%)
Mutual labels:  jupyter-notebook, text-mining
Text Mining
Text Mining in Python
Stars: ✭ 18 (-85.12%)
Mutual labels:  jupyter-notebook, text-mining
Python nlp tutorial
This repository provides everything to get started with Python for Text Mining / Natural Language Processing (NLP)
Stars: ✭ 72 (-40.5%)
Mutual labels:  jupyter-notebook, text-mining
Yolov3 Point
从零开始学习YOLOv3教程解读代码+注意力模块(SE,SPP,RFB etc)
Stars: ✭ 119 (-1.65%)
Mutual labels:  jupyter-notebook
Tfeat
TFeat descriptor models for BMVC 2016 paper "Learning local feature descriptors with triplets and shallow convolutional neural networks"
Stars: ✭ 119 (-1.65%)
Mutual labels:  jupyter-notebook
Qml Rg
Quantum Machine Learning Reading Group @ ICFO
Stars: ✭ 120 (-0.83%)
Mutual labels:  jupyter-notebook
Keras transfer cifar10
Object classification with CIFAR-10 using transfer learning
Stars: ✭ 120 (-0.83%)
Mutual labels:  jupyter-notebook
Research public
Quantitative research and educational materials
Stars: ✭ 1,776 (+1367.77%)
Mutual labels:  jupyter-notebook
Limperg python
Repository with material for the Limperg Python course by Ties de Kok.
Stars: ✭ 121 (+0%)
Mutual labels:  jupyter-notebook
Capsule Gan
Code for my Master thesis on "Capsule Architecture as a Discriminator in Generative Adversarial Networks".
Stars: ✭ 120 (-0.83%)
Mutual labels:  jupyter-notebook
Hierarchical Attention Network
Implementation of Hierarchical Attention Networks in PyTorch
Stars: ✭ 120 (-0.83%)
Mutual labels:  jupyter-notebook
Ipynb playground
Various ipython notebooks
Stars: ✭ 1,531 (+1165.29%)
Mutual labels:  jupyter-notebook

keywords2vec

A simple and fast way to generate a word2vec model, with multi-word keywords instead of single words.

Example result

Finding similar keywords for "obesity"

index  term
0      overweight
1      obese
2      physical inactivity
3      excess weight
4      high bmi
5      obese adults
6      obese people
7      obesity-related outcomes
8      obesity among children
9      poor sleep quality
10     ssbs
11     obese populations
12     cardiometabolic risk
13     abdominal obesity

Install

pip install keywords2vec

How to use

Let's download some example data:

data_filepath = "epistemonikos_data_sample.tsv.gz"

!wget "https://s3.amazonaws.com/episte-labs/epistemonikos_data_sample.tsv.gz" -O "{data_filepath}"

We create the model. If you need the vectors, take a look here

labels, tree = similars_tree(data_filepath)
processing file: epistemonikos_data_sample.tsv.gz
100.00% [201/201 00:19<00:00]

Then we can get the most similar keywords:

get_similars(tree, labels, "obesity")
['obesity',
 'overweight',
 'obese',
 'physical inactivity',
 'excess weight',
 'high bmi',
 'obese adults',
 'obese people',
 'obesity-related outcomes',
 'obesity among children',
 'poor sleep quality',
 'ssbs',
 'obese populations',
 'cardiometabolic risk',
 'abdominal obesity']
get_similars(tree, labels, "heart failure")
['heart failure',
 'hf',
 'chf',
 'chronic heart failure',
 'reduced ejection fraction',
 'unstable angina',
 'peripheral vascular disease',
 'peripheral arterial disease',
 'angina',
 'congestive heart failure',
 'left ventricular systolic dysfunction',
 'acute coronary syndrome',
 'heart failure patients',
 'acute myocardial infarction',
 'left ventricular dysfunction']

Motivation

The idea started with the Epistemonikos database (www.epistemonikos.org), a database of scientific articles for people making decisions concerning clinical or health-policy questions. In this context, the scientific/health language used is complex. You can easily find keywords like:

  • asthma
  • heart failure
  • medial compartment knee osteoarthritis
  • preserved left ventricular systolic function
  • non-selective non-steroidal anti-inflammatory drugs

We tried several approaches to find those keywords, such as n-grams, n-grams + tf-idf, and entity recognition, among others, but we didn't get really good results.

Our approach

We found that tokenizing on stopwords + non-word characters was really useful for "finding" the keywords. An example:

  • input: "Timing of replacement therapy for acute renal failure after cardiac surgery"
  • output: [ "timing", "replacement therapy", "acute renal failure", "cardiac surgery" ]

So we basically split the text when we find:

  • a stopword
  • a non-word character (/, !, ?, ., etc.), except for - and '

That's it.
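The splitting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the library's actual code: the stopword list is deliberately tiny and the function name is made up.

```python
import re

# A minimal English stopword list for illustration; the real library
# presumably ships a much fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and", "or",
             "with", "after", "by", "at", "is", "are", "was", "were"}

def keyword_tokenize(text):
    """Split text into multi-word keywords, breaking at stopwords and
    non-word characters (keeping - and ' inside words)."""
    keywords, current = [], []
    # Tokens are either word-like runs (letters, digits, - and ')
    # or single non-word characters.
    for token in re.findall(r"[\w'-]+|[^\w\s'-]", text.lower()):
        if token in STOPWORDS or not re.match(r"[\w'-]", token):
            # Stopword or punctuation: close the current keyword.
            if current:
                keywords.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        keywords.append(" ".join(current))
    return keywords

print(keyword_tokenize(
    "Timing of replacement therapy for acute renal failure after cardiac surgery"))
# → ['timing', 'replacement therapy', 'acute renal failure', 'cardiac surgery']
```

Note that this simple version also exhibits the stopword problem described below: "Vitamin A" collapses to just "vitamin".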

But there were problems with some keywords that contain stopwords, like:

  • Vitamin A
  • Hepatitis A
  • Web of Science

So we decided to add another method (nltk with some grammar definitions) to cover most of those cases. To use it, pass the parameter keywords_w_stopwords=True; this method is approximately 20x slower.
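At its core, this kind of grammar pass is chinking over POS tags (see the References below): keep runs of tokens together and split only where an excluded tag appears, so stopwords inside a phrase survive. A stdlib-only sketch of the idea, with an illustrative tag set rather than the project's actual NLTK grammar:

```python
# Chinking = chunking by exclusion: split token runs at excluded POS tags.
# The tag set here is illustrative, not the project's actual grammar.
EXCLUDED_TAGS = {"VB", "VBD", "VBZ", ".", ","}  # verbs and punctuation

def chink(tagged_tokens):
    """Group POS-tagged (word, tag) pairs into phrases,
    splitting only at excluded tags."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in EXCLUDED_TAGS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases

tagged = [("patients", "NNS"), ("received", "VBD"), ("vitamin", "NN"),
          ("a", "DT"), ("supplements", "NNS")]
print(chink(tagged))  # → ['patients', 'vitamin a supplements']
```

Because the determiner "a" is not in the excluded set, "vitamin a" stays in one phrase, which is exactly the case the plain stopword tokenizer gets wrong.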

References

This seems to be an old idea (2004):

Mihalcea, Rada, and Paul Tarau. "Textrank: Bringing order into text." Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.

Reading an implementation of TextRank, I realized they used stopwords to separate phrases and build the graph. Then I thought of using that as the tokenizer for word2vec.

As pointed out by @deliprao in this Twitter thread, it's also used by RAKE (2010):

Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. 10.1002/9780470689646.ch1.

As noted by @astent in the Twitter thread, this concept is called chinking (chunking by exclusion): https://www.nltk.org/book/ch07.html#Chinking

Multi-lingual

We worked on an implementation that can be used across multiple languages. Of course, not all languages are suitable for this approach. We have tried it with good results in English, Spanish, and Portuguese.

Try it online

You can try it here (takes time to load, lowercase only, doesn't work on mobile yet). It's an MVP :)

These embeddings were created using 827,341 titles/abstracts from the @epistemonikos database, keeping keywords that repeat at least 10 times. The total vocabulary is 349,080 keywords (a really manageable number).

Vocab size

One of the main benefits of this method is the size of the vocabulary. For example, keeping keywords that repeat at least 10 times in the Epistemonikos dataset (827,341 titles/abstracts), we got the following vocabulary sizes:

ngrams               keywords    comp
1                    127,824     36%
1,2                  1,360,550   388%
1-3                  3,204,099   914%
1-4                  4,461,930   1,272%
1-5                  5,133,619   1,464%
stopword tokenizer   350,529     100%
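The comparison boils down to counting how many distinct tokens survive the minimum-frequency cut. A toy sketch of that counting (the data and function name are made up for illustration, this is not the project's analysis code):

```python
from collections import Counter

def vocab_size(token_lists, min_count=10):
    """Count distinct tokens that appear at least min_count times
    across all tokenized documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return sum(1 for c in counts.values() if c >= min_count)

# Toy corpus: 20 occurrences of "obesity", 10 of "heart failure".
docs = [["obesity", "heart failure"], ["obesity"]] * 10
print(vocab_size(docs, min_count=10))  # → 2
```

Because multi-word keywords replace the combinatorial explosion of n-grams, the surviving vocabulary stays in the same ballpark as unigrams while still capturing phrases.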

For more information regarding the comparison, take a look at the analyze folder.

Credits

This project was created using nbdev.
