
dheeraj7596 / Scdv

License: MIT
Text classification with Sparse Composite Document Vectors.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Scdv

Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+5092.59%)
Mutual labels:  information-retrieval, natural-language-processing, text-classification
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Stars: ✭ 24,626 (+45503.7%)
Mutual labels:  natural-language-processing, text-classification
Deep Semantic Similarity Model
My Keras implementation of the Deep Semantic Similarity Model (DSSM)/Convolutional Latent Semantic Model (CLSM) described here: http://research.microsoft.com/pubs/226585/cikm2014_cdssm_final.pdf.
Stars: ✭ 509 (+842.59%)
Mutual labels:  information-retrieval, natural-language-processing
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+10609.26%)
Mutual labels:  natural-language-processing, text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+40600%)
Mutual labels:  natural-language-processing, text-classification
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+751.85%)
Mutual labels:  information-retrieval, natural-language-processing
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+981.48%)
Mutual labels:  information-retrieval, natural-language-processing
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (+566.67%)
Mutual labels:  information-retrieval, text-classification
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1362.96%)
Mutual labels:  natural-language-processing, text-classification
Drl4nlp.scratchpad
Notes on Deep Reinforcement Learning for Natural Language Processing papers
Stars: ✭ 26 (-51.85%)
Mutual labels:  information-retrieval, natural-language-processing
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+1464.81%)
Mutual labels:  information-retrieval, natural-language-processing
Sequence Semantic Embedding
Tools and recipes to train deep learning models and build services for NLP tasks such as text classification, semantic search ranking and recall fetching, cross-lingual information retrieval, and question answering etc.
Stars: ✭ 435 (+705.56%)
Mutual labels:  information-retrieval, text-classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+594.44%)
Mutual labels:  information-retrieval, text-classification
Cdqa
⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
Stars: ✭ 500 (+825.93%)
Mutual labels:  information-retrieval, natural-language-processing
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (+566.67%)
Mutual labels:  natural-language-processing, text-classification
Pythoncode Tutorials
The Python Code Tutorials
Stars: ✭ 544 (+907.41%)
Mutual labels:  natural-language-processing, text-classification
Easy Deep Learning With Allennlp
🔮Deep Learning for text made easy with AllenNLP
Stars: ✭ 32 (-40.74%)
Mutual labels:  natural-language-processing, text-classification
Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (+451.85%)
Mutual labels:  natural-language-processing, text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+562.96%)
Mutual labels:  natural-language-processing, text-classification
Wikipedia2vec
A tool for learning vector representations of words and entities from Wikipedia
Stars: ✭ 655 (+1112.96%)
Mutual labels:  natural-language-processing, text-classification

Text Classification with Sparse Composite Document Vectors (SCDV)

Introduction

This repository implements text classification with Sparse Composite Document Vectors (SCDV), introduced in our EMNLP 2017 paper.

Citation

If you find SCDV useful in your research, please consider citing:

@inproceedings{mekala2017scdv,
  title={SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations},
  author={Mekala, Dheeraj and Gupta, Vivek and Paranjape, Bhargavi and Karnick, Harish},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  pages={659--669},
  year={2017}
}

New Features

  • Python 3.7: Since Python 2 is deprecated, the whole repository has been moved to Python 3.7.
  • FastText support: word vectors can now be trained with either Word2Vec or FastText (see the sketch below).
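
A minimal, self-contained sketch of training word vectors with gensim 3.8.x (the version pinned under Requirements); the toy corpus and hyperparameters are illustrative and not taken from the repository's Word2Vec.py or FastText.py:

from gensim.models import Word2Vec, FastText

corpus = [["sparse", "composite", "document", "vectors"],
          ["text", "classification", "with", "scdv"]]  # toy tokenized corpus

# In gensim 3.x the dimension argument is `size` (renamed `vector_size` in 4.x).
w2v = Word2Vec(corpus, size=200, window=5, min_count=1, sg=1, workers=4)
ft = FastText(corpus, size=200, window=5, min_count=1, sg=1, workers=4)

print(w2v.wv["scdv"].shape)  # (200,)
print(ft.wv["scdv"].shape)   # (200,); FastText also covers OOV words via subwords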

Testing

There are two folders, 20news and Reuters, which contain the code for multi-class classification on the 20Newsgroup dataset and multi-label classification on the Reuters dataset, respectively.

20Newsgroup

Change directory to 20news to experiment on the 20Newsgroup dataset, and create the train and test TSV files as follows:

$ cd 20news
$ python create_tsv.py
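
create_tsv.py ships with the repository. Purely as an illustration of the kind of tab-separated files this step produces, a label/text TSV can be built from scikit-learn's bundled copy of 20 Newsgroups; the file names and column layout below are assumptions, not the script's actual output:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

for split in ("train", "test"):
    data = fetch_20newsgroups(subset=split)
    rows = {
        "class": [data.target_names[t] for t in data.target],
        # flatten tabs/newlines so each document stays on one TSV line
        "text": [doc.replace("\t", " ").replace("\n", " ") for doc in data.data],
    }
    pd.DataFrame(rows).to_csv("%s.tsv" % split, sep="\t", index=False)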

Get word vectors for all words in the vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument; we used 200.

Get word vectors for all words in the vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes the word vector dimension as an argument; we used 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200 and 60 clusters.
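
For orientation, the SCDV construction from the paper works as follows: fit a Gaussian mixture model over the word vectors, build each word's word-topic vector by concatenating its vector scaled by the soft cluster-assignment probabilities, weight by idf, average over a document's words, and zero out small entries to make the result sparse. The sketch below follows that recipe but is not the repository's SCDV.py (for instance, the sparsity threshold here uses global extrema rather than the paper's per-document averages):

import numpy as np
from sklearn.mixture import GaussianMixture

def build_scdv(docs, wv, idf, n_clusters=60, p=0.04):
    """docs: list of token lists; wv: word -> d-dim numpy vector; idf: word -> idf weight."""
    words = [w for w in idf if w in wv]
    X = np.array([wv[w] for w in words])
    # soft cluster assignments P(cluster k | word) from a GMM over word vectors
    probs = GaussianMixture(n_components=n_clusters).fit(X).predict_proba(X)
    # word-topic vector: idf(w) * concat_k( P(k | w) * wv(w) )
    wtv = {w: idf[w] * np.concatenate([pk * wv[w] for pk in probs[i]])
           for i, w in enumerate(words)}
    dim = X.shape[1] * n_clusters
    D = np.array([np.mean([wtv[w] for w in doc if w in wtv], axis=0)
                  if any(w in wtv for w in doc) else np.zeros(dim)
                  for doc in docs])
    # hard-threshold near-zero entries: keep only values above p% of the
    # average absolute extremum, which makes the vectors sparse
    t = p * (abs(D.max()) + abs(D.min())) / 2.0
    D[np.abs(D) < t] = 0.0
    return D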

Get the topic coherence for documents in the train set:

$ python TopicCoherence.py 200 60 10 model_type
# TopicCoherence.py takes the word vector dimension, the number of clusters, the number of top words, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200, 60 clusters, and 10 top words.
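
The exact measure computed by TopicCoherence.py is not reproduced here; as one common choice, a UMass-style coherence over a topic's top words can be computed like this (an illustrative assumption):

import numpy as np

def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over top-word pairs."""
    doc_sets = [set(d) for d in docs]
    def df(*words):  # number of documents containing every word in `words`
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += np.log((df(top_words[i], top_words[j]) + 1.0)
                            / max(df(top_words[j]), 1))
    return score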

Reuters

Change directory to Reuters to experiment on the Reuters-21578 dataset. Since the Reuters data is in SGML format, parse it and create a pickle file of the parsed data as follows:

$ python create_data.py
# Train and test files are not saved locally; the data is split into train and test whenever needed.
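
As a rough sketch of the parsing step: sgmllib3k (listed under Requirements) restores the Python 2 sgmllib module on Python 3. The fields kept and the pickle layout in the repository's create_data.py may well differ; the file names below are examples only:

import pickle
from sgmllib import SGMLParser  # provided by sgmllib3k on Python 3

class ReutersParser(SGMLParser):
    """Collects (topics, body) pairs from Reuters-21578 SGML files."""
    def reset(self):
        SGMLParser.reset(self)
        self.docs, self.topics, self.body = [], [], ""
        self.in_body = self.in_topics = False
    def start_body(self, attrs): self.in_body = True
    def end_body(self):
        self.docs.append({"topics": self.topics, "body": self.body})
        self.topics, self.body, self.in_body = [], "", False
    def start_topics(self, attrs): self.in_topics = True
    def end_topics(self): self.in_topics = False
    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_topics and data.strip():
            self.topics.append(data.strip())

parser = ReutersParser()
with open("reut2-000.sgm", errors="ignore") as f:
    parser.feed(f.read())
parser.close()
with open("reuters.pkl", "wb") as f:
    pickle.dump(parser.docs, f)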

Get word vectors for all words in the vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument; we used 200.

Get word vectors for all words in the vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes the word vector dimension as an argument; we used 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200 and 60 clusters.

Get performance metrics on the test set:

$ python metrics.py 200 60
# metrics.py takes the word vector dimension and the number of clusters as arguments. We used a word vector dimension of 200 and 60 clusters.
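
The exact metric set reported by metrics.py is not shown here; for multi-label evaluation of this kind, scikit-learn provides the usual candidates, demonstrated on toy data below:

import numpy as np
from sklearn.metrics import (f1_score, coverage_error,
                             label_ranking_average_precision_score)

# toy gold labels and predicted scores for 3 classes on 2 documents
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
y_pred = (y_prob >= 0.5).astype(int)

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("coverage error:", coverage_error(y_true, y_prob))
print("label ranking avg precision:",
      label_ranking_average_precision_score(y_true, y_prob))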

Information Retrieval

Change directory to IR to experiment on the Information Retrieval task. The IR datasets mentioned in the paper can be downloaded from the TREC website.

You will need to run the documents and queries through a full-fledged IR pipeline such as Apache Lucene or Project Lemur in order to:

  • Tokenize the data, remove stop words, and pass the tokens through a Porter stemmer (a minimal preprocessing sketch follows this list).
  • Build inverted and forward indexes.
  • Build a basic language-model retrieval system with Dirichlet smoothing.
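
If you want to prototype just the preprocessing step without a full IR stack, a minimal sketch with NLTK (an assumption on our part; the paper's experiments used Lucene/Lemur-style pipelines) could look like this:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# one-time setup: nltk.download("punkt"); nltk.download("stopwords")
stop = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, Porter-stem."""
    return [stem(tok) for tok in word_tokenize(text.lower())
            if tok.isalnum() and tok not in stop]

print(preprocess("Sparse Composite Document Vectors for retrieval"))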

Data Format

  • The IR data folder must contain a file called "queries.txt" and a folder called raw that holds all the documents.
  • Each file in raw should be a single document of space-separated processed tokens, named doc_ID.txt.
  • Each line in queries.txt should be a single query of space-separated processed words (a loader sketch for this layout follows).
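
A small loader for exactly this layout might look as follows; the paths and naming follow the description above, and everything else is illustrative:

import glob
import os

def load_ir_data(data_dir):
    """Returns (queries, docs): tokenized queries and a doc_id -> tokens map."""
    with open(os.path.join(data_dir, "queries.txt")) as f:
        queries = [line.split() for line in f if line.strip()]
    docs = {}
    for path in glob.glob(os.path.join(data_dir, "raw", "doc_*.txt")):
        doc_id = os.path.basename(path)[len("doc_"):-len(".txt")]
        with open(path) as f:
            docs[doc_id] = f.read().split()
    return queries, docs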

To interpolate the language-model retrieval scores with the query-document scores obtained from SCDV:

Get word vectors for all terms in the vocabulary through Word2Vec:

$ python Word2Vec.py 300 sjm
# Word2Vec.py takes the word vector dimension and the folder containing the IR dataset as arguments. We used 300 and sjm (San Jose Mercury).

Get word vectors for all terms in the vocabulary through FastText:

$ python FastText.py 300 sjm
# FastText.py takes the word vector dimension and the folder containing the IR dataset as arguments. We used 300 and sjm (San Jose Mercury).

Create Sparse Composite Document Vectors (SCDV) for all documents and queries, and compute similarity scores for all query-document pairs:

$ python SCDV.py 300 100 sjm model_type
# SCDV.py takes the word vector dimension, the number of clusters, the folder containing the IR dataset, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 300, 100 clusters, and the sjm folder.
# Modify the code to store these scores in a format your IR system can consume.

Finally, interpolate these scores with the language-model scores using an interpolation parameter of 0.5.
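
In code, the interpolation is a convex combination; with the parameter set to 0.5 the final score is an equal mix of the two (function and variable names here are illustrative):

def interpolate(lm_score, scdv_score, lam=0.5):
    """Convex combination of language-model and SCDV query-document scores."""
    return lam * lm_score + (1.0 - lam) * scdv_score

# illustrative values only; real scores come from the IR system and SCDV.py
print(interpolate(lm_score=0.42, scdv_score=0.63))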

Requirements

Minimum requirements:

  • Python 3.7
  • NumPy 1.17.2
  • Scikit-learn 0.23.1
  • Pandas 0.25.1
  • Gensim 3.8.1
  • sgmllib3k

For the theory and a detailed explanation of SCDV, please see our EMNLP 2017 paper and blog post.

Note: You need not download the 20Newsgroup or Reuters-21578 datasets; both are included in their respective directories. We used an SGML parser (sgmllib3k) for parsing the Reuters-21578 dataset.
