
dheeraj7596 / Scdv

License: MIT
Text classification with Sparse Composite Document Vectors.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Scdv

Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+5092.59%)
Mutual labels:  information-retrieval, natural-language-processing, text-classification
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Stars: ✭ 24,626 (+45503.7%)
Mutual labels:  natural-language-processing, text-classification
Deep Semantic Similarity Model
My Keras implementation of the Deep Semantic Similarity Model (DSSM)/Convolutional Latent Semantic Model (CLSM) described here: http://research.microsoft.com/pubs/226585/cikm2014_cdssm_final.pdf.
Stars: ✭ 509 (+842.59%)
Mutual labels:  information-retrieval, natural-language-processing
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+10609.26%)
Mutual labels:  natural-language-processing, text-classification
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+40600%)
Mutual labels:  natural-language-processing, text-classification
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+751.85%)
Mutual labels:  information-retrieval, natural-language-processing
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+981.48%)
Mutual labels:  information-retrieval, natural-language-processing
Nlp Projects
word2vec, sentence2vec, machine reading comprehension, dialog system, text classification, pretrained language model (i.e., XLNet, BERT, ELMo, GPT), sequence labeling, information retrieval, information extraction (i.e., entity, relation and event extraction), knowledge graph, text generation, network embedding
Stars: ✭ 360 (+566.67%)
Mutual labels:  information-retrieval, text-classification
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1362.96%)
Mutual labels:  natural-language-processing, text-classification
Drl4nlp.scratchpad
Notes on Deep Reinforcement Learning for Natural Language Processing papers
Stars: ✭ 26 (-51.85%)
Mutual labels:  information-retrieval, natural-language-processing
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+1464.81%)
Mutual labels:  information-retrieval, natural-language-processing
Sequence Semantic Embedding
Tools and recipes to train deep learning models and build services for NLP tasks such as text classification, semantic search ranking and recall fetching, cross-lingual information retrieval, and question answering etc.
Stars: ✭ 435 (+705.56%)
Mutual labels:  information-retrieval, text-classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+594.44%)
Mutual labels:  information-retrieval, text-classification
Cdqa
⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
Stars: ✭ 500 (+825.93%)
Mutual labels:  information-retrieval, natural-language-processing
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (+566.67%)
Mutual labels:  natural-language-processing, text-classification
Pythoncode Tutorials
The Python Code Tutorials
Stars: ✭ 544 (+907.41%)
Mutual labels:  natural-language-processing, text-classification
Easy Deep Learning With Allennlp
🔮Deep Learning for text made easy with AllenNLP
Stars: ✭ 32 (-40.74%)
Mutual labels:  natural-language-processing, text-classification
Textfooler
A Model for Natural Language Attack on Text Classification and Inference
Stars: ✭ 298 (+451.85%)
Mutual labels:  natural-language-processing, text-classification
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+562.96%)
Mutual labels:  natural-language-processing, text-classification
Wikipedia2vec
A tool for learning vector representations of words and entities from Wikipedia
Stars: ✭ 655 (+1112.96%)
Mutual labels:  natural-language-processing, text-classification

Text Classification with Sparse Composite Document Vectors (SCDV)

Introduction

This repository implements text classification with Sparse Composite Document Vectors (SCDV), introduced in our EMNLP 2017 paper.

Citation

If you find SCDV useful in your research, please consider citing:

@inproceedings{mekala2017scdv,
  title={SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations},
  author={Mekala, Dheeraj and Gupta, Vivek and Paranjape, Bhargavi and Karnick, Harish},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  pages={659--669},
  year={2017}
}

New Features

  • Python 3.7: Since Python 2 is deprecated, the whole repository has been moved to Python 3.7.
  • FastText support: word vectors can now be trained with either Word2Vec or FastText (see the sketch below).
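
A minimal, self-contained sketch of training word vectors with gensim 3.8.x (the version pinned under Requirements); the toy corpus and hyperparameters are illustrative and not taken from the repository's Word2Vec.py or FastText.py:

from gensim.models import Word2Vec, FastText

corpus = [["sparse", "composite", "document", "vectors"],
          ["text", "classification", "with", "scdv"]]  # toy tokenized corpus

# In gensim 3.x the dimension argument is `size` (renamed `vector_size` in 4.x).
w2v = Word2Vec(corpus, size=200, window=5, min_count=1, sg=1, workers=4)
ft = FastText(corpus, size=200, window=5, min_count=1, sg=1, workers=4)

print(w2v.wv["scdv"].shape)  # (200,)
print(ft.wv["scdv"].shape)   # (200,); FastText also covers OOV words via subwords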

Testing

There are two folders, 20news and Reuters, which contain the code for multi-class classification on the 20Newsgroup dataset and multi-label classification on the Reuters dataset, respectively.

20Newsgroup

Change directory to 20news to experiment on the 20Newsgroup dataset, and create the train and test TSV files as follows:

$ cd 20news
$ python create_tsv.py
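
create_tsv.py ships with the repository. Purely as an illustration of the kind of tab-separated files this step produces, a label/text TSV can be built from scikit-learn's bundled copy of 20 Newsgroups; the file names and column layout below are assumptions, not the script's actual output:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

for split in ("train", "test"):
    data = fetch_20newsgroups(subset=split)
    rows = {
        "class": [data.target_names[t] for t in data.target],
        # flatten tabs/newlines so each document stays on one TSV line
        "text": [doc.replace("\t", " ").replace("\n", " ") for doc in data.data],
    }
    pd.DataFrame(rows).to_csv("%s.tsv" % split, sep="\t", index=False)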

Get word vectors for all words in the vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument; we used 200.

Get word vectors for all words in the vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes the word vector dimension as an argument; we used 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200 and 60 clusters.
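
For orientation, the SCDV construction from the paper works as follows: fit a Gaussian mixture model over the word vectors, build each word's word-topic vector by concatenating its vector scaled by the soft cluster-assignment probabilities, weight by idf, average over a document's words, and zero out small entries to make the result sparse. The sketch below follows that recipe but is not the repository's SCDV.py (for instance, the sparsity threshold here uses global extrema rather than the paper's per-document averages):

import numpy as np
from sklearn.mixture import GaussianMixture

def build_scdv(docs, wv, idf, n_clusters=60, p=0.04):
    """docs: list of token lists; wv: word -> d-dim numpy vector; idf: word -> idf weight."""
    words = [w for w in idf if w in wv]
    X = np.array([wv[w] for w in words])
    # soft cluster assignments P(cluster k | word) from a GMM over word vectors
    probs = GaussianMixture(n_components=n_clusters).fit(X).predict_proba(X)
    # word-topic vector: idf(w) * concat_k( P(k | w) * wv(w) )
    wtv = {w: idf[w] * np.concatenate([pk * wv[w] for pk in probs[i]])
           for i, w in enumerate(words)}
    dim = X.shape[1] * n_clusters
    D = np.array([np.mean([wtv[w] for w in doc if w in wtv], axis=0)
                  if any(w in wtv for w in doc) else np.zeros(dim)
                  for doc in docs])
    # hard-threshold near-zero entries: keep only values above p% of the
    # average absolute extremum, which makes the vectors sparse
    t = p * (abs(D.max()) + abs(D.min())) / 2.0
    D[np.abs(D) < t] = 0.0
    return D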

Get the topic coherence for documents in the train set:

$ python TopicCoherence.py 200 60 10 model_type
# TopicCoherence.py takes the word vector dimension, the number of clusters, the number of top words, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200, 60 clusters, and 10 top words.
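
The exact measure computed by TopicCoherence.py is not reproduced here; as one common choice, a UMass-style coherence over a topic's top words can be computed like this (an illustrative assumption):

import numpy as np

def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over top-word pairs."""
    doc_sets = [set(d) for d in docs]
    def df(*words):  # number of documents containing every word in `words`
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += np.log((df(top_words[i], top_words[j]) + 1.0)
                            / max(df(top_words[j]), 1))
    return score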

Reuters

Change directory to Reuters to experiment on the Reuters-21578 dataset. Since the Reuters data is in SGML format, parse it and create a pickle file of the parsed data as follows:

$ python create_data.py
# Train and test files are not saved locally; the data is split into train and test whenever needed.
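
As a rough sketch of the parsing step: sgmllib3k (listed under Requirements) restores the Python 2 sgmllib module on Python 3. The fields kept and the pickle layout in the repository's create_data.py may well differ; the file names below are examples only:

import pickle
from sgmllib import SGMLParser  # provided by sgmllib3k on Python 3

class ReutersParser(SGMLParser):
    """Collects (topics, body) pairs from Reuters-21578 SGML files."""
    def reset(self):
        SGMLParser.reset(self)
        self.docs, self.topics, self.body = [], [], ""
        self.in_body = self.in_topics = False
    def start_body(self, attrs): self.in_body = True
    def end_body(self):
        self.docs.append({"topics": self.topics, "body": self.body})
        self.topics, self.body, self.in_body = [], "", False
    def start_topics(self, attrs): self.in_topics = True
    def end_topics(self): self.in_topics = False
    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_topics and data.strip():
            self.topics.append(data.strip())

parser = ReutersParser()
with open("reut2-000.sgm", errors="ignore") as f:
    parser.feed(f.read())
parser.close()
with open("reuters.pkl", "wb") as f:
    pickle.dump(parser.docs, f)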

Get word vectors for all words in the vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument; we used 200.

Get word vectors for all words in the vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes the word vector dimension as an argument; we used 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 200 and 60 clusters.

Get performance metrics on the test set:

$ python metrics.py 200 60
# metrics.py takes the word vector dimension and the number of clusters as arguments. We used a word vector dimension of 200 and 60 clusters.
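
The exact metric set reported by metrics.py is not shown here; for multi-label evaluation of this kind, scikit-learn provides the usual candidates, demonstrated on toy data below:

import numpy as np
from sklearn.metrics import (f1_score, coverage_error,
                             label_ranking_average_precision_score)

# toy gold labels and predicted scores for 3 classes on 2 documents
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
y_pred = (y_prob >= 0.5).astype(int)

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("coverage error:", coverage_error(y_true, y_prob))
print("label ranking avg precision:",
      label_ranking_average_precision_score(y_true, y_prob))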

Information Retrieval

Change directory to IR to experiment on the Information Retrieval task. The IR datasets mentioned in the paper can be downloaded from the TREC website.

You will need to run the documents and queries through a full-fledged IR pipeline such as Apache Lucene or Project Lemur in order to:

  • Tokenize the data, remove stop words, and pass the tokens through a Porter stemmer (a minimal preprocessing sketch follows this list).
  • Build inverted and forward indexes.
  • Build a basic language-model retrieval system with Dirichlet smoothing.
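
If you want to prototype just the preprocessing step without a full IR stack, a minimal sketch with NLTK (an assumption on our part; the paper's experiments used Lucene/Lemur-style pipelines) could look like this:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# one-time setup: nltk.download("punkt"); nltk.download("stopwords")
stop = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, Porter-stem."""
    return [stem(tok) for tok in word_tokenize(text.lower())
            if tok.isalnum() and tok not in stop]

print(preprocess("Sparse Composite Document Vectors for retrieval"))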

Data Format

  • The IR data folder must contain a file called "queries.txt" and a folder called raw that holds all the documents.
  • Each file in raw should be a single document of space-separated processed tokens, named doc_ID.txt.
  • Each line in queries.txt should be a single query of space-separated processed words (a loader sketch for this layout follows).
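
A small loader for exactly this layout might look as follows; the paths and naming follow the description above, and everything else is illustrative:

import glob
import os

def load_ir_data(data_dir):
    """Returns (queries, docs): tokenized queries and a doc_id -> tokens map."""
    with open(os.path.join(data_dir, "queries.txt")) as f:
        queries = [line.split() for line in f if line.strip()]
    docs = {}
    for path in glob.glob(os.path.join(data_dir, "raw", "doc_*.txt")):
        doc_id = os.path.basename(path)[len("doc_"):-len(".txt")]
        with open(path) as f:
            docs[doc_id] = f.read().split()
    return queries, docs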

To interpolate the language-model retrieval scores with the query-document scores obtained from SCDV:

Get word vectors for all terms in the vocabulary through Word2Vec:

$ python Word2Vec.py 300 sjm
# Word2Vec.py takes the word vector dimension and the folder containing the IR dataset as arguments. We used 300 and sjm (San Jose Mercury).

Get word vectors for all terms in the vocabulary through FastText:

$ python FastText.py 300 sjm
# FastText.py takes the word vector dimension and the folder containing the IR dataset as arguments. We used 300 and sjm (San Jose Mercury).

Create Sparse Composite Document Vectors (SCDV) for all documents and queries, and compute similarity scores for all query-document pairs:

$ python SCDV.py 300 100 sjm model_type
# SCDV.py takes the word vector dimension, the number of clusters, the folder containing the IR dataset, and model_type as arguments. model_type is the model used to train the word vectors, either "word2vec" or "fasttext". We used a word vector dimension of 300, 100 clusters, and the sjm folder.
# Modify the code to store these scores in a format your IR system can consume.

Finally, interpolate these scores with the language-model scores using an interpolation parameter of 0.5.
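
In code, the interpolation is a convex combination; with the parameter set to 0.5 the final score is an equal mix of the two (function and variable names here are illustrative):

def interpolate(lm_score, scdv_score, lam=0.5):
    """Convex combination of language-model and SCDV query-document scores."""
    return lam * lm_score + (1.0 - lam) * scdv_score

# illustrative values only; real scores come from the IR system and SCDV.py
print(interpolate(lm_score=0.42, scdv_score=0.63))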

Requirements

Minimum requirements:

  • Python 3.7
  • NumPy 1.17.2
  • Scikit-learn 0.23.1
  • Pandas 0.25.1
  • Gensim 3.8.1
  • sgmllib3k

For the theory and a detailed explanation of SCDV, please see our EMNLP 2017 paper and blog post.

Note: You need not download the 20Newsgroup or Reuters-21578 datasets; both are included in their respective directories. We used an SGML parser (sgmllib3k) for parsing the Reuters-21578 dataset.
