
amansrivastava17 / bns-short-text-similarity

License: MIT
📖 Use Bi-Normal Separation to build document vectors, which are then used to compute similarity for short sentences.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to bns-short-text-similarity

Naive-Resume-Matching
Text similarity applied to resumes: compares resumes with job descriptions and creates a score to rank them, similar to an ATS.
Stars: ✭ 27 (+12.5%)
Mutual labels:  text-classification, text-similarity, cosine-similarity
text analysis tools
Chinese text analysis toolkit (including text classification, text clustering, text similarity, keyword extraction, key-phrase extraction, sentiment analysis, text error correction, text summarization, topic keywords, synonyms and near-synonyms, and event triple extraction)
Stars: ✭ 410 (+1608.33%)
Mutual labels:  text-classification, text-similarity
Nepali-News-Classifier
Text Classification of Nepali Language Document. This Mini Project was done for the partial fulfillment of NLP Course : COMP 473.
Stars: ✭ 13 (-45.83%)
Mutual labels:  text-classification, tf-idf
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.
Stars: ✭ 81 (+237.5%)
Mutual labels:  text-classification, tf-idf
Content-based-Recommender-System
A content-based recommender system that uses TF-IDF and cosine similarity to find the N most similar items in a dataset.
Stars: ✭ 64 (+166.67%)
Mutual labels:  tf-idf, cosine-similarity
koolsla
Food recommendation tool with Machine learning.
Stars: ✭ 21 (-12.5%)
Mutual labels:  tf-idf, cosine-similarity
text-classification-baseline
Pipeline for quickly building TF-IDF + LogReg text classification baselines.
Stars: ✭ 55 (+129.17%)
Mutual labels:  text-classification, tf-idf
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets also included.
Stars: ✭ 2,112 (+8700%)
Mutual labels:  text-classification, text-similarity
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+254.17%)
Mutual labels:  text-classification, text-similarity
textgo
Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!
Stars: ✭ 33 (+37.5%)
Mutual labels:  text-classification, text-similarity
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+3191.67%)
Mutual labels:  text-classification, tf-idf
Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (+595.83%)
Mutual labels:  text-classification, tf-idf
Chinese text cnn
PyTorch implementation of TextCNN for Chinese text classification and sentiment analysis.
Stars: ✭ 235 (+879.17%)
Mutual labels:  text-classification
Indonesian-Twitter-Emotion-Dataset
Indonesian twitter dataset for emotion classification task
Stars: ✭ 49 (+104.17%)
Mutual labels:  text-classification
Fancy Nlp
NLP for human. A fast and easy-to-use natural language processing (NLP) toolkit, satisfying your imagination about NLP.
Stars: ✭ 233 (+870.83%)
Mutual labels:  text-classification
Pytorch Transformers Classification
Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Stars: ✭ 229 (+854.17%)
Mutual labels:  text-classification
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Stars: ✭ 68 (+183.33%)
Mutual labels:  text-classification
Document-Classification-using-LSA
Document classification using Latent semantic analysis in python
Stars: ✭ 16 (-33.33%)
Mutual labels:  tf-idf
Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+11583.33%)
Mutual labels:  text-classification
Paddlenlp
NLP Core Library and Model Zoo based on PaddlePaddle 2.0
Stars: ✭ 212 (+783.33%)
Mutual labels:  text-classification

BNS Vectorizer - Improved TF-IDF for shorter text

Bi-Normal Separation (BNS) is a popular method for scoring the importance of terms with respect to the category they belong to. It efficiently identifies important keywords in a document and assigns them a positive weight, while assigning negative scores to words that are unimportant for that document.

The variables used to calculate the Bi-Normal Separation score of a word for each category (class) are described below.

Why Is BNS Better than TF-IDF?

For short documents, TF-IDF and other term-frequency-based approaches do not perform well, because there are usually no words that occur more than once per document. We therefore need an approach that does not rely on term frequency within the document.

BNS overcomes this problem by assigning a weight to each term based on its occurrence in the positive and negative categories (classes). A term that occurs often in positive samples and seldom in negative ones will get a high BNS weight.

Also, whereas IDF gives a term a single value across all categories, BNS assigns a term a different weight in each category.
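For example, in the short query 'need a flight to goa' (from the usage section below), every word occurs exactly once, so TF is 1 for every term and TF-IDF collapses to plain IDF. IDF then gives 'flight' the same weight no matter which category is being scored, while BNS can give 'flight' a strong positive weight for the book_flight category (high tpr, low fpr) and a low or negative weight for book_taxi and nearby.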

Formula to calculate BNS:

  • pos = number of positive training cases (typically the minority class)
  • neg = number of negative training cases
  • tp = number of positive training cases containing the word
  • fp = number of negative training cases containing the word
  • fn = pos - tp
  • tn = neg - fp
  • tpr (true positive rate) = P(word | positive class) = tp/pos
  • fpr (false positive rate) = P(word | negative class) = fp/neg
  • bns (Bi-Normal Separation) = F^(-1)(tpr) - F^(-1)(fpr), where F^(-1) is the inverse Normal cumulative distribution function
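As a sanity check, here is a minimal sketch of this formula in Python (an illustration, not the repository's implementation), using scipy's norm.ppf as F^(-1). The rates are clipped away from 0 and 1, a common practice since the inverse normal CDF diverges at the extremes:

import numpy as np
from scipy.stats import norm

def bns_score(tp, fp, pos, neg, eps=0.0005):
    # tp/fp: positive/negative training cases containing the word
    # pos/neg: total positive/negative training cases
    tpr = np.clip(tp / pos, eps, 1 - eps)  # P(word | positive class)
    fpr = np.clip(fp / neg, eps, 1 - eps)  # P(word | negative class)
    return norm.ppf(tpr) - norm.ppf(fpr)   # F^(-1)(tpr) - F^(-1)(fpr)

# A word seen in 3 of 4 positive docs but only 1 of 6 negative docs:
print(bns_score(tp=3, fp=1, pos=4, neg=6))  # ~1.64, a strong positive indicator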

Usage:

Create BNS Vectorizer
from bns import BNS
documents = ['please book flights to mumbai', 'need a flight to goa',
             'airline price for 2 adult', 'plan a trip to goa',
             'book a taxi for me', 'book ola for home',
             'show uber around me', 'nearby gym around me',
             'nearby by temple', 'i want to know nearby cinema hall in mumbai']

categories = ['book_flight', 'book_flight', 'book_flight', 'book_flight',
              'book_taxi', 'book_taxi', 'book_taxi',
              'nearby', 'nearby', 'nearby']

BNS_VECTORIZER = BNS()
BNS_VECTORIZER.fit(documents, categories)
Calculate Cosine similarity
from operator import itemgetter
from sklearn.metrics.pairwise import cosine_similarity

test_documents = ['book me a flight please']
test_bns_vectors = BNS_VECTORIZER.transform(test_documents)

# Let's find the most similar sentence and category for the given test document
results = []
for category in test_bns_vectors.keys():
    vector = test_bns_vectors[category]
    category_trained_sentence_vectors = BNS_VECTORIZER.vectors[category]
    category_trained_sentence = BNS_VECTORIZER.sentences_category_map[category]
    cosine_scores = cosine_similarity(vector, category_trained_sentence_vectors)[0]
    for score, sent in zip(cosine_scores, category_trained_sentence):
        results.append({'match_sentence':sent, 'category': category, 'score':score})

results = sorted(results, key=itemgetter('score'), reverse=True)
for each in results:
    print(each)

The above similarity method might not produce good results since no preprocessing is involved. You can refer to my previous repository to perform the various text preprocessing steps before sending documents to the BNS vectorizer.

link : text preprocessing python
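For instance, a minimal preprocessing pass might look like this sketch (illustrative only; the preprocess helper is hypothetical, and the linked repository covers a fuller pipeline):

import re

def preprocess(text):
    # Hypothetical minimal cleanup: lowercase, strip punctuation, collapse whitespace
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

documents = [preprocess(doc) for doc in documents]  # clean before BNS_VECTORIZER.fit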

There is still a lot of room for improvement in computing similarity for short sentences. Try the above methods and let me know if you have any improvements or suggestions.

Thanks !!
