
amansrivastava17 / bns-short-text-similarity

License: MIT
📖 Use Bi-Normal Separation to build document vectors, which are then used to compute similarity for short sentences.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to bns-short-text-similarity

Naive-Resume-Matching
Text similarity applied to resumes: compares resumes with job descriptions and creates a score to rank them, similar to an ATS.
Stars: ✭ 27 (+12.5%)
Mutual labels:  text-classification, text-similarity, cosine-similarity
text analysis tools
Chinese text analysis toolkit (including text classification, text clustering, text similarity, keyword extraction, key-phrase extraction, sentiment analysis, text error correction, text summarization, topic keywords, synonyms and near-synonyms, and event triple extraction)
Stars: ✭ 410 (+1608.33%)
Mutual labels:  text-classification, text-similarity
Nepali-News-Classifier
Text Classification of Nepali Language Document. This Mini Project was done for the partial fulfillment of NLP Course : COMP 473.
Stars: ✭ 13 (-45.83%)
Mutual labels:  text-classification, tf-idf
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.
Stars: ✭ 81 (+237.5%)
Mutual labels:  text-classification, tf-idf
Content-based-Recommender-System
A content-based recommender system that uses TF-IDF and cosine similarity to find the N most similar items in a dataset.
Stars: ✭ 64 (+166.67%)
Mutual labels:  tf-idf, cosine-similarity
koolsla
Food recommendation tool with Machine learning.
Stars: ✭ 21 (-12.5%)
Mutual labels:  tf-idf, cosine-similarity
text-classification-baseline
Pipeline for quickly building TF-IDF + LogReg text classification baselines.
Stars: ✭ 55 (+129.17%)
Mutual labels:  text-classification, tf-idf
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets also included.
Stars: ✭ 2,112 (+8700%)
Mutual labels:  text-classification, text-similarity
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+254.17%)
Mutual labels:  text-classification, text-similarity
textgo
Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!
Stars: ✭ 33 (+37.5%)
Mutual labels:  text-classification, text-similarity
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+3191.67%)
Mutual labels:  text-classification, tf-idf
Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (+595.83%)
Mutual labels:  text-classification, tf-idf
Chinese text cnn
PyTorch implementation of TextCNN for Chinese text classification and sentiment analysis.
Stars: ✭ 235 (+879.17%)
Mutual labels:  text-classification
Indonesian-Twitter-Emotion-Dataset
Indonesian twitter dataset for emotion classification task
Stars: ✭ 49 (+104.17%)
Mutual labels:  text-classification
Fancy Nlp
NLP for human. A fast and easy-to-use natural language processing (NLP) toolkit, satisfying your imagination about NLP.
Stars: ✭ 233 (+870.83%)
Mutual labels:  text-classification
Pytorch Transformers Classification
Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Stars: ✭ 229 (+854.17%)
Mutual labels:  text-classification
Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Stars: ✭ 68 (+183.33%)
Mutual labels:  text-classification
Document-Classification-using-LSA
Document classification using Latent semantic analysis in python
Stars: ✭ 16 (-33.33%)
Mutual labels:  tf-idf
Catalyst
Accelerated deep learning R&D
Stars: ✭ 2,804 (+11583.33%)
Mutual labels:  text-classification
Paddlenlp
NLP Core Library and Model Zoo based on PaddlePaddle 2.0
Stars: ✭ 212 (+783.33%)
Mutual labels:  text-classification

BNS Vectorizer - Improved TF-IDF for shorter text

Bi-Normal Separation (BNS) is a popular method for scoring the importance of terms with respect to the category they belong to. It efficiently identifies important keywords in a document and assigns them a positive weight, while assigning negative scores to words that are unimportant for that document.

The variables used to calculate the Bi-Normal Separation score of a word for each category (class) are described below.

Why Is BNS Better than TF-IDF?

For short documents, TF-IDF and other term-frequency-based approaches do not perform well, because there are usually no words that occur more than once per document. We therefore need an approach that does not rely on term frequency within the document.

BNS overcomes this problem by assigning a weight to each term based on its occurrence in the positive and negative categories (classes). A term that occurs often in positive samples and seldom in negative ones will get a high BNS weight.

Also, whereas IDF gives a term a single value across all categories, BNS assigns a term a different weight in each category.
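For example, in the short query 'need a flight to goa' (from the usage section below), every word occurs exactly once, so TF is 1 for every term and TF-IDF collapses to plain IDF. IDF then gives 'flight' the same weight no matter which category is being scored, while BNS can give 'flight' a strong positive weight for the book_flight category (high tpr, low fpr) and a low or negative weight for book_taxi and nearby.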

Formula to calculate BNS:

  • pos = number of positive training cases (typically the minority class)
  • neg = number of negative training cases
  • tp = number of positive training cases containing the word
  • fp = number of negative training cases containing the word
  • fn = pos - tp
  • tn = neg - fp
  • tpr (true positive rate) = P(word | positive class) = tp/pos
  • fpr (false positive rate) = P(word | negative class) = fp/neg
  • bns (Bi-Normal Separation) = F^(-1)(tpr) - F^(-1)(fpr), where F^(-1) is the inverse Normal cumulative distribution function
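As a sanity check, here is a minimal sketch of this formula in Python (an illustration, not the repository's implementation), using scipy's norm.ppf as F^(-1). The rates are clipped away from 0 and 1, a common practice since the inverse normal CDF diverges at the extremes:

import numpy as np
from scipy.stats import norm

def bns_score(tp, fp, pos, neg, eps=0.0005):
    # tp/fp: positive/negative training cases containing the word
    # pos/neg: total positive/negative training cases
    tpr = np.clip(tp / pos, eps, 1 - eps)  # P(word | positive class)
    fpr = np.clip(fp / neg, eps, 1 - eps)  # P(word | negative class)
    return norm.ppf(tpr) - norm.ppf(fpr)   # F^(-1)(tpr) - F^(-1)(fpr)

# A word seen in 3 of 4 positive docs but only 1 of 6 negative docs:
print(bns_score(tp=3, fp=1, pos=4, neg=6))  # ~1.64, a strong positive indicator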

Usage:

Create BNS Vectorizer
from bns import BNS
documents = ['please book flights to mumbai', 'need a flight to goa',
             'airline price for 2 adult', 'plan a trip to goa',
             'book a taxi for me', 'book ola for home',
             'show uber around me', 'nearby gym around me',
             'nearby by temple', 'i want to know nearby cinema hall in mumbai']

categories = ['book_flight', 'book_flight', 'book_flight', 'book_flight',
              'book_taxi', 'book_taxi', 'book_taxi',
              'nearby', 'nearby', 'nearby']

BNS_VECTORIZER = BNS()
BNS_VECTORIZER.fit(documents, categories)
Calculate Cosine similarity
from operator import itemgetter
from sklearn.metrics.pairwise import cosine_similarity

test_documents = ['book me a flight please']
test_bns_vectors = BNS_VECTORIZER.transform(test_documents)

# Let's find the most similar sentence and category for the given test document
results = []
for category in test_bns_vectors.keys():
    vector = test_bns_vectors[category]
    category_trained_sentence_vectors = BNS_VECTORIZER.vectors[category]
    category_trained_sentence = BNS_VECTORIZER.sentences_category_map[category]
    cosine_scores = cosine_similarity(vector, category_trained_sentence_vectors)[0]
    for score, sent in zip(cosine_scores, category_trained_sentence):
        results.append({'match_sentence':sent, 'category': category, 'score':score})

results = sorted(results, key=itemgetter('score'), reverse=True)
for each in results:
    print(each)

The above similarity method might not produce good results since no preprocessing is involved. You can refer to my previous repository to perform the various text preprocessing steps before sending documents to the BNS vectorizer.

link : text preprocessing python
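For instance, a minimal preprocessing pass might look like this sketch (illustrative only; the preprocess helper is hypothetical, and the linked repository covers a fuller pipeline):

import re

def preprocess(text):
    # Hypothetical minimal cleanup: lowercase, strip punctuation, collapse whitespace
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

documents = [preprocess(doc) for doc in documents]  # clean before BNS_VECTORIZER.fit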

There is still a lot of room for improvement in computing similarity for short sentences. Try the above methods and let me know if you have any improvements or suggestions.

Thanks !!
