
cjymz886 / Sentence Similarity

License: MIT
Experiments on and comparison of four sentence/text similarity computation methods

Programming Languages

python


Projects that are alternatives of or similar to Sentence Similarity

Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (FastText, Word2Vec)
Stars: ✭ 146 (-19.34%)
Mutual labels:  word2vec
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (-14.92%)
Mutual labels:  word2vec
Wordvectors
Pre-trained word vectors of 30+ languages
Stars: ✭ 2,043 (+1028.73%)
Mutual labels:  word2vec
Fasttext4j
Implementing Facebook's FastText with java
Stars: ✭ 148 (-18.23%)
Mutual labels:  word2vec
Graphwavemachine
A scalable implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets (KDD 2018)".
Stars: ✭ 151 (-16.57%)
Mutual labels:  word2vec
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+6951.38%)
Mutual labels:  word2vec
Nlp research
NLP research: TensorFlow-based NLP deep learning projects supporting four tasks: text classification, sentence matching, sequence labeling, and text generation
Stars: ✭ 141 (-22.1%)
Mutual labels:  word2vec
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (-2.21%)
Mutual labels:  word2vec
Skip Gram Pytorch
A complete pytorch implementation of skip-gram
Stars: ✭ 153 (-15.47%)
Mutual labels:  word2vec
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal events logs
Stars: ✭ 169 (-6.63%)
Mutual labels:  word2vec
Textfeatures
👷‍♂️ A simple package for extracting useful features from character objects 👷‍♀️
Stars: ✭ 148 (-18.23%)
Mutual labels:  word2vec
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (-16.57%)
Mutual labels:  word2vec
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (-12.15%)
Mutual labels:  word2vec
Skip Thoughts.torch
Porting of Skip-Thoughts pretrained models from Theano to PyTorch & Torch7
Stars: ✭ 146 (-19.34%)
Mutual labels:  word2vec
Deep Math Machine Learning.ai
A blog which talks about machine learning, deep learning algorithms and the Math. and Machine learning algorithms written from scratch.
Stars: ✭ 173 (-4.42%)
Mutual labels:  word2vec
Word2vec
Go library for performing computations in word2vec binary models
Stars: ✭ 143 (-20.99%)
Mutual labels:  word2vec
Text2vec
text2vec: Chinese text to vector (a text vectorization toolkit, covering word vectorization, sentence vectorization, and sentence similarity computation)
Stars: ✭ 155 (-14.36%)
Mutual labels:  word2vec
Tensorflow Tutorials
Provides source code for practicing TensorFlow step by step, from the basics through applications
Stars: ✭ 2,096 (+1058.01%)
Mutual labels:  word2vec
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (-3.31%)
Mutual labels:  word2vec
Danmf
A sparsity aware implementation of "Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection" (CIKM 2018).
Stars: ✭ 161 (-11.05%)
Mutual labels:  word2vec

sentence-similarity

Experiments on and comparison of four sentence/text similarity computation methods.
The four methods are: cosine, cosine+idf, bm25, and jaccard.
This experiment again uses the previously crawled medical corpus.

1 Environment

python3
gensim
jieba
scipy
numpy

2 Algorithm Principles

(Formula images for the four algorithms: cosine, cosine+idf, bm25, jaccard)
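The four measures can be sketched roughly as follows. This is a minimal illustration of the ideas, not the repository's similarity.py; the function names and toy inputs are assumptions:

```python
import math

import numpy as np

def cosine_sim(v1, v2):
    # Cosine similarity between two sentence vectors.
    denom = float(np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.dot(v1, v2)) / denom if denom else 0.0

def sentence_vector(tokens, wv, idf=None):
    # Average of the word vectors of a sentence's tokens; when an idf mapping
    # is given, each vector is weighted by its word's idf (the cosine+idf variant).
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(len(next(iter(wv.values()))))
    weights = [idf.get(t, 1.0) for t in tokens if t in wv] if idf else None
    return np.average(np.array(vecs), axis=0, weights=weights)

def jaccard_sim(tokens1, tokens2):
    # Jaccard: set overlap of the tokens; simple and fast, but no semantics.
    s1, s2 = set(tokens1), set(tokens2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def bm25_score(query_tokens, doc_tokens, df, n_docs, avg_len, k1=1.5, b=0.75):
    # Okapi BM25 of a query against one document. df maps word -> document
    # frequency in the corpus; k1 and b are the tuning factors discussed below.
    score = 0.0
    for t in set(query_tokens):
        tf = doc_tokens.count(t)
        if tf == 0:
            continue
        idf = math.log((n_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_len))
    return score
```

Given a trained word-vector mapping `wv` and an idf mapping, `cosine_sim(sentence_vector(a, wv), sentence_vector(b, wv))` gives the cosine score, and passing `idf=` to `sentence_vector` gives the cosine+idf score.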

3 Running Steps

step 1: Train word vectors on ./data/file_corpus with word2vec (python train_word2vec.py), producing the word-vector file voc.txt
step 2: For the trained words, compute their idf over the corpus (python compute_idf.py), producing idf.txt
step 3: Collect the sentences present in the corpus (python get_sentence.py), producing file_sentece.txt; to keep the computation manageable, this experiment only takes the 10,000 most frequent sentences
step 4: Run python test.py to get, for the five preset sentences, the most similar results under each algorithm
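Step 2 (the idf computation) can be sketched on a toy corpus as follows. The exact formula used by compute_idf.py is not shown in this README, so the smoothed idf variant below is an assumption:

```python
import math
from collections import Counter

# Toy stand-in for the tokenized corpus (./data/file_corpus after jieba segmentation).
docs = [["感冒", "吃", "什么", "药"], ["发烧", "怎么", "办"], ["感冒", "发烧"]]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
df = Counter(w for d in docs for w in set(d))
# Smoothed idf: rarer words get larger weights.
idf = {w: math.log(n_docs / (1 + df[w])) + 1 for w in df}

# "感冒" appears in 2 of 3 documents, so it gets a lower idf than "药".
```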

Notes: ./data/medfw.txt is the word file that my previous project, find-Chinese-medcial-words, extracted from the same corpus; here it serves as the user dictionary for jieba segmentation. similarity.py implements the four algorithms and can be called directly; to use it in a different setting, only the word vectors and the word idf matrix need to be retrained. ./data/test_result.txt holds the test results of this experiment.

4 Test Results

The tables below show the results of testing the same five sentences. From the results: cosine+idf has the highest computational cost, but in my view it is the most accurate of the four methods; bm25 drifts somewhat on certain sentences, which I suspect relates to the tuning factors k1 and b; jaccard is the simplest and fastest, but its results carry no semantics; cosine is the most commonly used method, and its results depend heavily on the quality of the word2vec training.
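To make the role of the k1 and b factors concrete, here is the term-frequency component of the standard Okapi BM25 formula; whether similarity.py uses exactly this variant is an assumption:

```python
def bm25_tf_weight(tf, doc_len, avg_len, k1=1.5, b=0.75):
    # k1 controls how quickly repeated terms saturate (the weight is capped
    # at k1 + 1); b controls how strongly documents longer than avg_len are
    # penalized (b = 0 disables length normalization entirely).
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

A larger k1 lets repeated query terms keep adding weight, while a smaller k1 saturates them quickly, which is one way poor choices of k1 and b can make BM25 rankings drift.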

(Images of the result tables)
