
cjymz886 / Sentence Similarity

License: MIT
Experiments on and comparison of four sentence/text similarity computation methods

Programming Languages

python


Projects that are alternatives of or similar to Sentence Similarity

Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (FastText, Word2Vec)
Stars: ✭ 146 (-19.34%)
Mutual labels:  word2vec
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (-14.92%)
Mutual labels:  word2vec
Wordvectors
Pre-trained word vectors of 30+ languages
Stars: ✭ 2,043 (+1028.73%)
Mutual labels:  word2vec
Fasttext4j
Implementing Facebook's FastText with java
Stars: ✭ 148 (-18.23%)
Mutual labels:  word2vec
Graphwavemachine
A scalable implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets (KDD 2018)".
Stars: ✭ 151 (-16.57%)
Mutual labels:  word2vec
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+6951.38%)
Mutual labels:  word2vec
Nlp research
NLP research: TensorFlow-based NLP deep learning projects supporting four tasks: text classification, sentence matching, sequence labeling, and text generation
Stars: ✭ 141 (-22.1%)
Mutual labels:  word2vec
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (-2.21%)
Mutual labels:  word2vec
Skip Gram Pytorch
A complete pytorch implementation of skip-gram
Stars: ✭ 153 (-15.47%)
Mutual labels:  word2vec
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal events logs
Stars: ✭ 169 (-6.63%)
Mutual labels:  word2vec
Textfeatures
👷‍♂️ A simple package for extracting useful features from character objects 👷‍♀️
Stars: ✭ 148 (-18.23%)
Mutual labels:  word2vec
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (-16.57%)
Mutual labels:  word2vec
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (-12.15%)
Mutual labels:  word2vec
Skip Thoughts.torch
Porting of Skip-Thoughts pretrained models from Theano to PyTorch & Torch7
Stars: ✭ 146 (-19.34%)
Mutual labels:  word2vec
Deep Math Machine Learning.ai
A blog which talks about machine learning, deep learning algorithms and the Math. and Machine learning algorithms written from scratch.
Stars: ✭ 173 (-4.42%)
Mutual labels:  word2vec
Word2vec
Go library for performing computations in word2vec binary models
Stars: ✭ 143 (-20.99%)
Mutual labels:  word2vec
Text2vec
text2vec: Chinese text to vector (a text vectorization toolkit, covering word vectorization, sentence vectorization, and sentence similarity computation)
Stars: ✭ 155 (-14.36%)
Mutual labels:  word2vec
Tensorflow Tutorials
Provides source code for practicing TensorFlow step by step, from the basics through applications
Stars: ✭ 2,096 (+1058.01%)
Mutual labels:  word2vec
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (-3.31%)
Mutual labels:  word2vec
Danmf
A sparsity aware implementation of "Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection" (CIKM 2018).
Stars: ✭ 161 (-11.05%)
Mutual labels:  word2vec

sentence-similarity

Experiments on and comparison of four sentence/text similarity computation methods.
The four methods are: cosine, cosine+idf, bm25, and jaccard.
This experiment again uses the previously crawled medical corpus.

1 Environment

python3
gensim
jieba
scipy
numpy

2 Algorithm Principles

(Formula images for the four algorithms: cosine, cosine+idf, bm25, jaccard)
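The four measures can be sketched roughly as follows. This is a minimal illustration of the ideas, not the repository's similarity.py; the function names and toy inputs are assumptions:

```python
import math

import numpy as np

def cosine_sim(v1, v2):
    # Cosine similarity between two sentence vectors.
    denom = float(np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.dot(v1, v2)) / denom if denom else 0.0

def sentence_vector(tokens, wv, idf=None):
    # Average of the word vectors of a sentence's tokens; when an idf mapping
    # is given, each vector is weighted by its word's idf (the cosine+idf variant).
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(len(next(iter(wv.values()))))
    weights = [idf.get(t, 1.0) for t in tokens if t in wv] if idf else None
    return np.average(np.array(vecs), axis=0, weights=weights)

def jaccard_sim(tokens1, tokens2):
    # Jaccard: set overlap of the tokens; simple and fast, but no semantics.
    s1, s2 = set(tokens1), set(tokens2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def bm25_score(query_tokens, doc_tokens, df, n_docs, avg_len, k1=1.5, b=0.75):
    # Okapi BM25 of a query against one document. df maps word -> document
    # frequency in the corpus; k1 and b are the tuning factors discussed below.
    score = 0.0
    for t in set(query_tokens):
        tf = doc_tokens.count(t)
        if tf == 0:
            continue
        idf = math.log((n_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_len))
    return score
```

Given a trained word-vector mapping `wv` and an idf mapping, `cosine_sim(sentence_vector(a, wv), sentence_vector(b, wv))` gives the cosine score, and passing `idf=` to `sentence_vector` gives the cosine+idf score.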

3 Running Steps

step 1: Train word vectors on ./data/file_corpus with word2vec (python train_word2vec.py), producing the word-vector file voc.txt
step 2: For the trained words, compute their idf over the corpus (python compute_idf.py), producing idf.txt
step 3: Collect the sentences present in the corpus (python get_sentence.py), producing file_sentece.txt; to keep the computation manageable, this experiment only takes the 10,000 most frequent sentences
step 4: Run python test.py to get, for the five preset sentences, the most similar results under each algorithm
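Step 2 (the idf computation) can be sketched on a toy corpus as follows. The exact formula used by compute_idf.py is not shown in this README, so the smoothed idf variant below is an assumption:

```python
import math
from collections import Counter

# Toy stand-in for the tokenized corpus (./data/file_corpus after jieba segmentation).
docs = [["感冒", "吃", "什么", "药"], ["发烧", "怎么", "办"], ["感冒", "发烧"]]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
df = Counter(w for d in docs for w in set(d))
# Smoothed idf: rarer words get larger weights.
idf = {w: math.log(n_docs / (1 + df[w])) + 1 for w in df}

# "感冒" appears in 2 of 3 documents, so it gets a lower idf than "药".
```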

Notes: ./data/medfw.txt is the word file that my previous project, find-Chinese-medcial-words, extracted from the same corpus; here it serves as the user dictionary for jieba segmentation. similarity.py implements the four algorithms and can be called directly; to use it in a different setting, only the word vectors and the word idf matrix need to be retrained. ./data/test_result.txt holds the test results of this experiment.

4 Test Results

The tables below show the results of testing the same five sentences. From the results: cosine+idf has the highest computational cost, but in my view it is the most accurate of the four methods; bm25 drifts somewhat on certain sentences, which I suspect relates to the tuning factors k1 and b; jaccard is the simplest and fastest, but its results carry no semantics; cosine is the most commonly used method, and its results depend heavily on the quality of the word2vec training.
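To make the role of the k1 and b factors concrete, here is the term-frequency component of the standard Okapi BM25 formula; whether similarity.py uses exactly this variant is an assumption:

```python
def bm25_tf_weight(tf, doc_len, avg_len, k1=1.5, b=0.75):
    # k1 controls how quickly repeated terms saturate (the weight is capped
    # at k1 + 1); b controls how strongly documents longer than avg_len are
    # penalized (b = 0 disables length normalization entirely).
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

A larger k1 lets repeated query terms keep adding weight, while a smaller k1 saturates them quickly, which is one way poor choices of k1 and b can make BM25 rankings drift.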

(Images of the result tables)
