shibing624 / Similarity

Licence: apache-2.0
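similarity: a similarity computation toolkit written in Java. It covers similarity calculations for words, phrases, and sentences, along with lexical analysis, sentiment analysis, and semantic analysis.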

Projects that are alternatives of or similar to Similarity

BertSimilarity
Computing the similarity of two sentences with Google's BERT algorithm. Semantic similarity and text similarity computation.
Stars: ✭ 348 (-54.21%)
Mutual labels:  semantic, similarity
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+48.95%)
Mutual labels:  sentiment, semantic
Native Css
Convert pure CSS to React Style or javascript literal objects.
Stars: ✭ 322 (-57.63%)
Mutual labels:  semantic
Recordlinkage
A toolkit for record linkage and duplicate detection in Python
Stars: ✭ 532 (-30%)
Mutual labels:  similarity
Semantic suma
SuMa++: Efficient LiDAR-based Semantic SLAM (Chen et al IROS 2019)
Stars: ✭ 431 (-43.29%)
Mutual labels:  semantic
Troll
Language sentiment analysis and neural networks... for trolls.
Stars: ✭ 330 (-56.58%)
Mutual labels:  sentiment
Final word similarity
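A word similarity method that combines the extended Tongyici Cilin (synonym thesaurus) with HowNet, giving broader vocabulary coverage and more accurate results.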
Stars: ✭ 485 (-36.18%)
Mutual labels:  similarity
Macropodus
Macropodus, a natural language processing toolkit built on an Albert+BiLSTM+CRF deep learning architecture: Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), new word discovery, keyword extraction, text summarization, text similarity, a scientific calculator, conversion between Chinese numerals and Arabic (or Roman) numerals, traditional/simplified Chinese conversion, and pinyin conversion.
Stars: ✭ 309 (-59.34%)
Mutual labels:  similarity
Dssim
Image similarity comparison simulating human perception (multiscale SSIM in Rust)
Stars: ✭ 668 (-12.11%)
Mutual labels:  similarity
Svg Screenshots
📸🧩 Browser extension to take scalable, semantic, accessible screenshots of websites in SVG format.
Stars: ✭ 404 (-46.84%)
Mutual labels:  semantic
Standard Version
🏆 Automate versioning and CHANGELOG generation, with semver.org and conventionalcommits.org
Stars: ✭ 5,806 (+663.95%)
Mutual labels:  semantic
Open Semantic Search
Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
Stars: ✭ 386 (-49.21%)
Mutual labels:  semantic
Shiny.semantic
Shiny support for powerful Semantic UI library.
Stars: ✭ 345 (-54.61%)
Mutual labels:  semantic
Multi Human Parsing
🔥🔥Official Repository for Multi-Human-Parsing (MHP)🔥🔥
Stars: ✭ 507 (-33.29%)
Mutual labels:  semantic
Sentimentr
Dictionary based sentiment analysis that considers valence shifters
Stars: ✭ 325 (-57.24%)
Mutual labels:  sentiment
Python String Similarity
A library implementing different string similarity and distance measures using Python.
Stars: ✭ 546 (-28.16%)
Mutual labels:  similarity
Bcdu Net
BCDU-Net : Medical Image Segmentation
Stars: ✭ 314 (-58.68%)
Mutual labels:  semantic
Schema Generator
PHP Model Scaffolding from Schema.org and other RDF vocabularies
Stars: ✭ 379 (-50.13%)
Mutual labels:  semantic
Lidar Bonnetal
Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
Stars: ✭ 465 (-38.82%)
Mutual labels:  semantic
Visual slam related research
Tracking research related to visual (semantic) SLAM
Stars: ✭ 708 (-6.84%)
Mutual labels:  semantic

similarity

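Similarity calculations for words, phrases, and sentences, along with lexical analysis, sentiment analysis, and semantic analysis.

similarity is a Java similarity computation toolkit made up of a collection of algorithms. Its goal is to spread the similarity computation methods used in natural language processing. similarity aims to be practical, efficient, clearly architected, built on up-to-date corpora, and customizable.

similarity provides the following features:

  • Word similarity
      • Cilin (synonym thesaurus) encoding similarity
      • Chinese semantic similarity
      • HowNet word similarity
      • Character-level edit distance
  • Phrase similarity
      • Simple phrase similarity
  • Sentence similarity
      • Combined part-of-speech and word-order method
      • Edit distance
      • Gregor edit distance
      • Optimized edit distance
  • Text similarity
      • Cosine similarity
      • Edit distance
      • Euclidean distance
      • Jaccard similarity coefficient
      • Jaro distance
      • Jaro–Winkler distance
      • Manhattan distance
      • SimHash + Hamming distance
      • Sørensen–Dice coefficient
  • Lexical analysis
      • xmnlp Chinese word segmentation
      • Segmentation with part-of-speech tagging
      • Word frequency statistics
  • HowNet sememes
      • Sememe tree
  • Sentiment analysis
      • Degree of positive tendency
      • Degree of negative tendency
      • Sentiment polarity
  • Similar words
      • word2vec

While offering a rich set of features, similarity keeps its internal modules loosely coupled, loads models lazily, and ships its dictionaries in plain text. This makes it easy to use and helps users train on their own corpora.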
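
To make two of the set-based measures in the feature list concrete, here is a minimal sketch of the Jaccard coefficient and the Sørensen–Dice coefficient computed over the distinct characters of two strings. It is an illustration only, not the toolkit's implementation: the class name `SetSimilaritySketch` and the choice of characters (rather than segmented words) as set elements are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the similarity toolkit.
final class SetSimilaritySketch {

    // Distinct Unicode code points of a string (assumption: characters, not words, are the set elements).
    static Set<Integer> codePoints(String s) {
        return s.codePoints().boxed().collect(Collectors.toSet());
    }

    static int intersectionSize(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Jaccard = |A ∩ B| / |A ∪ B|
    static double jaccard(String x, String y) {
        Set<Integer> a = codePoints(x);
        Set<Integer> b = codePoints(y);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        int inter = intersectionSize(a, b);
        return (double) inter / (a.size() + b.size() - inter);
    }

    // Dice = 2|A ∩ B| / (|A| + |B|)
    static double dice(String x, String y) {
        Set<Integer> a = codePoints(x);
        Set<Integer> b = codePoints(y);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        System.out.println(jaccard("我爱购物", "我爱读书"));
        System.out.println(dice("我爱购物", "我爱读书"));
    }
}
```

Both measures share the same intersection count and differ only in the denominator, so Dice is never smaller than Jaccard for the same pair.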


demo

https://www.borntowin.cn/product/word_emb_sim


Todo

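Text similarity measures

  • [x] Keyword matching (TF-IDF, BM25)
  • [x] Shallow semantic matching (WordEmbed latent semantic model: sentence vectors built by directly summing word2vec or GloVe word vectors)
  • [ ] Deep semantic matching models (DSSM, CLSM, DeepMatch, MatchingFeatures, ARC-II, DeepMind; see the corresponding entries under Reference below)

Contributions of code and ideas to improve this project are welcome.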
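
As a rough illustration of the keyword-matching item in the list above, the following sketch scores a pre-tokenized document against a query with BM25. It is not taken from the toolkit: the class name `Bm25Sketch`, the parameter values `K1 = 1.2` and `B = 0.75` (common defaults), and the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` are all assumptions of this example, and tokenization is assumed to have happened elsewhere.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical BM25 sketch, not part of the similarity toolkit.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // length normalization strength (assumed default)

    // Scores one tokenized document against a tokenized query, using corpus-wide statistics.
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double avgLen = corpus.stream().mapToInt(List::size).average().orElse(0.0);
        double total = 0.0;
        for (String term : new HashSet<>(query)) {
            int tf = Collections.frequency(doc, term);
            if (tf == 0) {
                continue; // the term does not occur in this document
            }
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log(1.0 + (corpus.size() - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
            double lengthNorm = 1.0 - B + B * doc.size() / avgLen;
            total += idf * tf * (K1 + 1.0) / (tf + K1 * lengthNorm);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("我", "爱", "购物"),
                Arrays.asList("我", "爱", "读书"),
                Arrays.asList("他", "是", "黑客"));
        List<String> query = Arrays.asList("我", "购物");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(query, doc, corpus));
        }
    }
}
```

The sketch assumes the scored document belongs to a non-empty corpus; a production version would precompute document frequencies instead of rescanning the corpus per term.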


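jar package

Download one of the jars and place it under your project's Libraries so that it is added to the project's dependencies.

Uploading to the official Maven repository requires review, which takes considerable time, so an offline jar is provided for now. You can switch to the official Maven repository later.
  • Official Maven repository (not yet uploaded, currently unavailable)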
<dependency>
  <groupId>io.github.shibing624</groupId>
  <artifactId>similarity</artifactId>
  <version>1.1.3</version>
</dependency>

import


import org.xm.Similarity;
import org.xm.tendency.word.HownetWordTendency;

public class Demo {
    public static void main(String[] args) {
        // Cilin (synonym thesaurus) similarity between two words
        double result = Similarity.cilinSimilarity("电动车", "自行车");
        System.out.println(result);

        // HowNet-based sentiment tendency of a single word
        String word = "混蛋";
        HownetWordTendency hownetWordTendency = new HownetWordTendency();
        result = hownetWordTendency.getTendency(word);
        System.out.println(word + "  sentiment tendency value: " + result);
    }
}


Usage

word similarity

public static void main(String[] args) {
    String word1 = "教师";
    String word2 = "教授";
    double cilinSimilarityResult = Similarity.cilinSimilarity(word1, word2);
    double pinyinSimilarityResult = Similarity.pinyinSimilarity(word1, word2);
    double conceptSimilarityResult = Similarity.conceptSimilarity(word1, word2);
    double charBasedSimilarityResult = Similarity.charBasedSimilarity(word1, word2);

    System.out.println(word1 + " vs " + word2 + " Cilin similarity: " + cilinSimilarityResult);
    System.out.println(word1 + " vs " + word2 + " pinyin similarity: " + pinyinSimilarityResult);
    System.out.println(word1 + " vs " + word2 + " concept similarity: " + conceptSimilarityResult);
    System.out.println(word1 + " vs " + word2 + " character-based similarity: " + charBasedSimilarityResult);
}
    

demo code position: test/java/org.xm/WordSimilarityDemo.java

  • result: [image: word_sim result]

phrase similarity

public static void main(String[] args) {
    String phrase1 = "继续努力";
    String phrase2 = "持续发展";
    double result = Similarity.phraseSimilarity(phrase1, phrase2);

    System.out.println(phrase1 + " vs " + phrase2 + " phrase similarity: " + result);
}

demo code position: test/java/org.xm/PhraseSimilarityDemo.java

  • result: [image: phrase sim result]

sentence similarity

public static void main(String[] args) {
    String sentence1 = "中国人爱吃鱼";
    String sentence2 = "湖北佬最喜吃鱼";

    double morphoSimilarityResult = Similarity.morphoSimilarity(sentence1, sentence2);
    double editDistanceResult = Similarity.editDistanceSimilarity(sentence1, sentence2);
    double standEditDistanceResult = Similarity.standardEditDistanceSimilarity(sentence1, sentence2);
    double gregorEditDistanceResult = Similarity.gregorEditDistanceSimilarity(sentence1, sentence2);

    System.out.println(sentence1 + " vs " + sentence2 + " morphology and word-order sentence similarity: " + morphoSimilarityResult);
    System.out.println(sentence1 + " vs " + sentence2 + " optimized edit-distance sentence similarity: " + editDistanceResult);
    System.out.println(sentence1 + " vs " + sentence2 + " standard edit-distance sentence similarity: " + standEditDistanceResult);
    System.out.println(sentence1 + " vs " + sentence2 + " Gregor edit-distance sentence similarity: " + gregorEditDistanceResult);
}

demo code position: test/java/org.xm/SentenceSimilarityDemo.java

  • result: [image: sentence sim result]
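
The calls above return edit-distance-based scores without showing how such a score can be derived. As a minimal sketch, assuming the standard Levenshtein distance and the normalization `1 - distance / max(length)`, an edit-distance similarity could look like the following. The class `EditDistanceSketch` is hypothetical, and the toolkit's own formulas (including its optimized and Gregor variants) may differ.

```java
// Hypothetical sketch of a normalized Levenshtein similarity, not the toolkit's code.
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance over Unicode code points,
    // keeping only two rows of the table.
    static int levenshtein(String s, String t) {
        int[] a = s.codePoints().toArray();
        int[] b = t.codePoints().toArray();
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static double similarity(String s, String t) {
        int maxLen = Math.max(s.codePointCount(0, s.length()), t.codePointCount(0, t.length()));
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(s, t) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("中国人爱吃鱼", "湖北佬最喜吃鱼"));
    }
}
```

For the two example sentences, only the shared suffix 吃鱼 matches, so the distance is 5 over a maximum length of 7.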

text similarity

@Test
public void getSimilarityScore() throws Exception {
    String text1 = "我爱购物";
    String text2 = "我爱读书";
    String text3 = "他是黑客";
    TextSimilarity similarity = new CosineSimilarity();
    double score1pk2 = similarity.getSimilarity(text1, text2);
    double score1pk3 = similarity.getSimilarity(text1, text3);
    double score2pk2 = similarity.getSimilarity(text2, text2);
    double score2pk3 = similarity.getSimilarity(text2, text3);
    double score3pk3 = similarity.getSimilarity(text3, text3);
    System.out.println(text1 + " and " + text2 + " similarity score: " + score1pk2);
    System.out.println(text1 + " and " + text3 + " similarity score: " + score1pk3);
    System.out.println(text2 + " and " + text2 + " similarity score: " + score2pk2);
    System.out.println(text2 + " and " + text3 + " similarity score: " + score2pk3);
    System.out.println(text3 + " and " + text3 + " similarity score: " + score3pk3);
}

demo code position: test/java/org.xm/similarity/text/CosineSimilarityTest.java

  • result: [image: cos text result]
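
For readers who want to see what a cosine score measures, here is a small self-contained sketch that builds character-count vectors for two texts and computes the cosine of the angle between them. It does not reproduce the toolkit's `CosineSimilarity` class: the name `CosineSketch` is hypothetical, and using single characters as vector dimensions (rather than segmented, possibly weighted words) is an assumption made to keep the example dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical character-level cosine similarity, not the toolkit's implementation.
final class CosineSketch {

    // Maps each code point in the text to its number of occurrences.
    static Map<Integer, Integer> counts(String text) {
        Map<Integer, Integer> m = new HashMap<>();
        text.codePoints().forEach(cp -> m.merge(cp, 1, Integer::sum));
        return m;
    }

    // cosine(a, b) = (a · b) / (|a| * |b|)
    static double cosine(String x, String y) {
        Map<Integer, Integer> a = counts(x);
        Map<Integer, Integer> b = counts(y);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty text has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine("我爱购物", "我爱读书"));
        System.out.println(cosine("我爱购物", "他是黑客"));
    }
}
```

Texts that share no characters score 0, and a text compared with itself scores 1, which matches the pattern of comparisons made in the test above.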

word frequency statistics

demo code position: test/java/org.xm/tokenizer/WordFreqStatisticsTest.java

  • result: [image: word freq result]

Word segmentation and part-of-speech tagging call HanLP internally. You can also use ansj_seg, the segmentation tool from our NLPchina group.
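
Since no code is shown for this section, here is a minimal sketch of word frequency counting over tokens that have already been segmented (for example by HanLP). The class `WordFreqSketch` is hypothetical and does not call HanLP or the toolkit; it only illustrates the counting and ordering step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical word-frequency helper; assumes the input is already segmented into tokens.
final class WordFreqSketch {

    // Returns word -> count, ordered by descending count, then by word for a stable order.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("我", "爱", "购物", "我", "爱", "读书", "我");
        System.out.println(count(tokens));
    }
}
```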

sentiment analysis based on words

@Test
public void getTendency() throws Exception {
    HownetWordTendency hownet = new HownetWordTendency();
    String word = "美好";
    double sim = hownet.getTendency(word);
    System.out.println(word + ":" + sim);
    System.out.println("混蛋:" + hownet.getTendency("混蛋"));
}

demo code position: test/java/org.xm/tendency.word/HownetWordTendencyTest.java

  • result: [image: tendency result]

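This example performs word-level sentiment polarity analysis based on the sememe tree. For text-level sentiment analysis, see text-classifier, which uses deep neural network models and SVM classification and achieves better results.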

homoionym(use word2vec)

@Test
public void testHomoionym() throws Exception {
    List<String> result = Word2vec.getHomoionym(RAW_CORPUS_SPLIT_MODEL, "武功", 10);
    System.out.println("武功 similar words: " + result);
}

@Test
public void testHomoionymName() throws Exception {
    String model = RAW_CORPUS_SPLIT_MODEL;
    List<String> result = Word2vec.getHomoionym(model, "乔帮主", 10);
    System.out.println("乔帮主 similar words: " + result);

    List<String> result2 = Word2vec.getHomoionym(model, "阿朱", 10);
    System.out.println("阿朱 similar words: " + result2);

    List<String> result3 = Word2vec.getHomoionym(model, "少林寺", 10);
    System.out.println("少林寺 similar words: " + result3);
}
    

demo code position: test/java/org.xm/word2vec/Word2vecTest.java

  • train: [image: word2vec train]

  • result: [image: word2vec result]

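The word vectors are trained with Word2VEC_java, a Java word2vec training tool implemented by 阿健 (A Jian). The training corpus is the novel 天龙八部 (Demi-Gods and Semi-Devils), and similar words are obtained from the resulting word vectors. Users can train on their own corpus, or train general-purpose word vectors on Chinese Wikipedia.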
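
How similar words fall out of word vectors can be sketched as a nearest-neighbor lookup: rank every other word by the cosine similarity of its vector to the query word's vector and keep the top N. The code below is a hedged illustration of that idea and not the toolkit's `Word2vec.getHomoionym` implementation. The class `NearestWordsSketch` is hypothetical, and the two-dimensional vectors in `main` are made-up toy values, not output of a trained model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical nearest-neighbor lookup over word vectors, not the toolkit's code.
final class NearestWordsSketch {

    // Cosine similarity of two dense vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns up to topN words whose vectors are closest to the vector of the given word.
    static List<String> nearest(Map<String, float[]> vectors, String word, int topN) {
        float[] target = vectors.get(word);
        if (target == null) {
            return new ArrayList<>(); // unknown word: no neighbors
        }
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(word);
        candidates.sort(
                Comparator.comparingDouble((String w) -> cosine(target, vectors.get(w))).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(topN, candidates.size())));
    }

    public static void main(String[] args) {
        // Toy vectors invented for illustration only.
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("武功", new float[] {1.0f, 0.1f});
        vectors.put("少林寺", new float[] {0.9f, 0.2f});
        vectors.put("阿朱", new float[] {0.1f, 1.0f});
        vectors.put("乔帮主", new float[] {0.2f, 0.9f});
        System.out.println(nearest(vectors, "阿朱", 2));
    }
}
```

A full sort is fine for a toy vocabulary; for a large model, a bounded priority queue or an approximate nearest-neighbor index would avoid ranking every word.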

Reference

  • [DSSM] Po-Sen Huang, et al., 2013, Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
  • [CLSM] Yelong Shen, et al., 2014, A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval
  • [DeepMatch] Zhengdong Lu & Hang Li, 2013, A Deep Architecture for Matching Short Texts
  • [MatchingFeatures] Zongcheng Ji, et al., 2014, An Information Retrieval Approach to Short Text Conversation
  • [ARC-II] Baotian Hu, et al., 2015, Convolutional Neural Network Architectures for Matching Natural Language Sentences
  • [DeepMind] Aliaksei Severyn, et al., 2015, Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks