matiskay / Html Similarity

Licence: bsd-3-clause
Compare html similarity using structural and style metrics


===============
HTML Similarity
===============

.. image:: https://travis-ci.org/matiskay/html-similarity.svg?branch=master
   :target: https://travis-ci.org/matiskay/html-similarity

.. image:: https://codebeat.co/badges/304915eb-48a3-46a8-9ce9-2790c82dc2b8
   :target: https://codebeat.co/projects/github-com-matiskay-html-similarity-master

This package provides a set of functions to measure the similarity between web pages.

Install
=======

The quick way::

    pip install html-similarity

How does it work?
=================

Structural Similarity
---------------------

Uses sequence comparison of the HTML tags to compute the similarity.

We do not implement the similarity based on tree edit distance because it is slower than sequence comparison.
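The idea can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: ``TagCollector``, ``tag_sequence`` and ``structural_similarity_sketch`` are hypothetical names, and the choice of ``html.parser`` and ``difflib.SequenceMatcher`` is an assumption made to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html_a, html_b):
    """Compare the two tag sequences; the ratio lies between 0 and 1."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()
```

Because a different parser may wrap fragments in extra elements, the values this sketch produces need not match the ones returned by the package.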

Style Similarity
----------------

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
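A minimal sketch of this computation, again using only the standard library. The names ``ClassCollector``, ``css_classes``, ``jaccard`` and ``style_similarity_sketch`` are hypothetical, and the handling of two documents without any classes is an assumption, not something this README specifies.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    if not set_a and not set_b:
        # Assumption: two documents with no classes are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def style_similarity_sketch(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))
```

For instance, the class sets ``{"a", "b"}`` and ``{"b"}`` share one element out of two in total, so their Jaccard similarity is 0.5.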

Joint Similarity (Structural Similarity and Style Similarity)
-------------------------------------------------------------

The joint similarity metric is calculated as::

    k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
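The weighting can be written as a small function. This is a sketch of the formula only: ``joint_similarity`` is a hypothetical name, it takes the two scores as plain numbers, and both the default ``k=0.5`` and the range check on ``k`` are assumptions.

```python
def joint_similarity(structural, style, k=0.5):
    """Weighted mix of two scores: k weights structure, (1 - k) weights style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# An equal weighting and a style-heavy weighting of the same two scores.
print(joint_similarity(0.9090909090909091, 1.0))
print(joint_similarity(0.9090909090909091, 1.0, k=0.3))
```

Since both inputs lie between 0 and 1 and the weights sum to 1, the result also stays between 0 and 1.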

Recommendations for joint similarity
------------------------------------

Using ``k=0.3`` gives us better results: the style similarity carries more information about how similar two pages are than the structural similarity does.

Examples
========

Here is an example::

    In [1]: html_1 = '''
    <h1 class="title">First Document</h1>
    <ul class="menu">
        <li class="active">Documents</li>
        <li>Extra</li>
    </ul>
    '''

    In [2]: html_2 = '''
    <h1 class="title">Second document Document</h1>
    <ul class="menu">
        <li class="active">Extra Documents</li>
    </ul>
    '''

    In [3]: from html_similarity import style_similarity, structural_similarity, similarity

    In [4]: style_similarity(html_1, html_2)
    Out[4]: 1.0

    In [7]: structural_similarity(html_1, html_2)
    Out[7]: 0.9090909090909091

    In [8]: similarity(html_1, html_2)
    Out[8]: 0.9545454545454546

References
==========

* The idea of sequence comparison was taken from `Page Compare <https://github.com/TeamHG-Memex/page-compare>`_.
* The other ideas were taken from T. Gowda and C. A. Mattmann, `Clustering Web Pages Based on Structure and Style Similarity <http://ieeexplore.ieee.org/document/7785739/>`_, 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 175-180.
* Use case: `Clustering web pages based on structure and style similarity <https://www.slideshare.net/thammegowda/ieee-iri-16-clustering-web-pages-based-on-structure-and-style-similarity?qid=7deea5f8-157d-4e57-a413-16ec7c6a22d9&v=&b=&from_search=1>`_