scrapinghub / Page_clustering

Licence: other

A simple algorithm for clustering web pages, suitable for crawlers

Labels

html data-science

Projects that are alternatives to or similar to Page clustering

Awesome Google Colab

Google Colaboratory Notebooks and Repositories (by @firmai)

Stars: ✭ 863 (+2776.67%)

Mutual labels: data-science

Ethereumdb

Stars: ✭ 21 (-30%)

Mutual labels: data-science

Rebate

Relief Based Algorithms of ReBATE implemented in Python with Cython optimization. This repository is no longer being updated. Please see scikit-rebate.

Stars: ✭ 29 (-3.33%)

Mutual labels: data-science

Pydata.kr

The official PyData Korea website. (Under construction)

Stars: ✭ 13 (-56.67%)

Mutual labels: data-science

Clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

Stars: ✭ 887 (+2856.67%)

Mutual labels: data-science

Intro Python

Python for Statistics and Data Science -- Syntax, Data Wrangling, Graphs, Programming, Machine Learning

Stars: ✭ 21 (-30%)

Mutual labels: data-science

Dataflowjavasdk

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

Stars: ✭ 854 (+2746.67%)

Mutual labels: data-science

Arcgis Python Api

Documentation and samples for ArcGIS API for Python

Stars: ✭ 954 (+3080%)

Mutual labels: data-science

Crime Analysis

Association Rule Mining from Spatial Data for Crime Analysis

Stars: ✭ 20 (-33.33%)

Mutual labels: data-science

Mlnet Workshop

ML.NET Workshop to predict car sales prices

Stars: ✭ 29 (-3.33%)

Mutual labels: data-science

Bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.

Stars: ✭ 877 (+2823.33%)

Mutual labels: data-science

Pydataset

Instant access to many datasets in Python.

Stars: ✭ 880 (+2833.33%)

Mutual labels: data-science

Machine Learning Open Source

Monthly Series - Machine Learning Top 10 Open Source Projects

Stars: ✭ 943 (+3043.33%)

Mutual labels: data-science

Data Science On Gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

Stars: ✭ 864 (+2780%)

Mutual labels: data-science

Python for ml

brief introduction to Python for machine learning

Stars: ✭ 29 (-3.33%)

Mutual labels: data-science

Scanpy

Single-Cell Analysis in Python. Scales to >1M cells.

Stars: ✭ 858 (+2760%)

Mutual labels: data-science

Steppy Toolkit

Curated set of transformers that make your work with steppy faster and more effective 🔭

Stars: ✭ 21 (-30%)

Mutual labels: data-science

Docker Iocaml Datascience

Dockerfile of Jupyter (IPython notebook) and IOCaml (OCaml kernel) with libraries for data science and machine learning

Stars: ✭ 30 (+0%)

Mutual labels: data-science

Wolfram Coronavirus

Wolfram Language code and notebooks related to the coronavirus outbreak

Stars: ✭ 30 (+0%)

Mutual labels: data-science

Workshop

Weekly research group seminar

Stars: ✭ 28 (-6.67%)

Mutual labels: data-science

Description

A simple algorithm for clustering web pages, implemented as a wrapper around KMeans. Each web page is converted to a vector in which every entry is the count of a given (tag, class attribute) pair. The dimension of the vectors grows as new pages with previously unseen tags or class attributes arrive. Simple outlier detection is also available and enabled by default; it rejects web pages that are highly unlikely to belong to any cluster.

Install

pip install page_clustering

Usage

import page_clustering

clt = page_clustering.OnlineKMeans(n_clusters=5)
# `pages` must have been obtained somehow
for page in pages:
    clt.add_page(page)
y = clt.classify(new_page)
for page in more_pages:
    clt.add_page(page)
y = clt.classify(yet_another_page)

Demo

wget -r --quota=5M https://news.ycombinator.com
python demo.py news.ycombinator.com

Tests

cd tests
py.test

Algorithm

The first part, vectorization, transforms the web page to a vector. For example, take the following page:

<html>
<body>
<ul class="list1">
    <li>A</li>
    <li>B</li>
</ul>
<ul class="list2">
    <li>Y</li>
    <li>Z</li>
</ul>
</body>
</html>

Each opening tag is reduced to a (tag, class) pair. Every distinct pair is mapped to a vector position, and the value at that position is the number of times the pair appears in the document.

tag, class	position	count
html	0	1
body	1	1
ul, list1	2	1
li	3	4
ul, list2	4	1

The vector is therefore [1, 1, 1, 4, 1]. It is normalized so that its elements sum to 1, which gives the final frequency vector: [0.125, 0.125, 0.125, 0.5, 0.125]
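The counting and normalization steps can be sketched with Python's standard html.parser module. This is only an illustration under assumptions: the names PairCounter and vectorize are hypothetical and are not part of page_clustering, and the library's own parser may handle edge cases (for example multiple classes on one tag) differently.

```python
from html.parser import HTMLParser


class PairCounter(HTMLParser):
    """Counts opening (tag, class) pairs in first-seen order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # Dicts keep insertion order, so the key order doubles as vector positions.
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Closing tags never reach this method, so only opening tags are counted.
        key = (tag, dict(attrs).get("class"))
        self.counts[key] = self.counts.get(key, 0) + 1


def vectorize(html):
    """Return the raw count vector and the normalized frequency vector."""
    parser = PairCounter()
    parser.feed(html)
    counts = list(parser.counts.values())
    total = sum(counts)
    if total == 0:
        # A page without tags has no meaningful frequency vector.
        return counts, []
    return counts, [c / total for c in counts]


page = """
<html>
<body>
<ul class="list1">
    <li>A</li>
    <li>B</li>
</ul>
<ul class="list2">
    <li>Y</li>
    <li>Z</li>
</ul>
</body>
</html>
"""

counts, freqs = vectorize(page)
print(counts)  # [1, 1, 1, 4, 1]
print(freqs)   # [0.125, 0.125, 0.125, 0.5, 0.125]
```

Positions here follow the order in which pairs are first seen, which matches the table above for this page.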

When a new page arrives, it may contain (tag, class) pairs that have not been seen before. For example, imagine that this new page arrives:

<html>
<body>
<p>Another page with a paragraph tag </p>
</body>
</html>

The new page would be mapped according to this table:

tag, class	position	count
html	0	1
body	1	1
ul, list1	2	0
li	3	0
ul, list2	4	0
p	5	1

The vector for this page would be [1, 1, 0, 0, 0, 1], and after normalization (rounded to two decimals): [0.33, 0.33, 0, 0, 0, 0.33].

The new vector has 6 dimensions, so the previous page vector must be extended with zeros on the right to match: [0.125, 0.125, 0.125, 0.5, 0.125, 0].
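The growing vector dimension can be sketched as a position table that assigns the next free index to every unseen pair, plus a helper that pads older vectors with zeros. The names GrowingVocabulary and pad are hypothetical, and the pair counts are typed in by hand from the two example pages; this is an assumption-laden sketch, not the library's code.

```python
class GrowingVocabulary:
    """Maps (tag, class) pairs to vector positions (hypothetical helper)."""

    def __init__(self):
        self.positions = {}

    def to_vector(self, pair_counts):
        # Assign the next free position to any pair not seen before.
        for pair in pair_counts:
            self.positions.setdefault(pair, len(self.positions))
        vector = [0.0] * len(self.positions)
        total = sum(pair_counts.values())
        for pair, count in pair_counts.items():
            vector[self.positions[pair]] = count / total
        return vector


def pad(vector, size):
    """Extend a vector with zeros on the right up to `size` entries."""
    return vector + [0.0] * (size - len(vector))


vocab = GrowingVocabulary()

# Counts for the first example page (two lists).
first = vocab.to_vector({
    ("html", None): 1,
    ("body", None): 1,
    ("ul", "list1"): 1,
    ("li", None): 4,
    ("ul", "list2"): 1,
})

# Counts for the second example page; ("p", None) is new and gets position 5.
second = vocab.to_vector({("html", None): 1, ("body", None): 1, ("p", None): 1})

# The older vector is padded so that both have the same dimension.
first = pad(first, len(vocab.positions))

print(first)                          # [0.125, 0.125, 0.125, 0.5, 0.125, 0.0]
print([round(x, 2) for x in second])  # [0.33, 0.33, 0.0, 0.0, 0.0, 0.33]
```

Padding on the right keeps existing positions stable, so vectors computed earlier never need to be recomputed, only extended.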

Once all needed pages are vectorized, KMeans is applied.
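To illustrate the clustering step, here is a minimal pure-Python version of Lloyd's algorithm (the standard KMeans iteration) run on padded frequency vectors. It is a sketch only: page_clustering wraps a KMeans implementation of its own, the kmeans function below is hypothetical, seeding the centroids with the first k vectors is a simplification, and the last two sample vectors are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


vectors = [
    [0.125, 0.125, 0.125, 0.5, 0.125, 0.0],  # list page from the example
    [1 / 3, 1 / 3, 0.0, 0.0, 0.0, 1 / 3],    # paragraph page from the example
    [0.1, 0.1, 0.1, 0.6, 0.1, 0.0],          # made-up list-like page
    [0.25, 0.25, 0.0, 0.0, 0.0, 0.5],        # made-up paragraph-like page
]

labels, centroids = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

The two list-like pages end up in one cluster and the two paragraph-like pages in the other, which is the behavior a crawler would rely on to group pages by layout.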

Stars: ✭ 30
