scrapinghub / Aile

Licence: mit

Automatic Item List Extraction

Labels

html data-science

Projects that are alternatives of or similar to Aile

Phormatics

Using A.I. and computer vision to build a virtual personal fitness trainer. (Most Startup-Viable Hack - HackNYU2018)

Stars: ✭ 79 (-7.06%)

Mutual labels: data-science

Flyte

Accelerate your ML and Data workflows to production. Flyte is a production grade orchestration system for your Data and ML workloads. It has been battle tested at Lyft, Spotify, freenome and others and truly open-source.

Stars: ✭ 1,242 (+1361.18%)

Mutual labels: data-science

Jupytemplate

Templates for jupyter notebooks

Stars: ✭ 85 (+0%)

Mutual labels: data-science

Learn machine learning

Road to Machine Learning

Stars: ✭ 81 (-4.71%)

Mutual labels: data-science

Malwaredatascience

Malware Data Science Reading Diary / Notes

Stars: ✭ 82 (-3.53%)

Mutual labels: data-science

Conferences

List of Machine Learning & Data Science Conferences

Stars: ✭ 83 (-2.35%)

Mutual labels: data-science

Sayn

Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).

Stars: ✭ 79 (-7.06%)

Mutual labels: data-science

Knet.jl

Koç University deep learning framework.

Stars: ✭ 1,260 (+1382.35%)

Mutual labels: data-science

Databench

Data analysis tool.

Stars: ✭ 82 (-3.53%)

Mutual labels: data-science

Maze

Maze Applied Reinforcement Learning Framework

Stars: ✭ 85 (+0%)

Mutual labels: data-science

Gopup

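Data interfaces: Baidu, Google, Toutiao and Weibo indices, macroeconomic data, interest rate data, currency exchange rates, "Qianlima" and unicorn companies, Xinwen Lianbo news transcripts, film and TV box-office data, university lists, epidemic data…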

Stars: ✭ 1,229 (+1345.88%)

Mutual labels: data-science

Openml R

R package to interface with OpenML

Stars: ✭ 81 (-4.71%)

Mutual labels: data-science

Xcessiv

A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling in Python.

Stars: ✭ 1,255 (+1376.47%)

Mutual labels: data-science

Pydepta

A python implementation of DEPTA

Stars: ✭ 79 (-7.06%)

Mutual labels: data-science

R Text Data

List of textual data sources to be used for text mining in R

Stars: ✭ 85 (+0%)

Mutual labels: data-science

Setl

A simple Spark-powered ETL framework that just works 🍺

Stars: ✭ 79 (-7.06%)

Mutual labels: data-science

Dltk

Deep Learning Toolkit for Medical Image Analysis

Stars: ✭ 1,249 (+1369.41%)

Mutual labels: data-science

Topic Modeling Tool

A point-and-click tool for creating and analyzing topic models produced by MALLET.

Stars: ✭ 85 (+0%)

Mutual labels: data-science

Pymrmr

Python3 binding to mRMR Feature Selection algorithm (currently not maintained)

Stars: ✭ 85 (+0%)

Mutual labels: data-science

Sortingalgorithm.hayateshiki

Hayate-Shiki is an improved merge sort algorithm with the goal of "faster than quick sort".

Stars: ✭ 84 (-1.18%)

Mutual labels: data-science

Automatic Item List Extraction

This repository is a temporary container for experiments in automatic extraction of lists and tables from web pages. At some later point I will merge the surviving algorithms into either scrapely or portia.

I document my ideas and algorithm descriptions at readthedocs.

The current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by scrapely. An alternative approach would also use the web page rendering information (this script renders a tree of bounding boxes for each element).
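
To make the "stream of HTML tags" idea concrete, here is a minimal sketch. It uses Python's standard html.parser as a stand-in for scrapely, so it does not reflect how scrapely actually tokenizes a page; the names TagStream and tag_stream are made up for this example.

```python
from html.parser import HTMLParser


class TagStream(HTMLParser):
    """Collect opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped: only the tag name matters for this sketch.
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def tag_stream(html):
    """Return the page as a flat list of tag tokens (hypothetical helper)."""
    parser = TagStream()
    parser.feed(html)
    parser.close()
    return parser.tags


sample = "<ul><li><a href='/1'>one</a></li><li><a href='/2'>two</a></li></ul>"
stream = tag_stream(sample)
print(stream)
```

In the flattened stream the two list items show up as the same run of tokens repeated twice, which is the kind of repetition the algorithms below try to detect.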

Installation

```
pip install -r requirements.txt
python setup.py develop
```

Running

To get a feel for how it works, try the two demo scripts included in the repo.

demo1.py annotates the HTML code of a web page, marking in red the lines that form part of a repeating item and prefixing each one with its field number inside the item. The output is written to the file 'annotated.html'.
```
python demo1.py https://news.ycombinator.com
```
demo2.py labels, colors and draws the HTML tree so that repeating elements are easy to see. The output is interactive (requires PyQt4).
```
python demo2.py https://news.ycombinator.com
```

Algorithms

We are trying to auto-detect repeating patterns in the tags, not necessarily made of li, tr or td tags.

Clustering trees with a measure of similarity

The idea is to compute the distance between all subtrees in the web page and run a clustering algorithm with this distance matrix. For a web page of size N this can be achieved in time O(N^2). The current algorithm actually computes a kernel and from the kernel computes the distance. The algorithm is based on:

Kernels for semi-structured data
Hisashi Kashima, Teruo Koyanagi

Once the distance between all subtrees of the web page has been computed, DBSCAN clustering is run on the distance matrix. The resulting clusters are then post-processed a little to produce the final item list.
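
The kernel, distance and clustering steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the kernel is a plain dot product over tag counts (a hypothetical stand-in, not the tree kernel from the paper), the kernel-to-distance step uses the standard identity d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y)), and the DBSCAN is a minimal pure-Python version rather than the implementation the repository uses. The names tag_count_kernel, kernel_to_distance and dbscan, and the eps and min_samples values, are invented for the example.

```python
import math
from collections import Counter


def tag_count_kernel(a, b):
    """Toy kernel: dot product of tag-count vectors (stand-in for a tree kernel)."""
    ca, cb = Counter(a), Counter(b)
    return float(sum(ca[t] * cb[t] for t in ca))


def kernel_to_distance(K):
    """Turn a kernel (Gram) matrix into the distance it induces."""
    n = len(K)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = K[i][i] + K[j][j] - 2.0 * K[i][j]
            D[i][j] = math.sqrt(max(sq, 0.0))  # guard against rounding below 0
    return D


def dbscan(D, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 means noise."""
    n = len(D)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if D[p][q] <= eps]
        if len(neighbors) < min_samples:
            labels[p] = -1  # may later be claimed as a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if D[q][r] <= eps]
            if len(q_neighbors) >= min_samples:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels


# Each subtree is represented here by the list of tags it contains.
subtrees = [
    ["tr", "td", "a", "td", "span"],
    ["tr", "td", "a", "td", "span"],
    ["tr", "td", "a", "td", "span", "span"],
    ["div", "p", "p", "p", "img"],
]
K = [[tag_count_kernel(a, b) for b in subtrees] for a in subtrees]
D = kernel_to_distance(K)
labels = dbscan(D, eps=1.5, min_samples=2)
print(labels)
```

The three row-like subtrees end up in one cluster and the unrelated div is marked as noise. Both the kernel matrix and the distance matrix are N by N, which matches the O(N^2) cost mentioned above.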

Markov models

The problem of detecting repeating patterns in streams is known as motif discovery, and most of the literature about it seems to be published in the field of genetics. Inspired by this, there is a branch that explores this direction (MEME and Profile HMM algorithms).

The Markov model approach has the following problems right now:

Requires several web pages for training, depending on the web page type
Training is performed using the EM algorithm, which requires several attempts until a good optimum is achieved
The number of hidden states is hard to determine. There are some heuristics applied that work partially
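
The "several attempts" issue with EM can be illustrated with a small sketch. It is not the project's HMM training: the model is a toy mixture of two biased coins with equal mixing weights, chosen only because its EM updates fit in a few lines. The names em_once and log_likelihood, the data, and the number of restarts are all assumptions made for the example.

```python
import math
import random


def clamp(p):
    """Keep probabilities away from 0 and 1 so logs and divisions stay safe."""
    return min(max(p, 1e-6), 1.0 - 1e-6)


def log_likelihood(data, theta_a, theta_b):
    """Log-likelihood of the two-coin mixture, up to an additive constant."""
    total = 0.0
    for heads, tails in data:
        like_a = theta_a ** heads * (1.0 - theta_a) ** tails
        like_b = theta_b ** heads * (1.0 - theta_b) ** tails
        total += math.log(0.5 * like_a + 0.5 * like_b)
    return total


def em_once(data, seed, iterations=50):
    """One EM run from a random starting point; returns (loglik, theta_a, theta_b)."""
    rng = random.Random(seed)
    theta_a = rng.uniform(0.05, 0.95)
    theta_b = rng.uniform(0.05, 0.95)
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for heads, tails in data:
            # E-step: responsibility of coin A for this trial.
            like_a = theta_a ** heads * (1.0 - theta_a) ** tails
            like_b = theta_b ** heads * (1.0 - theta_b) ** tails
            resp_a = like_a / (like_a + like_b)
            heads_a += resp_a * heads
            tails_a += resp_a * tails
            heads_b += (1.0 - resp_a) * heads
            tails_b += (1.0 - resp_a) * tails
        # M-step: re-estimate each coin from its weighted counts.
        theta_a = clamp(heads_a / (heads_a + tails_a))
        theta_b = clamp(heads_b / (heads_b + tails_b))
    return log_likelihood(data, theta_a, theta_b), theta_a, theta_b


# Each entry is (heads, tails) for one series of ten flips of an unknown coin.
data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

# Several restarts from different seeds; keep the run with the best likelihood.
results = [em_once(data, seed) for seed in range(10)]
best = max(results, key=lambda r: r[0])
print(best)
```

Each restart is independent of the others, so the list comprehension over seeds is the natural place to parallelize.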

These problems are not insurmountable (I think) but require a lot of work:

Precision could be improved using conditional random fields. These could alleviate the need for data.
Training can be heavily parallelized. This is already done using joblib on a single PC, but it could be further improved using a cluster of computers
There are some papers about hidden state merging/splitting and even an infinite number of states


Stars: ✭ 85
