scrapinghub / webpager

Licence: other
Paginating the web

Webpager

A simple library that classifies whether an anchor on an HTML page is a pagination link.

Installation

Clone the repository, then install the package requirements (the package requires lxml and scikit-learn):

$ pip install -r requirements.txt

Then install the package itself:

$ python setup.py install

Usage

Fetch an HTML page from somewhere:

>>> from urllib import urlopen
>>> url = 'http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-Trattoria_Caffe_Monteverdi-Hong_Kong.html'
>>> html = urlopen(url).read()
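
The snippet above uses the Python 2 `urllib` module. On Python 3 the same function lives in `urllib.request`. The following is a minimal sketch of the equivalent fetch; the helper name `fetch_html` and its `timeout` default are illustrative assumptions and not part of webpager, and whether webpager itself runs on Python 3 is not stated in this README.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download `url` and return the raw response body as bytes.

    Hypothetical helper for illustration only; it is not part of webpager.
    """
    # The response object is a context manager, so the connection is
    # closed as soon as the body has been read.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `html = fetch_html(url)` would then stand in for the `urlopen(url).read()` line.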

Load WebPager and classify the anchors:

>>> from webpager import WebPager
>>> webpager = WebPager()
>>> for anchor, label in webpager.paginate(html, url):
...     if label:
...         print anchor.get('href')
...
http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or40-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
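
The output above contains the same URL twice, presumably because more than one anchor on the page points to it. The following is a minimal sketch of collecting the positive links without duplicates. It assumes only what the usage example shows: that `paginate` yields `(anchor, label)` pairs and that `anchor.get('href')` returns the link target. The helper name `unique_pagination_links` is hypothetical and not part of webpager.

```python
def unique_pagination_links(pairs):
    """Return the hrefs of anchors labelled as pagination links.

    `pairs` is any iterable of (anchor, label) tuples, where `anchor`
    offers a dict-like .get('href'). Duplicates are dropped and the
    order of first appearance is kept.
    Hypothetical helper for illustration only; it is not part of webpager.
    """
    seen = set()
    links = []
    for anchor, label in pairs:
        if not label:
            continue  # not classified as a pagination link
        href = anchor.get('href')
        if href is None or href in seen:
            continue  # anchor without a target, or already collected
        seen.add(href)
        links.append(href)
    return links


# Stand-in data: plain dicts also provide .get(), so they can mimic
# anchor elements here without fetching or parsing a real page.
sample = [
    ({'href': '/reviews-or10.html'}, True),
    ({'href': '/about.html'}, False),
    ({'href': '/reviews-or40.html'}, True),
    ({'href': '/reviews-or10.html'}, True),
]
links = unique_pagination_links(sample)
```

With the objects from the usage example, the call would be `unique_pagination_links(webpager.paginate(html, url))`.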

Training

See train.ipynb for more details.
