Web Scraping Framework

=======
Crawler
=======

.. image:: https://travis-ci.org/lorien/crawler.png?branch=master
    :target: https://travis-ci.org/lorien/crawler

.. image:: https://coveralls.io/repos/lorien/crawler/badge.svg?branch=master
    :target: https://coveralls.io/r/lorien/crawler?branch=master

This project is about:

* 100% test coverage
* Speed
* Reasonable memory consumption
* SOCKS proxy support
* Runtime metrics API

Usage Example
=============

.. code:: python

    from urllib.parse import urlsplit

    from crawler import Crawler, Request


    class TestCrawler(Crawler):

        def task_generator(self):
            # Seed the crawl with one request per host
            for host in ('yandex.ru', 'github.com'):
                yield Request('page', 'https://%s/' % host, meta={'host': host})

        def handler_page(self, req, res):
            title = res.xpath('//title').text(default='N/A')
            print('Title of [%s]: %s' % (req.url, title))

            # Collect absolute links whose network location
            # does not contain the requested host
            ext_urls = set()
            for elem in res.xpath('//a[@href]'):
                url = elem.attr('href')
                parts = urlsplit(url)
                if parts.netloc and req.meta['host'] not in parts.netloc:
                    ext_urls.add(url)

            print('External URLs:')
            for url in ext_urls:
                print(' * %s' % url)


    bot = TestCrawler(num_network_threads=10)
    bot.run()
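
The external-link check inside ``handler_page`` can be tried on its own, without the crawler or any network access. The sketch below uses only the standard library; the helper name ``is_external`` and the sample links are hypothetical and are not part of the crawler package.

```python
from urllib.parse import urlsplit


def is_external(url, host):
    """Return True if `url` is absolute and its netloc does not contain `host`.

    Hypothetical helper mirroring the condition used in the usage example.
    """
    netloc = urlsplit(url).netloc
    # Relative links have an empty netloc and are treated as internal
    return bool(netloc) and host not in netloc


# Sample links as they might appear in href attributes (made up for illustration)
links = [
    '/about',
    'https://github.com/lorien',
    'https://gist.github.com/',
    'https://twitter.com/github',
    '//cdn.example.com/x.js',
]

external = sorted(url for url in links if is_external(url, 'github.com'))
for url in external:
    print(' * %s' % url)
```

Because this is a substring test, subdomains such as ``gist.github.com`` count as internal, but so would an unrelated host like ``notgithub.com``. If that matters for your crawl, a stricter comparison of the hostname may be a better fit.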

Quick Way to Start New Project
==============================

* Install crawler with ``pip install crawler``
* Run command ``crawl_start_project <project_name>``
* ``cd`` into the new directory
* Run ``make build``

This builds a virtualenv with everything you need to start using crawler. To activate the virtualenv, run the ``pipenv shell`` command.

Put your crawler code into the ``crawlers/`` directory, then run it with the command ``crawl <CrawlerClassName>``.

Installation
============

.. code:: bash

    pip install crawler
