
binux / Pyspider

Licence: apache-2.0
A Powerful Spider(Web Crawler) System in Python.

Programming Languages

python
139335 projects - #7 most used programming language
javascript
184084 projects - #8 most used programming language
HTML
75241 projects
CSS
56736 projects


Projects that are alternatives to or similar to Pyspider

Sitemap Generator Cli
Creates an XML-Sitemap by crawling a given site.
Stars: ✭ 214 (-98.6%)
Mutual labels:  crawler
Annie
👾 Fast and simple video download library and CLI tool written in Go
Stars: ✭ 16,369 (+7.4%)
Mutual labels:  crawler
Strong Web Crawler
An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.
Stars: ✭ 238 (-98.44%)
Mutual labels:  crawler
Chromium for spider
dynamic crawler for web vulnerability scanner
Stars: ✭ 220 (-98.56%)
Mutual labels:  crawler
Selenops
A Swift Web Crawler 🕷
Stars: ✭ 225 (-98.52%)
Mutual labels:  crawler
Filesensor
A crawler-based tool for dynamically detecting sensitive files
Stars: ✭ 227 (-98.51%)
Mutual labels:  crawler
Webvideobot
Web crawler.
Stars: ✭ 214 (-98.6%)
Mutual labels:  crawler
Magic google
Google search results crawler, get google search results that you need
Stars: ✭ 247 (-98.38%)
Mutual labels:  crawler
Laravel Crawler Detect
A Laravel wrapper for CrawlerDetect - the web crawler detection library
Stars: ✭ 227 (-98.51%)
Mutual labels:  crawler
Ppspider
Web spider framework built on puppeteer. Supports task queues and task scheduling via decorators, data storage in nedb or MongoDB, and data visualization with user interaction.
Stars: ✭ 237 (-98.44%)
Mutual labels:  crawler
Ruiji.net
crawler framework, distributed crawler extractor
Stars: ✭ 220 (-98.56%)
Mutual labels:  crawler
Arachnid
Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
Stars: ✭ 224 (-98.53%)
Mutual labels:  crawler
Ecommercecrawlers
Gitee repository: AJay13/ECommerceCrawlers. GitHub repository: DropsDevopsOrg/ECommerceCrawlers. Project showcase: http://wechat.doonsec.com
Stars: ✭ 3,073 (-79.84%)
Mutual labels:  crawler
Pychromeless
Python Lambda Chrome Automation (naming pending)
Stars: ✭ 219 (-98.56%)
Mutual labels:  crawler
Fast Lianjia Crawler
A blazing-fast crawler that fetches data directly through the Lianjia API, the fastest in the universe 🚀
Stars: ✭ 247 (-98.38%)
Mutual labels:  crawler
Jd mask robot
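A crawler that monitors face-mask stock on JD.com (no Selenium): QR-code login, price checks, add to cart, ordering, and flash-sale purchases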
Stars: ✭ 216 (-98.58%)
Mutual labels:  crawler
Awesome Java Crawler
A curated collection of crawler-related resources, mainly for Java
Stars: ✭ 228 (-98.5%)
Mutual labels:  crawler
Polite
Be nice on the web
Stars: ✭ 253 (-98.34%)
Mutual labels:  crawler
Weibopicdownloader
A crawler that downloads Weibo images without logging in
Stars: ✭ 247 (-98.38%)
Mutual labels:  crawler
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (-98.48%)
Mutual labels:  crawler

pyspider

A Powerful Spider(Web Crawler) System in Python.

  • Write scripts in Python
  • Powerful WebUI with script editor, task monitor, project manager, and result viewer
  • MySQL, MongoDB, Redis, SQLite, and Elasticsearch as database backends, plus PostgreSQL via SQLAlchemy
  • RabbitMQ, Redis, and Kombu as message queues
  • Task priority, retries, periodic crawling, recrawl by age, etc.
  • Distributed architecture, crawling of JavaScript pages, support for Python 2.{6,7} and 3.{3,4,5,6}, etc.

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

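# The wildcard import provides BaseHandler and the every/config decorators.
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Project-wide crawl settings (left empty here).
    crawl_config = {
    }

    # Run on_start once a day (24 * 60 minutes).
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched page as still valid for 10 days (age is in seconds).
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is the result recorded for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }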
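
The sample above passes `age=10 * 24 * 60 * 60`, a value in seconds that works out to ten days, and schedules `on_start` every `24 * 60` minutes, which is once a day. As a rough illustration of what "recrawl by age" means, here is a plain-Python sketch. It is not pyspider's scheduler code: `should_recrawl`, its parameters, and the `TEN_DAYS` constant are hypothetical names introduced only for this example.

```python
import time

# Same arithmetic as age= in the sample handler: ten days, in seconds.
TEN_DAYS = 10 * 24 * 60 * 60


def should_recrawl(last_crawl_time, age, now=None):
    """Decide whether a page is due for another fetch.

    Hypothetical helper for illustration only; not part of pyspider.
    A page is due if it was never crawled, or if more than `age`
    seconds have passed since `last_crawl_time` (a Unix timestamp).
    """
    if last_crawl_time is None:
        return True
    if now is None:
        now = time.time()
    return (now - last_crawl_time) > age
```

Under this reading, a page fetched an hour ago is skipped when it is requested again, while a page last fetched eleven days ago is fetched again.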

Installation

WARNING: The WebUI is open to the public by default and can be used to execute arbitrary commands that may harm your system. Use it only on an internal network, or enable need-auth for the WebUI.
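
One way to act on the warning above is to generate a config file that turns on need-auth. The sketch below assumes the `webui` section keys (`username`, `password`, `need-auth`) and the `-c` config-file option described in pyspider's command-line documentation; check them against the version you install. The function names, file name, and credentials are placeholders made up for this example.

```python
import json


def build_webui_auth_config(username, password):
    """Return a config dict that requires a login for the WebUI.

    The key names are assumed from pyspider's documented config format.
    """
    return {
        "webui": {
            "username": username,
            "password": password,
            "need-auth": True,
        }
    }


def write_config(path, config):
    """Write the config dict to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; replace them before real use.
    write_config("config.json", build_webui_auth_config("some_name", "some_passwd"))
```

With the file written, starting pyspider as `pyspider -c config.json` should make the WebUI ask for these credentials, assuming the option behaves as documented.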

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • A visual scraping interface like Portia

License

Licensed under the Apache License, Version 2.0
