
rmax / Scrapy Redis

License: MIT
Redis-based components for Scrapy.

Programming Languages

Python
139335 projects - #7 most used programming language
Makefile
30231 projects
Dockerfile
14818 projects

Projects that are alternatives to or similar to Scrapy Redis

Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (-0.1%)
Mutual labels:  crawler, scrapy, redis, distributed
Spoon
🥄 A package for building site-specific proxy pools.
Stars: ✭ 173 (-96.54%)
Mutual labels:  crawler, redis, distributed
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Stars: ✭ 921 (-81.57%)
Mutual labels:  scrapy, redis, distributed
scrapy-kafka-redis
Distributed crawling/scraping, Kafka And Redis based components for Scrapy
Stars: ✭ 45 (-99.1%)
Mutual labels:  distributed, scrapy
Ruiji.net
crawler framework, distributed crawler extractor
Stars: ✭ 220 (-95.6%)
Mutual labels:  crawler, scrapy
Filesensor
Dynamic sensitive file detection tool based on a crawler.
Stars: ✭ 227 (-95.46%)
Mutual labels:  crawler, scrapy
Dotnetspider
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.
Stars: ✭ 3,233 (-35.31%)
Mutual labels:  crawler, distributed
NScrapy
NScrapy is a .NET Core cross-platform distributed spider framework which provides an easy way to write your own spider.
Stars: ✭ 88 (-98.24%)
Mutual labels:  distributed, scrapy
Scrapy Crawlera
Crawlera middleware for Scrapy
Stars: ✭ 281 (-94.38%)
Mutual labels:  crawler, scrapy
Summer
A Java game server framework with support for distributed and clustered deployment, usable for developing card, board, turn-based, and similar games. High-performance communication is implemented on Netty, with support for TCP, HTTP, WebSocket, and other protocols. It supports message encryption/decryption, attack interception, and black/white list mechanisms, and wraps the connection to and use of Redis caching and MySQL databases. Lightweight and easy to get started with.
Stars: ✭ 336 (-93.28%)
Mutual labels:  redis, distributed
Redsync.go
*DEPRECATED* Please use https://gopkg.in/redsync.v1 (https://github.com/go-redsync/redsync)
Stars: ✭ 292 (-94.16%)
Mutual labels:  redis, distributed
Vault
swiss army knife for hackers
Stars: ✭ 346 (-93.08%)
Mutual labels:  crawler, scrapy
Github Spider
GitHub repository and user analysis crawler.
Stars: ✭ 190 (-96.2%)
Mutual labels:  crawler, scrapy
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (-96.2%)
Mutual labels:  crawler, scrapy
Ecommercecrawlers
Gitee repository: AJay13/ECommerceCrawlers; GitHub repository: DropsDevopsOrg/ECommerceCrawlers; project showcase: http://wechat.doonsec.com
Stars: ✭ 3,073 (-38.52%)
Mutual labels:  crawler, scrapy
Marmot
💐Marmot | Web Crawler/HTTP protocol Download Package 🐭
Stars: ✭ 186 (-96.28%)
Mutual labels:  crawler, scrapy
ptt-web-crawler
Crawler for the PTT web version.
Stars: ✭ 20 (-99.6%)
Mutual labels:  crawler, scrapy
Scrapingoutsourcing
ScrapingOutsourcing focuses on sharing crawler code, aiming to publish a new example every week.
Stars: ✭ 164 (-96.72%)
Mutual labels:  crawler, scrapy
Proxy pool
Python crawler proxy IP pool.
Stars: ✭ 13,964 (+179.39%)
Mutual labels:  crawler, redis
Redisson
Redisson - Redis Java client with features of In-Memory Data Grid. Over 50 Redis based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Publish / Subscribe, Bloom filter, Spring Cache, Tomcat, Scheduler, JCache API, Hibernate, MyBatis, RPC, local cache ...
Stars: ✭ 17,972 (+259.58%)
Mutual labels:  redis, distributed

Scrapy-Redis


Redis-based components for Scrapy.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the same items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

Note

These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.

Requirements

  • Python 2.7, 3.4 or 3.5
  • Redis >= 2.8
  • Scrapy >= 1.1
  • redis-py >= 3.0

Usage

Use the following settings in your project:

# Enable the scheduler that stores the requests queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Enable Redis-based shared stats collection.
STATS_CLASS = "scrapy_redis.stats.RedisStatsCollector"

# The default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# Python versions.
# Caveat: in Python 3.x, the serializer must return string keys and support
# bytes as values. For this reason the json and msgpack modules will not
# work by default. In Python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
# (A minimal sketch of such a module appears after this settings block.)
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't clean up Redis queues; this allows pausing/resuming crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed during distributed crawling.
# This only works if the queue class is SpiderQueue or SpiderStack,
# and it may also block for the same amount of time when the spider starts
# for the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Maximum idle time before closing the spider.
# When the number of idle seconds is greater than MAX_IDLE_TIME_BEFORE_CLOSE, the crawler will close.
# If 0, the crawler will wait forever (DontClose) for the next request.
# If negative, the crawler will close immediately when the queue is empty, just like plain Scrapy.
#MAX_IDLE_TIME_BEFORE_CLOSE = 0

# Store scraped items in Redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (e.g. socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# If True, it uses the Redis ``ZREVRANGE`` and ``ZREMRANGEBYRANK`` operations. You have to use the ``ZADD``
# command to add URLs and scores to the Redis queue. This could be useful if you
# want to use priorities and avoid duplicates in your start urls list.
#REDIS_START_URLS_AS_ZSET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use an encoding other than utf-8 for Redis.
#REDIS_ENCODING = 'latin1'
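
As noted in the SCHEDULER_SERIALIZER comment above, the requests serializer can be any importable module exposing loads and dumps functions. Below is a minimal sketch of such a module; the path myproject/serializer.py is purely illustrative, and it simply mirrors the pickle-based default:

# myproject/serializer.py -- hypothetical module path, for illustration only.
# Any module that exposes loads() and dumps() can be named in SCHEDULER_SERIALIZER.
import pickle

def dumps(obj):
    # Serialize a request dict to bytes using the highest available pickle protocol.
    return pickle.dumps(obj, protocol=-1)

def loads(data):
    # Deserialize bytes back into a request dict.
    return pickle.loads(data)

It would then be enabled with SCHEDULER_SERIALIZER = 'myproject.serializer' in the settings above.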

Note

Version 0.3 changed the requests serialization from marshal to cPickle; therefore, requests persisted with version 0.2 will not work with 0.3.

Running the example project

This example illustrates how to share a spider's requests queue across multiple spider instances, which is especially suitable for broad crawls.

  1. Set up the scrapy_redis package in your PYTHONPATH

  2. Run the crawler for the first time, then stop it:

    $ cd example-project
    $ scrapy crawl dmoz
    ... [dmoz] ...
    ^C
    
  3. Run the crawler again to resume the stopped crawl:

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)
    
  4. Start one or more additional scrapy crawlers:

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)
    
  5. Start one or more post-processing workers (a minimal sketch of such a worker follows these steps):

    $ python process_items.py dmoz:items -v
    ...
    Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
    Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
    ...
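
The bundled process_items.py already does this; the snippet below is only a minimal sketch of what such a worker can look like, assuming the default RedisPipeline behaviour of pushing JSON-encoded items onto the '<spider>:items' list and a Redis server on localhost (adjust the key and connection parameters to your setup):

    # Minimal post-processing worker sketch (not the bundled process_items.py).
    import json
    import redis

    # Connection parameters are assumptions; match your REDIS_HOST/REDIS_PORT settings.
    r = redis.Redis(host='localhost', port=6379)

    while True:
        # RedisPipeline pushes JSON-encoded items onto the '<spider>:items' list.
        _, data = r.blpop('dmoz:items')
        item = json.loads(data)
        # Field names depend on your item definition.
        print('Processing:', item)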
    

Feeding a Spider from Redis

The class scrapy_redis.spiders.RedisSpider enables a spider to read URLs from Redis. The URLs in the Redis queue are processed one after another; if the first request yields more requests, the spider processes those requests before fetching another URL from Redis.

For example, create a file myspider.py with the code below:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass

Then:

  1. run the spider:

    scrapy runspider myspider.py
    
  2. push urls to redis (a redis-py sketch follows):

    redis-cli lpush myspider:start_urls http://google.com
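
The same can be done from Python with redis-py. A minimal sketch, assuming the default start urls key pattern ('%(name)s:start_urls') and a Redis server on localhost:

    import redis

    r = redis.Redis()  # assumes Redis on localhost:6379

    # Default mode: start urls are read from a Redis list, so LPUSH works.
    r.lpush('myspider:start_urls', 'http://google.com')

    # If REDIS_START_URLS_AS_SET or REDIS_START_URLS_AS_ZSET is enabled,
    # use SADD or ZADD instead:
    # r.sadd('myspider:start_urls', 'http://google.com')
    # r.zadd('myspider:start_urls', {'http://google.com': 0})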
    

Note

These spiders rely on the spider idle signal to fetch start URLs, so there may be a delay of a few seconds between the time you push a new URL and the time the spider starts crawling it.

Contributions

Donate BTC: 13haqimDV7HbGWtz7uC6wP1zvsRWRAhPmF

Donate BCC: CSogMjdfPZnKf1p5ocu3gLR54Pa8M42zZM

Donate ETH: 0x681d9c8a2a3ff0b612ab76564e7dca3f2ccc1c0d

Donate LTC: LaPHpNS1Lns3rhZSvvkauWGDfCmDLKT8vP
