
tenlee2012 / scrapy-kafka-redis

License: Apache-2.0
Distributed crawling/scraping, Kafka And Redis based components for Scrapy

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to scrapy-kafka-redis

Scrapy Redis
Redis-based components for Scrapy.
Stars: ✭ 4,998 (+11006.67%)
Mutual labels:  distributed, scrapy
NScrapy
NScrapy is a .NET Core cross-platform distributed spider framework which provides an easy way to write your own spider
Stars: ✭ 88 (+95.56%)
Mutual labels:  distributed, scrapy
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
Stars: ✭ 921 (+1946.67%)
Mutual labels:  distributed, scrapy
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+10995.56%)
Mutual labels:  distributed, scrapy
Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+5680%)
Mutual labels:  distributed, scrapy
sprawl
Alpha implementation of the Sprawl distributed marketplace protocol.
Stars: ✭ 27 (-40%)
Mutual labels:  distributed
blockchain-hackathon
An electronic health record (EHR) system built on Hyperledger Composer blockchain
Stars: ✭ 67 (+48.89%)
Mutual labels:  distributed
meesee
Task queue with long-lived workers for work-based parallelization, using processes and Redis as the back-end; for distributed computing.
Stars: ✭ 14 (-68.89%)
Mutual labels:  distributed
DemonHunter
Distributed Honeypot
Stars: ✭ 54 (+20%)
Mutual labels:  distributed
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+77.78%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
scrapy mysql pipeline
Stars: ✭ 47 (+4.44%)
Mutual labels:  scrapy
itemadapter
Common interface for data container classes
Stars: ✭ 47 (+4.44%)
Mutual labels:  scrapy
heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
Stars: ✭ 127 (+182.22%)
Mutual labels:  distributed
fernando-pessoa
A classifier for Fernando Pessoa's poems according to his heteronyms
Stars: ✭ 31 (-31.11%)
Mutual labels:  scrapy
ArticleSpider
Crawls Zhihu, Jobbole, and Lagou with Scrapy, and uses Elasticsearch + Django to build a search-engine website; see README_zh.md for the implementation roadmap, the distributed crawler, and anti-crawling countermeasures.
Stars: ✭ 34 (-24.44%)
Mutual labels:  scrapy
Credits
Credits (CRDS) - An Evolving Currency For An Evolving Society
Stars: ✭ 14 (-68.89%)
Mutual labels:  distributed
toy-rpc
A distributed RPC framework in Java, based on Netty, Protostuff, and Zookeeper
Stars: ✭ 55 (+22.22%)
Mutual labels:  distributed
erl dist
Rust Implementation of Erlang Distribution Protocol
Stars: ✭ 110 (+144.44%)
Mutual labels:  distributed
scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
Stars: ✭ 17 (-62.22%)
Mutual labels:  scrapy
dask-sql
Distributed SQL Engine in Python using Dask
Stars: ✭ 271 (+502.22%)
Mutual labels:  distributed

Chinese Documentation | English

Scrapy-Kafka-Redis

Even with a Bloomfilter algorithm, [scrapy-redis](https://github.com/rmax/scrapy-redis) still consumes a lot of memory when the number of requests is large. This project is modeled on scrapy-redis.

Features

  • Supports distributed crawling
  • Uses Redis as the deduplication store, with a Bloomfilter to reduce the memory footprint while handling a much larger number of request fingerprints (see the sketch after this list)
  • Uses Kafka as the request queue, so very large request backlogs can be held; capacity is bounded by disk size rather than by RAM
  • Because of how Kafka works, priority queues are not supported; only FIFO queues are available
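
For context, a Bloom filter trades a small false-positive rate for a much smaller memory footprint: each request fingerprint is hashed to several bit positions in a Redis bitmap, and a fingerprint counts as already seen only when all of its bits are set. The following is a minimal illustrative sketch of the idea in Python; the class name, key name, and hash scheme are assumptions for illustration, not this package's actual implementation.

import hashlib

import redis


class BloomFilterSketch:
    """Illustrative Redis-backed Bloom filter; not the package's actual code."""

    def __init__(self, server, key='demo:bloomfilter',
                 bit_size=1 << 31, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.server = server      # a redis.StrictRedis instance
        self.key = key            # Redis key holding the bitmap
        self.bit_size = bit_size  # bits per bitmap (a Redis string allows up to 2^32 bits)
        self.seeds = seeds        # one derived bit position per seed

    def _offsets(self, fingerprint):
        # Derive several bit positions from a single MD5 digest of the fingerprint.
        digest = int(hashlib.md5(fingerprint.encode('utf-8')).hexdigest(), 16)
        return [(digest * seed) % self.bit_size for seed in self.seeds]

    def exists(self, fingerprint):
        # "Seen" only if every bit is set; false positives are possible,
        # false negatives are not.
        return all(self.server.getbit(self.key, pos) for pos in self._offsets(fingerprint))

    def insert(self, fingerprint):
        for pos in self._offsets(fingerprint):
            self.server.setbit(self.key, pos, 1)


# Usage against the Redis instance configured below:
# server = redis.StrictRedis.from_url('redis://localhost:6378/1')
# bf = BloomFilterSketch(server)
# if not bf.exists(fp):
#     bf.insert(fp)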

Dependencies

  • Python 3.0+
  • Redis >= 2.8
  • Scrapy >= 1.5
  • kafka-python >= 1.4.0
  • Kafka <= 1.1.0 (since kafka-python supports Kafka only up to version 1.1.0)

How to Use

  • Install the package: pip install scrapy-kafka-redis
  • Configure settings.py. The following parameters must be added:
# Use the Kafka-backed scheduler to store the request queue
SCHEDULER = "scrapy_kafka_redis.scheduler.Scheduler"

# Use BloomFilter for request deduplication
DUPEFILTER_CLASS = "scrapy_kafka_redis.dupefilter.BloomFilter"

Default values of the other optional parameters:

# Redis key template for the deduplication store
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

REDIS_CLS = redis.StrictRedis
REDIS_ENCODING = 'utf-8'
REDIS_URL = 'redis://localhost:6378/1'

REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': REDIS_ENCODING,
}

KAFKA_BOOTSTRAP_SERVERS = ['localhost:9092']
# Default topic template for the scheduler's request queue
SCHEDULER_QUEUE_TOPIC = '%(spider)s-requests'
# Queue class used by the scheduler
SCHEDULER_QUEUE_CLASS = 'scrapy_kafka_redis.queue.KafkaQueue'
# Redis key template used by the scheduler's dupefilter
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
# Deduplication class used by the scheduler
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_kafka_redis.dupefilter.BloomFilter'
# Number of blocks in the BloomFilter algorithm
BLOOM_BLOCK_NUM = 1

# Topic template from which start URLs are consumed
START_URLS_TOPIC = '%(name)s-start_urls'

KAFKA_BOOTSTRAP_SERVERS = None
# Parameters for the Kafka producer that writes to the request queue
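# NOTE: dumps/loads below are the package's default request (de)serializers;
# import or supply your own callables if you override these settings.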
KAFKA_REQUEST_PRODUCER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_serializer': dumps
}
# Parameters for the Kafka consumer that reads the request queue
KAFKA_REQUEST_CONSUMER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_deserializer': loads
}
# Parameters for the Kafka consumer that reads the start-URLs topic
KAFKA_START_URLS_CONSUMER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_deserializer': lambda m: m.decode('utf-8'),
}
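
Note that the %(spider)s and %(name)s placeholders are filled in with the spider's name: for the DemoSpider shown next (name = "demo"), the templates resolve to the topics demo-requests and demo-start_urls, matching the topic-creation commands below.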
  • Use it in your spiders:
import scrapy
from scrapy_kafka_redis.spiders import KafkaSpider

class DemoSpider(KafkaSpider):
    name = "demo"
    def parse(self, response):
        pass
  • Create the Kafka topics. Set the number of partitions for each topic according to the number of distributed Scrapy instances you plan to run:
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 3 --replication-factor 1 --topic demo-start_urls

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 3 --replication-factor 1 --topic demo-requests
  • Send start URLs to the start-URLs topic:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo-start_urls

It is recommended to create the topics manually and to specify the number of partitions explicitly.
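
Start URLs can also be pushed programmatically with kafka-python. A minimal sketch, assuming the default START_URLS_TOPIC and broker address from the settings above:

from kafka import KafkaProducer

# Assumes the defaults above: topic 'demo-start_urls', broker at localhost:9092.
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
# The start-URLs consumer deserializes plain UTF-8 strings, so send raw bytes.
producer.send('demo-start_urls', 'http://example.com/'.encode('utf-8'))
producer.flush()  # block until the message is actually delivered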

  • Run the spider
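
Start the crawl with the standard Scrapy CLI. Running the same command in several processes or on several machines, all pointing at the same Kafka brokers and Redis instance, is what makes the crawl distributed:

scrapy crawl demo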

References:

scrapy-redis, Bloomfilter
