
tenlee2012 / scrapy-kafka-redis

License: Apache-2.0
Distributed crawling/scraping, Kafka And Redis based components for Scrapy

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to scrapy-kafka-redis

Scrapy Redis
Redis-based components for Scrapy.
Stars: ✭ 4,998 (+11006.67%)
Mutual labels:  distributed, scrapy
NScrapy
NScrapy is a .NET Core cross-platform distributed spider framework which provides an easy way to write your own spider
Stars: ✭ 88 (+95.56%)
Mutual labels:  distributed, scrapy
Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
Stars: ✭ 921 (+1946.67%)
Mutual labels:  distributed, scrapy
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+10995.56%)
Mutual labels:  distributed, scrapy
Gerapy
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Stars: ✭ 2,601 (+5680%)
Mutual labels:  distributed, scrapy
sprawl
Alpha implementation of the Sprawl distributed marketplace protocol.
Stars: ✭ 27 (-40%)
Mutual labels:  distributed
blockchain-hackathon
An electronic health record (EHR) system built on Hyperledger Composer blockchain
Stars: ✭ 67 (+48.89%)
Mutual labels:  distributed
meesee
Task queue with long-lived workers for work-based parallelization, using processes and Redis as the back-end; for distributed computing.
Stars: ✭ 14 (-68.89%)
Mutual labels:  distributed
DemonHunter
Distributed Honeypot
Stars: ✭ 54 (+20%)
Mutual labels:  distributed
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+77.78%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
scrapy mysql pipeline
Stars: ✭ 47 (+4.44%)
Mutual labels:  scrapy
itemadapter
Common interface for data container classes
Stars: ✭ 47 (+4.44%)
Mutual labels:  scrapy
heat
Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
Stars: ✭ 127 (+182.22%)
Mutual labels:  distributed
fernando-pessoa
A classifier for Fernando Pessoa's poems according to his heteronyms
Stars: ✭ 31 (-31.11%)
Mutual labels:  scrapy
ArticleSpider
Crawls Zhihu, Jobbole, and Lagou with Scrapy, and uses Elasticsearch + Django to build a search-engine website; see README_zh.md for the implementation roadmap, the distributed crawler, and anti-crawling countermeasures.
Stars: ✭ 34 (-24.44%)
Mutual labels:  scrapy
Credits
Credits (CRDS) - An Evolving Currency For An Evolving Society
Stars: ✭ 14 (-68.89%)
Mutual labels:  distributed
toy-rpc
A distributed RPC framework in Java, based on Netty, Protostuff, and Zookeeper
Stars: ✭ 55 (+22.22%)
Mutual labels:  distributed
erl dist
Rust Implementation of Erlang Distribution Protocol
Stars: ✭ 110 (+144.44%)
Mutual labels:  distributed
scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
Stars: ✭ 17 (-62.22%)
Mutual labels:  scrapy
dask-sql
Distributed SQL Engine in Python using Dask
Stars: ✭ 271 (+502.22%)
Mutual labels:  distributed

Chinese Documentation | English

Scrapy-Kafka-Redis

Even with a Bloomfilter algorithm, [scrapy-redis](https://github.com/rmax/scrapy-redis) still consumes a lot of memory when the number of requests is large. This project is modeled on scrapy-redis.

Features

  • Supports distributed crawling
  • Uses Redis as the deduplication store, with a Bloomfilter to reduce the memory footprint while handling a much larger number of request fingerprints (see the sketch after this list)
  • Uses Kafka as the request queue, so very large request backlogs can be held; capacity is bounded by disk size rather than by RAM
  • Because of how Kafka works, priority queues are not supported; only FIFO queues are available
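
For context, a Bloom filter trades a small false-positive rate for a much smaller memory footprint: each request fingerprint is hashed to several bit positions in a Redis bitmap, and a fingerprint counts as already seen only when all of its bits are set. The following is a minimal illustrative sketch of the idea in Python; the class name, key name, and hash scheme are assumptions for illustration, not this package's actual implementation.

import hashlib

import redis


class BloomFilterSketch:
    """Illustrative Redis-backed Bloom filter; not the package's actual code."""

    def __init__(self, server, key='demo:bloomfilter',
                 bit_size=1 << 31, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.server = server      # a redis.StrictRedis instance
        self.key = key            # Redis key holding the bitmap
        self.bit_size = bit_size  # bits per bitmap (a Redis string allows up to 2^32 bits)
        self.seeds = seeds        # one derived bit position per seed

    def _offsets(self, fingerprint):
        # Derive several bit positions from a single MD5 digest of the fingerprint.
        digest = int(hashlib.md5(fingerprint.encode('utf-8')).hexdigest(), 16)
        return [(digest * seed) % self.bit_size for seed in self.seeds]

    def exists(self, fingerprint):
        # "Seen" only if every bit is set; false positives are possible,
        # false negatives are not.
        return all(self.server.getbit(self.key, pos) for pos in self._offsets(fingerprint))

    def insert(self, fingerprint):
        for pos in self._offsets(fingerprint):
            self.server.setbit(self.key, pos, 1)


# Usage against the Redis instance configured below:
# server = redis.StrictRedis.from_url('redis://localhost:6378/1')
# bf = BloomFilterSketch(server)
# if not bf.exists(fp):
#     bf.insert(fp)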

Dependencies

  • Python 3.0+
  • Redis >= 2.8
  • Scrapy >= 1.5
  • kafka-python >= 1.4.0
  • Kafka <= 1.1.0 (since kafka-python supports Kafka only up to version 1.1.0)

How to Use

  • Install the package: pip install scrapy-kafka-redis
  • Configure settings.py. The following parameters must be added:
# Use the Kafka-backed scheduler to store the request queue
SCHEDULER = "scrapy_kafka_redis.scheduler.Scheduler"

# Use BloomFilter for request deduplication
DUPEFILTER_CLASS = "scrapy_kafka_redis.dupefilter.BloomFilter"

Default values of the other optional parameters:

# Redis key template for the deduplication store
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

REDIS_CLS = redis.StrictRedis
REDIS_ENCODING = 'utf-8'
REDIS_URL = 'redis://localhost:6378/1'

REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': REDIS_ENCODING,
}

KAFKA_BOOTSTRAP_SERVERS = ['localhost:9092']
# Default topic template for the scheduler's request queue
SCHEDULER_QUEUE_TOPIC = '%(spider)s-requests'
# Queue class used by the scheduler
SCHEDULER_QUEUE_CLASS = 'scrapy_kafka_redis.queue.KafkaQueue'
# Redis key template used by the scheduler's dupefilter
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
# Deduplication class used by the scheduler
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_kafka_redis.dupefilter.BloomFilter'
# Number of blocks in the BloomFilter algorithm
BLOOM_BLOCK_NUM = 1

# Topic template from which start URLs are consumed
START_URLS_TOPIC = '%(name)s-start_urls'

KAFKA_BOOTSTRAP_SERVERS = None
# Parameters for the Kafka producer that writes to the request queue
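# NOTE: dumps/loads below are the package's default request (de)serializers;
# import or supply your own callables if you override these settings.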
KAFKA_REQUEST_PRODUCER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_serializer': dumps
}
# Parameters for the Kafka consumer that reads the request queue
KAFKA_REQUEST_CONSUMER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_deserializer': loads
}
# Parameters for the Kafka consumer that reads the start-URLs topic
KAFKA_START_URLS_CONSUMER_PARAMS = {
    'api_version': (0, 10, 1),
    'value_deserializer': lambda m: m.decode('utf-8'),
}
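
Note that the %(spider)s and %(name)s placeholders are filled in with the spider's name: for the DemoSpider shown next (name = "demo"), the templates resolve to the topics demo-requests and demo-start_urls, matching the topic-creation commands below.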
  • Use it in your spiders:
import scrapy
from scrapy_kafka_redis.spiders import KafkaSpider

class DemoSpider(KafkaSpider):
    name = "demo"
    def parse(self, response):
        pass
  • Create the Kafka topics. Set the number of partitions for each topic according to the number of distributed Scrapy instances you plan to run:
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 3 --replication-factor 1 --topic demo-start_urls

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 3 --replication-factor 1 --topic demo-requests
  • Send start URLs to the start-URLs topic:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo-start_urls

It is recommended to create the topics manually and to specify the number of partitions explicitly.
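
Start URLs can also be pushed programmatically with kafka-python. A minimal sketch, assuming the default START_URLS_TOPIC and broker address from the settings above:

from kafka import KafkaProducer

# Assumes the defaults above: topic 'demo-start_urls', broker at localhost:9092.
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
# The start-URLs consumer deserializes plain UTF-8 strings, so send raw bytes.
producer.send('demo-start_urls', 'http://example.com/'.encode('utf-8'))
producer.flush()  # block until the message is actually delivered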

  • Run the spider
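
Start the crawl with the standard Scrapy CLI. Running the same command in several processes or on several machines, all pointing at the same Kafka brokers and Redis instance, is what makes the crawl distributed:

scrapy crawl demo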

References:

scrapy-redis, Bloomfilter
