TeamHG-Memex / scrapy-kafka-export

Licence: MIT License
Scrapy extension which writes crawled items to Kafka

scrapy-kafka-export

The scrapy-kafka-export package provides a Scrapy extension that exports items to Kafka.

The license is MIT.

The extension requires Python 2.7 or 3.4+.

Install

pip install scrapy-kafka-export

Usage

To use KafkaItemExporterExtension, enable and configure it in settings.py:

EXTENSIONS = {
    'scrapy_kafka_export.KafkaItemExporterExtension': 1,
}
KAFKA_EXPORT_ENABLED = True
KAFKA_BROKERS = [
    'kafka1:9093',
    'kafka2:9093',
    'kafka3:9093'
]
KAFKA_TOPIC = 'test-topic'

After that, all scraped items are sent to the configured Kafka topic. If an item has an _id field, its value is used as the message key.
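
The key selection described above can be sketched as follows. This is an illustration only, not the extension's actual code: the helper name to_kafka_message is hypothetical, and JSON with UTF-8 encoding is an assumed serialization format.

```python
import json


def to_kafka_message(item):
    """Return a (key, value) pair of bytes for a dict-like item.

    Hypothetical helper: the JSON/UTF-8 serialization is an assumption,
    not necessarily what scrapy-kafka-export does internally.
    """
    item = dict(item)
    # Use _id as the message key when the item has one; otherwise send no key.
    key = item.get('_id')
    key_bytes = None if key is None else str(key).encode('utf-8')
    value_bytes = json.dumps(item, sort_keys=True).encode('utf-8')
    return key_bytes, value_bytes
```

With Kafka's default partitioner, messages that share a key are routed to the same partition, so a stable _id keeps all versions of an item together.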

SSL-based auth

If your Kafka uses SSL, configure SSL-based auth:

KAFKA_SSL_CONFIG_MODULE = 'myproject'
KAFKA_SSL_CACERT_FILE = 'certificates/ca-cert.pem'
KAFKA_SSL_CLIENTCERT_FILE = 'certificates/client-cert.pem'
KAFKA_SSL_CLIENTKEY_FILE = 'certificates/client-key.pem'

This assumes the following layout for the certificates in the project 'myproject':

myproject_repo/
myproject_repo/myproject/
myproject_repo/myproject/__init__.py
myproject_repo/myproject/certificates/ca-cert.pem
myproject_repo/myproject/certificates/client-cert.pem
myproject_repo/myproject/certificates/client-key.pem
...

If you're using setup.py to deploy the project (using scrapyd or Scrapy Cloud), certificates should be added to package data. Modify setup.py like this:

from setuptools import setup, find_packages

setup(
    name='myproject',
    ...
    # Ship the PEM files with the package so they are available after deployment.
    package_data={
        'myproject': ['certificates/*.pem'],
    },
    ...
)
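
To see why the certificates need to travel with the package, it can help to look at how a resource path such as 'certificates/ca-cert.pem' might be turned into a file path relative to a module. The sketch below shows the general idea under that assumption; it is not the extension's implementation, and resolve_resource is a hypothetical helper.

```python
import importlib
import os


def resolve_resource(module_name, resource_path):
    """Return the absolute path of resource_path inside module_name's directory.

    Hypothetical helper for illustration only.
    """
    module = importlib.import_module(module_name)
    module_dir = os.path.dirname(os.path.abspath(module.__file__))
    # Resource paths use '/' separators regardless of the operating system.
    return os.path.join(module_dir, *resource_path.split('/'))
```

For example, resolve_resource('myproject', 'certificates/ca-cert.pem') would point inside the installed myproject package, which only contains the file if it was declared as package data.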

Settings

  • KAFKA_EXPORT_ENABLED - Flag that enables the extension (False by default).
  • KAFKA_BROKERS - List of Kafka brokers, each in host:port format.
  • KAFKA_TOPIC - Kafka topic that items are sent to.
  • KAFKA_BATCH_SIZE - Kafka batch size (100 by default).
  • KAFKA_SSL_CONFIG_MODULE - Name of the project module that contains the certificate files.
  • KAFKA_SSL_CACERT_FILE - Resource path of the Certificate Authority certificate.
  • KAFKA_SSL_CLIENTCERT_FILE - Resource path of the client certificate.
  • KAFKA_SSL_CLIENTKEY_FILE - Resource path of the client key.

If KAFKA_SSL_CONFIG_MODULE is not set, no certificate will be loaded.
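
A small pre-flight check over these settings might look like the sketch below. It is not part of scrapy-kafka-export: check_kafka_settings is a hypothetical helper that takes a plain dict, and the rule that all three certificate paths must be present once KAFKA_SSL_CONFIG_MODULE is set is an assumption made for the example.

```python
def check_kafka_settings(settings):
    """Return a list of problems found in a dict of KAFKA_* settings."""
    problems = []
    if not settings.get('KAFKA_EXPORT_ENABLED', False):
        return problems  # extension disabled, nothing to check

    brokers = settings.get('KAFKA_BROKERS') or []
    if not brokers:
        problems.append('KAFKA_BROKERS is empty')
    for broker in brokers:
        # Each broker must look like host:port with a numeric port.
        host, sep, port = broker.rpartition(':')
        if not (sep and host and port.isdigit()):
            problems.append('bad broker address: %r' % broker)

    if not settings.get('KAFKA_TOPIC'):
        problems.append('KAFKA_TOPIC is not set')

    # Assumption: SSL settings only matter when the config module is set.
    if settings.get('KAFKA_SSL_CONFIG_MODULE'):
        for name in ('KAFKA_SSL_CACERT_FILE',
                     'KAFKA_SSL_CLIENTCERT_FILE',
                     'KAFKA_SSL_CLIENTKEY_FILE'):
            if not settings.get(name):
                problems.append('%s is not set' % name)
    return problems
```

Running it against a dict copy of the project settings before starting a crawl gives an early, readable list of misconfigurations.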

Writer

To push Scrapy items to Kafka from a script, use scrapy_kafka_export.writer.ScrapyKafkaTopicWriter instead of scrapy_kafka_export.KafkaItemExporterExtension; see its docstring for details.

