
alash3al / scrapyr

License: Apache-2.0
a simple & tiny scrapy clustering solution, considered a drop-in replacement for scrapyd

Programming Languages

Go
31211 projects - #10 most used programming language
HCL
1544 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to scrapyr

SpiderManager
A web crawler management platform
Stars: ✭ 27 (-46%)
Mutual labels:  scrapy
elixir cluster
Distributed Elixir Cluster on Render with libcluster and Mix Releases
Stars: ✭ 15 (-70%)
Mutual labels:  clustering
douban-spider
A Douban movie spider based on the Scrapy framework
Stars: ✭ 25 (-50%)
Mutual labels:  scrapy
BPRMeth
Modelling DNA methylation profiles
Stars: ✭ 18 (-64%)
Mutual labels:  clustering
zio-entity
Zio-Entity, a distributed, high performance, functional event sourcing library
Stars: ✭ 68 (+36%)
Mutual labels:  clustering
revolver
REVOLVER - Repeated Evolution in Cancer
Stars: ✭ 52 (+4%)
Mutual labels:  clustering
allitebooks.com
Download all the ebooks from "allitebooks.com", with an indexed CSV
Stars: ✭ 24 (-52%)
Mutual labels:  scrapy
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-56%)
Mutual labels:  scrapy
graphgrove
A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search
Stars: ✭ 29 (-42%)
Mutual labels:  clustering
M3C
Monte Carlo Reference-based Consensus Clustering
Stars: ✭ 24 (-52%)
Mutual labels:  clustering
hpdbscan
Highly parallel DBSCAN (HPDBSCAN)
Stars: ✭ 19 (-62%)
Mutual labels:  clustering
Unsupervised-Learning-in-R
Workshop (6 hours): Clustering (Hdbscan, LCA, Hopach), dimension reduction (UMAP, GLRM), and anomaly detection (isolation forests).
Stars: ✭ 34 (-32%)
Mutual labels:  clustering
ptt-web-crawler
A crawler for the web version of PTT
Stars: ✭ 20 (-60%)
Mutual labels:  scrapy
Python Master Courses
Life is short, I use Python
Stars: ✭ 61 (+22%)
Mutual labels:  scrapy
Clustering-in-Python
Clustering methods in machine learning, with both theory and Python code for each algorithm. Algorithms include K-Means, K-Modes, Hierarchical, DBSCAN and Gaussian Mixture Models (GMM). Interview questions on clustering are included at the end.
Stars: ✭ 27 (-46%)
Mutual labels:  clustering
napari-clusters-plotter
A plugin to use with napari for clustering objects according to their properties.
Stars: ✭ 18 (-64%)
Mutual labels:  clustering
scrapy-pipelines
A collection of pipelines for Scrapy
Stars: ✭ 16 (-68%)
Mutual labels:  scrapy
toutiao
A spider for Toutiao's tech news API
Stars: ✭ 17 (-66%)
Mutual labels:  scrapy
memes-api
API for scraping common meme sites
Stars: ✭ 17 (-66%)
Mutual labels:  scrapy
atlassian-kubernetes
All things Atlassian and Kubernetes
Stars: ✭ 30 (-40%)
Mutual labels:  clustering

scrapyr

a very simple scrapy orchestrator engine that can be distributed across multiple machines to build a scrapy cluster. Under the hood it uses redis as a task broker; this may change in the future to support pluggable brokers, but for now it does the job.

Features

  • uses a simple, human-friendly configuration language called HCL.
  • multiple types of queues/workers (lifo, fifo, weight); see the sketch after this list.
  • you can define multiple workers with different types of queues.
  • ability to override the contents of the scrapy project's settings.py from the same configuration file.
  • a status endpoint that helps you understand what is going on.
  • an enqueue endpoint that lets you push a job into the specified queue, or execute the job instantly and return the extracted items.
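The queue semantics are easiest to picture against Redis primitives. Below is a minimal Go sketch (using the go-redis v9 client) of one plausible mapping: a Redis list for lifo/fifo and a sorted set for weight. The key names and helper functions are illustrative assumptions, not scrapyr's actual internals.

package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// enqueue pushes a serialized task payload according to the queue type.
// fifo and lifo share one LPUSH; they differ only in which end is popped.
// weight uses a sorted set so the highest-weight task pops first.
func enqueue(ctx context.Context, rdb *redis.Client, queueType, key, payload string, weight float64) error {
	if queueType == "weight" {
		return rdb.ZAdd(ctx, key, redis.Z{Score: weight, Member: payload}).Err()
	}
	return rdb.LPush(ctx, key, payload).Err()
}

// dequeue pops the next task payload for the queue type.
func dequeue(ctx context.Context, rdb *redis.Client, queueType, key string) (string, error) {
	switch queueType {
	case "lifo": // last in, first out: pop the same end LPUSH writes to
		return rdb.LPop(ctx, key).Result()
	case "weight": // max weight, first out
		zs, err := rdb.ZPopMax(ctx, key, 1).Result()
		if err != nil || len(zs) == 0 {
			return "", err
		}
		return zs[0].Member.(string), nil
	default: // fifo: pop the opposite end
		return rdb.RPop(ctx, key).Result()
	}
}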

API Examples

  • Getting the status of the cluster
curl --request GET \
  --url http://localhost:1993/status \
  --header 'content-type: application/json'
  • Pushing a task into the queue of the worker worker1, which is pre-defined in scrapyr.hcl
# worker -> the worker name (pre-defined in scrapyr.hcl)
# spider -> the scrapy spider to be executed
# max_execution_time -> the maximum duration the scrapy process may take
# args -> key/value strings, translated to `-a key=value ...` for each key-value pair
# weight -> the weight of the task itself (used by weight-based workers defined in scrapyr.hcl)
curl --request POST \
  --url http://localhost:1993/enqueue \
  --header 'content-type: application/json' \
  --data '{
	"worker": "worker1",
	"spider": "spider_name",
	"max_execution_time": "20s",
	"args": {
            "scrapy_arg_name": "scrapy_arg_value"
      },
	"weight": 10
}'
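For the request above, the worker ends up running the spider as a child process. How that translation could look in Go is sketched below; the helper name and parameters are assumptions for illustration, but the `-a key=value` mapping and the max_execution_time timeout follow the comments above.

package main

import (
	"context"
	"os/exec"
	"time"
)

// runSpider is a hypothetical helper showing how an enqueued task could be
// turned into a scrapy process: each args entry becomes a "-a key=value"
// pair, and max_execution_time bounds the process via a context deadline.
func runSpider(pythonBin, projectDir, spider string, args map[string]string, maxExecutionTime time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), maxExecutionTime)
	defer cancel()

	argv := []string{"-m", "scrapy", "crawl", spider}
	for k, v := range args {
		argv = append(argv, "-a", k+"="+v)
	}

	cmd := exec.CommandContext(ctx, pythonBin, argv...)
	cmd.Dir = projectDir // run inside the scrapy project directory
	return cmd.Run()
}

With the example payload, this amounts to roughly `python3 -m scrapy crawl spider_name -a scrapy_arg_name=scrapy_arg_value`, terminated if it is still running after 20s.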

Configurations

here is an example scrapyr.hcl:

# the webserver listening address
listen_addr = ":1993"

# redis connection string
# it uses url-style connection string
# example: redis://username:password@hostname:port/database_number
redis_dsn = "redis://127.0.0.1:6378/1"

scrapy {
    project_dir = "${HOME}/playground/tstscrapy"

    python_bin = "/usr/bin/python3"

    items_dir = "${PWD}/data"
}

worker worker1 {
    // which method you want the worker to use
    // lifo: last in, first out
    // fifo: first in, first out
    // weight: max weight, first out
    use = "weight"

    // max processes to be executed at the same time for this worker
    max_procs = 5
}


# sometimes you may need to control the `ProjectName/ProjectName/settings.py` file from here,
# so there is this special key whose contents are mounted into the `settings.py` file.
settings_py = <<PYTHON
# Scrapy settings for tstscrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tstscrapy'

SPIDER_MODULES = ['tstscrapy.spiders']
NEWSPIDER_MODULE = 'tstscrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tstscrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'tstscrapy.middlewares.TstscrapySpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'tstscrapy.middlewares.TstscrapyDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'tstscrapy.pipelines.TstscrapyPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DOWNLOAD_TIMEOUT = 10
PYTHON
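The settings_py key carries the file contents verbatim; conceptually, the engine just writes them over the project's settings module before running spiders. A minimal Go sketch, assuming the default layout produced by `scrapy startproject`:

package main

import (
	"os"
	"path/filepath"
)

// writeSettings overwrites <project_dir>/<project_name>/settings.py with
// the contents of the settings_py key. The layout assumed here is the
// default one generated by `scrapy startproject`; scrapyr's real
// behavior may differ in details.
func writeSettings(projectDir, projectName, settingsPy string) error {
	path := filepath.Join(projectDir, projectName, "settings.py")
	return os.WriteFile(path, []byte(settingsPy), 0o644)
}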

Install

you can download the latest binary build from the releases page, or use docker directly.
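For example, with docker (both the image name and the in-container config path below are assumptions; check the releases page for the published image and its documented mount point):

docker run -p 1993:1993 \
  -v $(pwd)/scrapyr.hcl:/etc/scrapyr/scrapyr.hcl \
  alash3al/scrapyr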

Contributing

  • Fork the repo
  • Create a feature branch
  • Push your changes
  • Create a pull request

License

Apache License v2.0

Author

alash3al