
groupbwt / scrapy-boilerplate

License: MIT
Scrapy project boilerplate done right

Programming Languages

python
typescript
javascript
shell

Projects that are alternatives of or similar to scrapy-boilerplate

scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
Stars: ✭ 17 (-43.33%)
Mutual labels:  scrapy
JD Spider
👍 JD.com crawler (heavily commented, extremely friendly to crawler beginners)
Stars: ✭ 56 (+86.67%)
Mutual labels:  scrapy
python-crawler
A repository for learning web crawling, suitable for complete beginners and newcomer-friendly
Stars: ✭ 37 (+23.33%)
Mutual labels:  scrapy
small-spider-project
Day-to-day crawlers
Stars: ✭ 14 (-53.33%)
Mutual labels:  scrapy
www job com
Crawls job postings from Lagou, BOSS Zhipin, Zhaopin, 51job, Ganji, 58.com, and more
Stars: ✭ 47 (+56.67%)
Mutual labels:  scrapy
project pjx
Building a search engine with a distributed Python crawler
Stars: ✭ 42 (+40%)
Mutual labels:  scrapy
Raspagem-de-dados-para-iniciantes
Data scraping for beginners using Scrapy and other basic libs
Stars: ✭ 113 (+276.67%)
Mutual labels:  scrapy
InstaBot
Simple and friendly Bot for Instagram, using Selenium and Scrapy with Python.
Stars: ✭ 32 (+6.67%)
Mutual labels:  scrapy
ancient chinese
Ancient Chinese (Classical Chinese) dictionary: crawls a Classical Chinese dictionary site to build a Kindle dictionary.
Stars: ✭ 48 (+60%)
Mutual labels:  scrapy
Autohome
Using Scrapy to crawl Autohome, storing into MongoDB; simple analysis and NLP coming soon
Stars: ✭ 23 (-23.33%)
Mutual labels:  scrapy
scrapy-cloudflare-middleware
A Scrapy middleware to bypass Cloudflare's anti-bot protection
Stars: ✭ 84 (+180%)
Mutual labels:  scrapy
ufc fight predictor
UFC bout winner prediction using neural nets.
Stars: ✭ 22 (-26.67%)
Mutual labels:  scrapy
scrapy spider
No description or website provided.
Stars: ✭ 58 (+93.33%)
Mutual labels:  scrapy
Scrapy-SearchEngines
Bing, Google, and Baidu search engine crawlers. Python 3.6 and Scrapy
Stars: ✭ 28 (-6.67%)
Mutual labels:  scrapy
NScrapy
NScrapy is a .NET Core cross-platform distributed spider framework which provides an easy way to write your own spider
Stars: ✭ 88 (+193.33%)
Mutual labels:  scrapy
easypoi
A simple, free, and efficient Baidu Maps POI collection and analysis tool.
Stars: ✭ 87 (+190%)
Mutual labels:  scrapy
aioScrapy
An asynchronous coroutine crawler framework based on asyncio and aiohttp. Stars welcome
Stars: ✭ 34 (+13.33%)
Mutual labels:  scrapy
invana-bot
A web crawler that scrapes using YAML and Python code.
Stars: ✭ 30 (+0%)
Mutual labels:  scrapy
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (+73.33%)
Mutual labels:  scrapy
ScrapyProject
Scrapy project (MySQL + MongoDB, Douban Top 250 movies)
Stars: ✭ 18 (-40%)
Mutual labels:  scrapy

scrapy-boilerplate

This is a boilerplate for new Scrapy projects.

The project is a WIP, so expect major changes and additions (mostly the latter). The master branch should be considered always ready to use, with major changes/features introduced in feature branches.

Features

  • Python 3.6+
  • Poetry for dependency management
  • SQLAlchemy ORM with alembic migrations
  • RabbitMQ integrated via pika
  • configuration via ENV variables and/or a .env file
  • a single file for each class
  • code generation scripts for classes: spiders, pipelines, etc. (see this section)
  • Black to ensure code style consistency (see here)
  • Docker-ready (see here)
  • PM2-ready (see here)
  • supports single-IP/rotating proxy config out of the box (see here)

Installation

To create a new project using this boilerplate, you need to:

  1. Clone the repository.
  2. Run the installation script: ./install.sh
  3. ???
  4. PROFIT!
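
For example (the repository URL and target directory below are assumptions; adjust them to your setup):

    git clone https://github.com/groupbwt/scrapy-boilerplate.git my-project
    cd my-project
    ./install.sh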

Usage

The boilerplate comes with some pre-written classes, helper scripts, and functions, which are described in this section.

Code generation

There is a scrapy command to generate class files and automatically add imports to the __init__ files.

The command is part of a separate package. This repository contains the code of the command and the default templates used for generation.

It can be used as follows:

scrapy new spider SampleSpider

The first argument (spider) is the type of class file to be generated, and can be one of the following:

  • command
  • extension
  • item
  • middleware
  • model
  • pipeline
  • spider_middleware
  • spider

The second argument is the class name.

For pipeline and spider classes, the --rabbit option can be used to add RabbitMQ connection code to the generated source.

The --item option is supported when generating pipelines; it adds an import and a type check for the provided item class to the resulting code.
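
For example, generating a pipeline wired to RabbitMQ and type-checked against an item class might look like this (the exact flag syntax, in particular whether --item takes the class name as a value, is an assumption; check the command's help output):

    scrapy new pipeline SamplePipeline --rabbit --item SampleItem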

Docker

The project includes Dockerfiles and docker-compose configuration for running your spiders in containers.

A configuration for a default RabbitMQ server is also included.

Dockerfiles are located inside the docker subdirectory, and docker-compose.yml is at the root of the project. You might want to change the CMD of the scrapy container to something more relevant to your project. To do so, edit docker/scrapy/Dockerfile.

Docker-compose takes configuration values from ENV. The environment can also be provided by creating a .env file at the root of the project (see .docker_env.example for a sample). Creating the dotenv file for Docker is handled by the install.sh script by default.
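
A typical workflow might look like the following (the service names are assumptions based on the containers described above; check docker-compose.yml for the actual names):

    cp .docker_env.example .env
    docker-compose up -d rabbitmq
    docker-compose up scrapy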

Black

Black is the uncompromising Python code formatter. It is used in this project to ensure code style consistency in the least intrusive fashion.

Black is included in the project's dev-dependencies. A pre-commit hook for running autoformatting is also included, via the pre-commit tool. It is installed automatically if you run install.sh. Otherwise, to use it you need to run pre-commit install in the root project folder after installing pre-commit itself.
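
If you are setting the hook up manually, the steps are roughly (assuming pre-commit is installed via pip; use your preferred installer):

    pip install pre-commit
    pre-commit install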

PM2

This boilerplate contains a sample PM2 config file along with a bash startup script that sets up all the necessary environment to run scrapy with this process manager.

All you need to do is copy/edit src/pm2/commands/command_example.sh and change the exec part to the command you actually need to run, then create a process.json ecosystem file (based on src/pm2/process.example.json) to start the script.

Then, cd to src/pm2 and run pm2 start process.json.
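
A minimal process.json might look like the sketch below (the app name and script path are placeholders for illustration; base yours on src/pm2/process.example.json):

    {
      "apps": [
        {
          "name": "sample-spider",
          "script": "./commands/command_example.sh"
        }
      ]
    }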

Proxy middleware

A Scrapy downloader middleware for using a proxy server is included in src/middlewares/HttpProxyMiddleware.py and is enabled by default. You can use it by providing the proxy endpoint via the PROXY env variable (or in the .env file) in the format host:port. Proxy authentication can also be provided in the PROXY_AUTH variable, using the format user:password. If provided, it is encoded as Basic HTTP auth and put into the Proxy-Authorization header.
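
For example, a .env file for an authenticated proxy might contain (host, port, and credentials below are placeholders):

    PROXY=proxy.example.com:8080
    PROXY_AUTH=user:password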

A single-endpoint proxy is used by default, assuming the use of a rotating-proxy service. If you want to provide your own list of proxies, an external package has to be used, as this use case is not yet covered by this boilerplate.

File and folder structure

This boilerplate offers a more intuitive alternative to Scrapy's default project structure: the file/directory structure is flattened and rearranged a bit.

  • All scrapy-related code is placed directly in the src subdirectory (without any subdirectory named after the project, contrary to the default).
  • All scrapy classes (by default located in items.py, middlewares.py, pipelines.py) are converted to sub-modules, where each class is placed in its own separate file. Nothing else goes into those files. Helper functions/modules can be placed in the helpers module.
  • Configs in scrapy.cfg and settings.py are edited to correspond with these changes.
  • Additional subdirectories are added to contain code related to working with the database (src/database) and RabbitMQ (src/rabbitmq), plus the accessory directory src/_templates, which contains the templates for code generation (see the "new" command section). A sketch of the resulting layout is shown below.
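
The resulting layout looks roughly like this (an illustrative sketch assembled from the directories mentioned in this README; the actual tree may contain more entries):

    docker/
    docker-compose.yml
    src/
        _templates/    templates for the "new" code generation command
        database/      database-related code
        helpers/       helper functions/modules
        items/         one class per file
        middlewares/   e.g. HttpProxyMiddleware.py
        pipelines/     one class per file
        pm2/           PM2 config and startup scripts
        rabbitmq/      RabbitMQ-related code
        settings.py
    scrapy.cfg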