scrapy-rotating-proxies
=======================

.. image:: https://img.shields.io/pypi/v/scrapy-rotating-proxies.svg
   :target: https://pypi.python.org/pypi/scrapy-rotating-proxies
   :alt: PyPI Version

.. image:: https://travis-ci.org/TeamHG-Memex/scrapy-rotating-proxies.svg?branch=master
   :target: http://travis-ci.org/TeamHG-Memex/scrapy-rotating-proxies
   :alt: Build Status

.. image:: http://codecov.io/github/TeamHG-Memex/scrapy-rotating-proxies/coverage.svg?branch=master
   :target: http://codecov.io/github/TeamHG-Memex/scrapy-rotating-proxies?branch=master
   :alt: Code Coverage

This package provides a Scrapy_ middleware that rotates proxies for each request, checks that proxies are alive, and adjusts the crawling speed.

.. _Scrapy: https://scrapy.org/

License is MIT.

Installation
============

::

    pip install scrapy-rotating-proxies

Usage
=====

Add a ROTATING_PROXY_LIST option with a list of proxies to settings.py::

    ROTATING_PROXY_LIST = [
        'proxy1.com:8000',
        'proxy2.com:8031',
        # ...
    ]

As an alternative, you can specify a ROTATING_PROXY_LIST_PATH option with a path to a file containing proxies, one per line::

    ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'

ROTATING_PROXY_LIST_PATH takes precedence over ROTATING_PROXY_LIST if both options are present.
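The precedence rule can be sketched as a small helper (a hypothetical function for illustration, not part of the package):

```python
def resolve_proxy_list(settings):
    """Return the proxy list, preferring ROTATING_PROXY_LIST_PATH when set.

    Hypothetical sketch of the precedence rule described above;
    not the package's actual implementation.
    """
    path = settings.get('ROTATING_PROXY_LIST_PATH')
    if path:
        with open(path) as f:
            # one proxy per line; skip blank lines
            return [line.strip() for line in f if line.strip()]
    return list(settings.get('ROTATING_PROXY_LIST', []))
```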

Then add the rotating_proxies middlewares to your DOWNLOADER_MIDDLEWARES::

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        # ...
    }

After this, all requests will be proxied using one of the proxies from ROTATING_PROXY_LIST / ROTATING_PROXY_LIST_PATH.

Requests with "proxy" set in their meta are not handled by scrapy-rotating-proxies. To disable proxying for a request, set request.meta['proxy'] = None; to set a proxy explicitly, use request.meta['proxy'] = "<my-proxy-address>".
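The decision rule is simple: the middleware only assigns a proxy when the "proxy" key is absent from meta, so both an explicit proxy and an explicit None opt the request out of rotation. A sketch (hypothetical helper, not the middleware's actual code):

```python
def should_rotate_proxy(meta):
    """Return True when the middleware should pick a proxy for this request.

    Hypothetical sketch: a request already carrying a 'proxy' key in
    its meta (even with value None) is left alone.
    """
    return 'proxy' not in meta
```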

Concurrency
===========

By default, all standard Scrapy concurrency options (DOWNLOAD_DELAY, AUTOTHROTTLE_..., CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become per-proxy for proxied requests when RotatingProxyMiddleware is enabled. For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN=2, the spider will make at most 2 concurrent connections to each proxy, regardless of the request URL's domain.

Customization
=============

scrapy-rotating-proxies keeps track of working and non-working proxies, and re-checks the non-working ones from time to time.

Detection of a non-working proxy is site-specific. By default, scrapy-rotating-proxies uses a simple heuristic: if the response status code is not 200, the response body is empty, or an exception was raised, then the proxy is considered dead.
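The default heuristic can be sketched as follows (a hypothetical stand-alone function, not the package's actual code):

```python
def looks_dead(status=None, body=b'', exception=None):
    """Sketch of the default heuristic described above: any exception,
    a non-200 status, or an empty body marks the proxy as dead.
    Hypothetical helper for illustration only.
    """
    if exception is not None:
        return True
    return status != 200 or not body
```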

You can override the ban detection method by passing the path to a custom BanDetectionPolicy in the ROTATING_PROXY_BAN_POLICY option, e.g.::

    # settings.py
    ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyBanPolicy'

The policy must be a class with response_is_ban and exception_is_ban methods. These methods can return True (ban detected), False (not a ban) or None (unknown). It can be convenient to subclass and modify the default BanDetectionPolicy::

    # myproject/policy.py
    from rotating_proxies.policy import BanDetectionPolicy

    class MyPolicy(BanDetectionPolicy):
        def response_is_ban(self, request, response):
            # use the default rules, but also consider an HTTP 200 response
            # a ban if the word 'captcha' appears in the response body
            ban = super(MyPolicy, self).response_is_ban(request, response)
            ban = ban or b'captcha' in response.body
            return ban

        def exception_is_ban(self, request, exception):
            # override the method completely: don't take exceptions into account
            return None

Instead of creating a policy, you can also implement response_is_ban and exception_is_ban as spider methods, for example::

    class MySpider(scrapy.Spider):
        # ...

        def response_is_ban(self, request, response):
            return b'banned' in response.body

        def exception_is_ban(self, request, exception):
            return None

It is important to get these rules right, because the action taken for a failed page and for a bad proxy should differ: if the proxy is to blame, it makes sense to retry the request with a different proxy.

Non-working proxies can become alive again after some time. scrapy-rotating-proxies uses a randomized exponential backoff for these re-checks: the first check happens soon; if the proxy still fails, the next check is delayed further, and so on. Use ROTATING_PROXY_BACKOFF_BASE to adjust the initial delay (by default it is random, from 0 to 5 minutes). The randomized exponential backoff is capped by ROTATING_PROXY_BACKOFF_CAP.
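The shape of such a backoff can be sketched like this (an illustrative formula; the package's exact implementation may differ in its jitter and growth details):

```python
import random

def backoff_delay(attempt, base=300.0, cap=3600.0, rng=random):
    """Randomized exponential backoff sketch: the candidate delay doubles
    with each failed re-check (base * 2**attempt), is capped at `cap`,
    and a uniformly random value up to that candidate is returned.
    Defaults mirror ROTATING_PROXY_BACKOFF_BASE / _CAP (300 s / 3600 s).
    """
    delay = min(cap, base * 2 ** attempt)
    return rng.uniform(0, delay)
```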

Settings
========

  • ROTATING_PROXY_LIST - a list of proxies to choose from;

  • ROTATING_PROXY_LIST_PATH - path to a file with a list of proxies;

  • ROTATING_PROXY_LOGSTATS_INTERVAL - stats logging interval in seconds, 30 by default;

  • ROTATING_PROXY_CLOSE_SPIDER - when True, the spider is stopped if there are no alive proxies. If False (default), then when there are no alive proxies all dead proxies are re-checked.

  • ROTATING_PROXY_PAGE_RETRY_TIMES - the number of times to retry downloading a page using a different proxy. After this many retries, the failure is considered a page failure, not a proxy failure. Think of it this way: every improperly detected ban costs you ROTATING_PROXY_PAGE_RETRY_TIMES alive proxies. Default: 5.

    It is possible to change this option per-request using the max_proxies_to_try request.meta key - for example, you can use a higher value for certain pages if you're sure they should work.

  • ROTATING_PROXY_BACKOFF_BASE - base backoff time, in seconds. Default is 300 (i.e. 5 min).

  • ROTATING_PROXY_BACKOFF_CAP - backoff time cap, in seconds. Default is 3600 (i.e. 60 min).

  • ROTATING_PROXY_BAN_POLICY - path to a ban detection policy. Default is 'rotating_proxies.policy.BanDetectionPolicy'.
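Putting several of these options together, a settings.py fragment might look like this (all values are illustrative, and the path is a placeholder):

```python
# settings.py -- illustrative values only
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
ROTATING_PROXY_LOGSTATS_INTERVAL = 60      # log proxy stats every minute
ROTATING_PROXY_CLOSE_SPIDER = False        # re-check dead proxies instead of stopping
ROTATING_PROXY_PAGE_RETRY_TIMES = 5        # retries with different proxies per page
ROTATING_PROXY_BACKOFF_BASE = 300          # base re-check delay, seconds
ROTATING_PROXY_BACKOFF_CAP = 3600          # max re-check delay, seconds
```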

FAQ
===

Q: Where to get proxy lists? How to write and maintain ban rules?

A: It is up to you to find proxies and maintain proper ban rules for websites; scrapy-rotating-proxies doesn't have anything built in. There are commercial proxy services like https://crawlera.com/ which can integrate with Scrapy (see https://github.com/scrapy-plugins/scrapy-crawlera) and take care of these details.

Contributing
============

To run tests, install tox_ and run tox from the source checkout.

.. _tox: https://tox.readthedocs.io/en/latest/


.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg
   :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=scrapy-rotating-proxies
   :alt: define hyperiongray
