
.. image:: https://travis-ci.org/alecxe/scrapy-fake-useragent.svg?branch=master
    :target: https://travis-ci.org/alecxe/scrapy-fake-useragent

.. image:: https://codecov.io/gh/alecxe/scrapy-fake-useragent/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/alecxe/scrapy-fake-useragent

.. image:: https://img.shields.io/pypi/pyversions/scrapy-fake-useragent.svg
    :target: https://pypi.python.org/pypi/scrapy-fake-useragent
    :alt: PyPI version

.. image:: https://badge.fury.io/py/scrapy-fake-useragent.svg
    :target: http://badge.fury.io/py/scrapy-fake-useragent
    :alt: PyPI version

.. image:: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements.svg?branch=master
    :target: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements/?branch=master
    :alt: Requirements Status

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
    :target: https://github.com/alecxe/scrapy-fake-useragent/blob/master/LICENSE.txt
    :alt: Package license

scrapy-fake-useragent

Random User-Agent middleware for the Scrapy scraping framework, based on `fake-useragent <https://pypi.python.org/pypi/fake-useragent>`__, which picks User-Agent strings according to `usage statistics <http://www.w3schools.com/browsers/browsers_stats.asp>`__ from a `real-world database <http://useragentstring.com/>`__. As a backup, it can also be configured to generate fake UA strings with `Faker <https://faker.readthedocs.io/en/stable/providers/faker.providers.user_agent.html>`__.

You can also extend the middleware by adding your own providers.

Changes

Please see CHANGELOG_.

Installation

The simplest way is to install it via pip:

.. code:: bash

    pip install scrapy-fake-useragent

Configuration

Turn off the built-in UserAgentMiddleware and RetryMiddleware and add RandomUserAgentMiddleware and RetryUserAgentMiddleware.

In Scrapy >=1.0:

.. code:: python

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
    }

In Scrapy <1.0:

.. code:: python

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
    }

Recommended setting (1.3.0+):

.. code:: python

    FAKEUSERAGENT_PROVIDERS = [
        # First provider to try.
        'scrapy_fake_useragent.providers.FakeUserAgentProvider',
        # If FakeUserAgentProvider fails, use Faker to generate a User-Agent string.
        'scrapy_fake_useragent.providers.FakerProvider',
        # Fall back to the USER_AGENT value.
        'scrapy_fake_useragent.providers.FixedUserAgentProvider',
    ]
    USER_AGENT = '<your user agent string which you will fall back to if all other providers fail>'

Additional configuration information

Enabling providers

The package comes with a thin abstraction layer of User-Agent providers, which for backwards compatibility defaults to:

.. code:: python

    FAKEUSERAGENT_PROVIDERS = [
        'scrapy_fake_useragent.providers.FakeUserAgentProvider'
    ]

The package also ships with FakerProvider (powered by the `Faker library <https://faker.readthedocs.io/>`__) and FixedUserAgentProvider, which are available for use if needed.

Each provider is enabled individually, and providers are used in the order in which they are defined. If a provider fails to execute (for instance, this `can happen <https://github.com/hellysmile/fake-useragent/issues/99>`__ to fake-useragent because of its dependency on an online service), the next one is used.
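The ordered fallback described above can be pictured with a small standalone sketch. This is not the middleware's actual code: the function ``first_working_user_agent`` and the two toy providers are hypothetical names invented for illustration, and the real package may handle failures differently.

```python
from typing import Callable, List, Optional


def first_working_user_agent(providers: List[Callable[[], str]]) -> Optional[str]:
    """Return the UA from the first provider that succeeds, or None."""
    for provider in providers:
        try:
            user_agent = provider()
        except Exception:
            # This provider failed (for example, its online data source
            # is down), so move on to the next one in the list.
            continue
        if user_agent:
            return user_agent
    return None


def online_provider() -> str:
    # Hypothetical provider simulating an unreachable online service.
    raise ConnectionError("user-agent database unreachable")


def fixed_provider() -> str:
    # Hypothetical last-resort provider returning a constant string.
    return "Mozilla/5.0 (compatible; example-bot)"


chosen = first_working_user_agent([online_provider, fixed_provider])
print(chosen)
```

Because the list order decides which provider is tried first, a provider that cannot fail (such as a fixed string) is best placed last.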

Here is an example of what the FAKEUSERAGENT_PROVIDERS setting may look like in your case:

.. code:: python

    FAKEUSERAGENT_PROVIDERS = [
        'scrapy_fake_useragent.providers.FakeUserAgentProvider',
        'scrapy_fake_useragent.providers.FakerProvider',
        'scrapy_fake_useragent.providers.FixedUserAgentProvider',
        'mypackage.providers.CustomProvider'
    ]
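As a rough idea of what a custom provider such as ``mypackage.providers.CustomProvider`` could look like, here is a minimal standalone sketch. It assumes that a provider is constructed with the Scrapy settings and exposes a ``get_random_ua()`` method; both of these, as well as the ``MY_USER_AGENTS`` setting name, are assumptions made for this example, so check the package source for the actual base class and interface before relying on them.

```python
import random


class CustomProvider:
    """Hypothetical provider that picks a User-Agent from a list in the settings."""

    def __init__(self, settings):
        # MY_USER_AGENTS is a made-up setting name used only in this sketch.
        self.user_agents = list(settings.get("MY_USER_AGENTS", []))

    def get_random_ua(self):
        # Assumed interface: return one User-Agent string, or raise so that
        # the next provider in FAKEUSERAGENT_PROVIDERS can take over.
        if not self.user_agents:
            raise ValueError("MY_USER_AGENTS is empty")
        return random.choice(self.user_agents)


# A plain dict stands in for the Scrapy settings object here.
example_settings = {
    "MY_USER_AGENTS": [
        "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
        "Mozilla/5.0 (Android; Mobile; rv:40.0)",
    ]
}
provider = CustomProvider(example_settings)
print(provider.get_random_ua())
```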

Configuring fake-useragent

Parameter: FAKE_USERAGENT_RANDOM_UA_TYPE, which defaults to random.

Other options include, for example:

  • firefox to mimic only Firefox browsers
  • msie to mimic Internet Explorer only
  • etc.
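For instance, restricting the middleware to Firefox-like strings would be a one-line change in the project's ``settings.py``. This is only a settings fragment, using the ``firefox`` value from the list above:

```python
# settings.py (fragment): only mimic Firefox browsers.
FAKE_USERAGENT_RANDOM_UA_TYPE = 'firefox'
```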

You can also set the FAKEUSERAGENT_FALLBACK option, which is a fake-useragent specific fallback. For example:

.. code:: python

    FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (Android; Mobile; rv:40.0)'

If the selected FAKE_USERAGENT_RANDOM_UA_TYPE fails to retrieve a UA, the value set in FAKEUSERAGENT_FALLBACK is used instead.

Configuring faker

Parameter: FAKER_RANDOM_UA_TYPE, which defaults to user_agent and selects fully random User-Agent values. Other options include, for example:

  • chrome
  • firefox
  • safari
  • etc. (please refer to the `Faker UserAgent provider documentation <https://faker.readthedocs.io/en/master/providers/faker.providers.user_agent.html>`_ for the available options)
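As a settings fragment, limiting the Faker-based provider to Chrome-like strings might look like the following. It assumes FakerProvider is listed in FAKEUSERAGENT_PROVIDERS, since the parameter only affects that provider:

```python
# settings.py (fragment): enable the Faker-based provider and
# ask it for Chrome-like User-Agent strings only.
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakerProvider',
]
FAKER_RANDOM_UA_TYPE = 'chrome'
```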

Configuring FixedUserAgent

The package also comes with a fixed provider, which supplies a single user agent by reusing the value of Scrapy's USER_AGENT setting.

Usage with scrapy-proxies

To use this package together with a random-proxy middleware such as `scrapy-proxies <https://github.com/aivarsk/scrapy-proxies>`_, you need to:

  1. Set RANDOM_UA_PER_PROXY to True to allow switching the User-Agent per proxy.

  2. Set the priority of RandomUserAgentMiddleware higher than that of scrapy-proxies, so that the proxy is set before the User-Agent is handled.
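The two steps above could translate into a settings fragment along these lines. The ``scrapy_proxies.RandomProxy`` path and its priority of 100 are assumptions based on how scrapy-proxies is commonly configured; consult that project's documentation for the exact values and for its own required settings.

```python
# settings.py (fragment): rotate the User-Agent per proxy.
RANDOM_UA_PER_PROXY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Assumed middleware path and priority for scrapy-proxies.
    'scrapy_proxies.RandomProxy': 100,
    # Larger number than the proxy middleware, so the proxy is set first.
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```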

License

The package is under MIT license. Please see LICENSE_.

.. |GitHub version| image:: https://badge.fury.io/gh/alecxe%2Fscrapy-fake-useragent.svg
    :target: http://badge.fury.io/gh/alecxe%2Fscrapy-fake-useragent
.. |Requirements Status| image:: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements.svg?branch=master
    :target: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements/?branch=master
.. _LICENSE: https://github.com/alecxe/scrapy-fake-useragent/blob/master/LICENSE.txt
.. _CHANGELOG: https://github.com/alecxe/scrapy-fake-useragent/blob/master/CHANGELOG.rst
