matejbasic / PythonScrapyBasicSetup

Licence: MIT license
Basic setup with random user agents and IP addresses for Python Scrapy Framework.


Projects that are alternatives of or similar to PythonScrapyBasicSetup

TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-57.89%)
Mutual labels:  scraping, tor
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+7052.63%)
Mutual labels:  scraping, web-scraping
raspagem-de-dados-fatec
📓 Mini-course on web data scraping with Python, taught at the FATEC Jundiaí Technology Week
Stars: ✭ 22 (-61.4%)
Mutual labels:  scraping, web-scraping
torchestrator
Spin up Tor containers and then proxy HTTP requests via these Tor instances
Stars: ✭ 32 (-43.86%)
Mutual labels:  scraping, tor
Humanoid
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Stars: ✭ 88 (+54.39%)
Mutual labels:  scraping, web-scraping
top-github-scraper
Scrape top GitHub repositories and users based on keywords
Stars: ✭ 40 (-29.82%)
Mutual labels:  scraping, web-scraping
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+385.96%)
Mutual labels:  scraping, web-scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1147.37%)
Mutual labels:  scraping, web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (+36.84%)
Mutual labels:  scraping, web-scraping
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+714.04%)
Mutual labels:  scraping, web-scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+24.56%)
Mutual labels:  scraping, web-scraping
Phpscraper
PHP Scraper - a highly opinionated web-interface for PHP
Stars: ✭ 148 (+159.65%)
Mutual labels:  scraping, web-scraping
selectorlib
A library to read a YML file with Xpath or CSS Selectors and extract data from HTML pages using them
Stars: ✭ 53 (-7.02%)
Mutual labels:  scraping, web-scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-73.68%)
Mutual labels:  scraping, web-scraping
ioweb
Web Scraping Framework
Stars: ✭ 31 (-45.61%)
Mutual labels:  scraping, web-scraping
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+5433.33%)
Mutual labels:  scraping, web-scraping
IMDB-Scraper
Scrapy project for scraping data from IMDB with Movie Dataset including 58,623 movies' data.
Stars: ✭ 37 (-35.09%)
Mutual labels:  web-scraping, scrapy-framework
Katana
A Python Tool For google Hacking
Stars: ✭ 355 (+522.81%)
Mutual labels:  scraping, tor
Sqrape
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Stars: ✭ 144 (+152.63%)
Mutual labels:  scraping, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+319.3%)
Mutual labels:  scraping, web-scraping

PythonScrapyBasicSetup

Basic setup with random user agents and proxy addresses for Python Scrapy Framework.

Setup

1. Install Scrapy Framework
pip install Scrapy

Detailed installation guide

2. Install Beautiful Soup 4 for proxy middleware based on proxydocker lists
pip install beautifulsoup4

Detailed installation guide

3. Install Tor, Stem (controller library for Tor), and Privoxy (HTTP proxy server).
apt-get install tor python-stem privoxy

Hash a password with Tor:

tor --hash-password secretPassword

Then copy the hashed password and add it, together with the control port, to /etc/tor/torrc:

ControlPort 9051
HashedControlPassword 16:72C8ADB0E34F8DA1606BB154886604F708236C0D0A54557A07B00CAB73

Restart Tor:

sudo /etc/init.d/tor restart

Enable Privoxy forwarding by adding the following line to /etc/privoxy/config:

forward-socks5 / localhost:9050 .

Restart Privoxy:

sudo /etc/init.d/privoxy restart

Both Tor and Privoxy should now be up and running (check with netstat -l). If you used a different password or control port, update settings.py.
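As an alternative to reading the netstat output by hand, the check can be sketched in Python with only the standard library. The port numbers are assumptions: 9050 and 9051 are the Tor SOCKS and control ports used in the configuration above, and 8118 is Privoxy's usual default listen port, so adjust them if your setup differs. The names SERVICES, is_listening and check_services are made up for this example and are not part of the project.

```python
import socket

# Assumed ports: Tor SOCKS (9050), Tor control (9051, as set in torrc above),
# and Privoxy (8118, its usual default). Change these to match your config.
SERVICES = {"tor-socks": 9050, "tor-control": 9051, "privoxy": 8118}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_services(services=None):
    """Map each service name to whether its port accepts connections."""
    services = SERVICES if services is None else services
    return {name: is_listening(port) for name, port in services.items()}
```

Calling check_services() returns a dictionary such as {"tor-socks": True, ...}; any False entry points at a service that is not accepting connections on the expected port.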

If you get errors related to pyOpenSSL (check this issue), try downgrading Twisted:

pip install Twisted==16.4.1

Usage

To see what it does, just run:

python run.py

The project contains three middleware classes in the middlewares directory. ProxyMiddleware downloads a list of proxy IP addresses and picks one at random before each request is processed. TorMiddleware serves a similar purpose but relies on the Tor network. RandomUserAgentMiddleware downloads user agent strings, saves them into the 'USER_AGENT_LIST' settings list, and likewise picks one at random before each request is processed. The middlewares are activated in the settings.py file.

The project also contains two spiders for testing purposes, spiders/iptester.py and spiders/uatester.py. You can run them individually:

scrapy crawl UAtester
scrapy crawl IPtester
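The per-request random selection described for the middlewares can be sketched as follows. This is a minimal illustration, not the project's actual code: the class names RandomUserAgentSketch and RandomProxySketch are hypothetical, and the downloading of user agent and proxy lists is left out. It relies on two Scrapy conventions: a downloader middleware exposes process_request(request, spider), and Scrapy's built-in HttpProxyMiddleware reads the proxy URL from request.meta["proxy"].

```python
import random


class RandomUserAgentSketch:
    """Downloader-middleware-style class that sets a random User-Agent."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through from_crawler(); USER_AGENT_LIST
        # is the settings key mentioned in this README.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None lets Scrapy continue handling the request


class RandomProxySketch:
    """Downloader-middleware-style class that sets a random proxy."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        if self.proxies:
            # HttpProxyMiddleware picks the proxy up from request.meta.
            request.meta["proxy"] = random.choice(self.proxies)
        return None
```

Returning None from process_request leaves the request on its normal path through the remaining middlewares, so each class only decorates the request and never short-circuits the download.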

The run.py file is also a good example of how to include and run your spiders sequentially from one script.
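The contents of run.py are not reproduced here, so the following is only one simple way to get the same sequential behaviour, and not necessarily how run.py does it: shell out to the scrapy crawl command for each spider in turn. The helper names build_command and run_sequentially are hypothetical, and the script is assumed to be started from the project directory so that Scrapy can find its settings.

```python
import subprocess

# Spider names taken from the crawl commands shown above.
SPIDERS = ["UAtester", "IPtester"]


def build_command(spider_name):
    """Return the CLI invocation for one spider."""
    return ["scrapy", "crawl", spider_name]


def run_sequentially(spider_names, run=subprocess.run):
    """Run each spider to completion before starting the next.

    Returns a dict mapping spider name to the process exit code.
    The `run` parameter exists only so the function can be tried
    without launching real processes.
    """
    exit_codes = {}
    for name in spider_names:
        result = run(build_command(name))  # blocks until the crawl ends
        exit_codes[name] = result.returncode
    return exit_codes
```

Calling run_sequentially(SPIDERS) starts the second crawl only after the first process has exited. Running spiders in-process through Scrapy's crawler API avoids the process start-up cost, at the price of managing the Twisted reactor yourself.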

If you have any questions or problems, feel free to create a new issue. Scrape responsibly!
