matejbasic / PythonScrapyBasicSetup

License: MIT

Basic setup with random user agents and IP addresses for Python Scrapy Framework.

Projects that are alternatives of or similar to PythonScrapyBasicSetup

A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)

Stars: ✭ 24 (-57.89%)

Mutual labels: scraping, tor

Autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Stars: ✭ 4,077 (+7052.63%)

Mutual labels: scraping, web-scraping

raspagem-de-dados-fatec

📓 Short course on web data scraping with Python, taught at the FATEC Jundiaí Technology Week

Stars: ✭ 22 (-61.4%)

Mutual labels: scraping, web-scraping

torchestrator

Spin up Tor containers and then proxy HTTP requests via these Tor instances

Stars: ✭ 32 (-43.86%)

Mutual labels: scraping, tor

Humanoid

Node.js package to bypass CloudFlare's anti-bot JavaScript challenges

Stars: ✭ 88 (+54.39%)

Mutual labels: scraping, web-scraping

top-github-scraper

Scrape top GitHub repositories and users based on keywords

Stars: ✭ 40 (-29.82%)

Mutual labels: scraping, web-scraping

Gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Stars: ✭ 277 (+385.96%)

Mutual labels: scraping, web-scraping

trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Stars: ✭ 711 (+1147.37%)

Mutual labels: scraping, web-scraping

Detect Cms

PHP Library for detecting CMS

Stars: ✭ 78 (+36.84%)

Mutual labels: scraping, web-scraping

Scrapple

A framework for creating semi-automatic web content extractors

Stars: ✭ 464 (+714.04%)

Mutual labels: scraping, web-scraping

browser-pool

A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.

Stars: ✭ 71 (+24.56%)

Mutual labels: scraping, web-scraping

Phpscraper

PHP Scraper - a highly opinionated web-interface for PHP

Stars: ✭ 148 (+159.65%)

Mutual labels: scraping, web-scraping

selectorlib

A library to read a YML file with Xpath or CSS Selectors and extract data from HTML pages using them

Stars: ✭ 53 (-7.02%)

Mutual labels: scraping, web-scraping

papercut

Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.

Stars: ✭ 15 (-73.68%)

Mutual labels: scraping, web-scraping

ioweb

Web Scraping Framework

Stars: ✭ 31 (-45.61%)

Mutual labels: scraping, web-scraping

Apify Js

Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.

Stars: ✭ 3,154 (+5433.33%)

Mutual labels: scraping, web-scraping

IMDB-Scraper

Scrapy project for scraping data from IMDB with Movie Dataset including 58,623 movies' data.

Stars: ✭ 37 (-35.09%)

Mutual labels: web-scraping, scrapy-framework

Katana

A Python Tool For google Hacking

Stars: ✭ 355 (+522.81%)

Mutual labels: scraping, tor

Sqrape

Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)

Stars: ✭ 144 (+152.63%)

Mutual labels: scraping, web-scraping

Scrape Linkedin Selenium

`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.

Stars: ✭ 239 (+319.3%)

Mutual labels: scraping, web-scraping

PythonScrapyBasicSetup

Basic setup with random user agents and proxy addresses for Python Scrapy Framework.

Setup

1. Install Scrapy Framework

pip install Scrapy

Detailed installation guide

2. Install Beautiful Soup 4 for proxy middleware based on proxydocker lists

pip install beautifulsoup4

Detailed installation guide

3. Install Tor, Stem (controller library for Tor), and Privoxy (HTTP proxy server).

apt-get install tor python-stem privoxy

Hash a password with Tor:

tor --hash-password secretPassword

Then copy the hashed password and add it, together with the control port, to /etc/tor/torrc:

ControlPort 9051
HashedControlPassword 16:72C8ADB0E34F8DA1606BB154886604F708236C0D0A54557A07B00CAB73

Restart Tor:

sudo /etc/init.d/tor restart
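
One way to confirm that the control port and password from torrc are accepted is to send Tor a NEWNYM signal, which asks it to use fresh circuits for new connections. The sketch below is not taken from this project: it speaks the Tor control protocol directly over a socket using only the standard library, and the function name request_new_tor_identity is made up for illustration. It assumes the ControlPort 9051 shown above and the plain-text password that was hashed (secretPassword in the example). In practice, Stem's Controller class wraps the same exchange.

```python
import socket


def request_new_tor_identity(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Authenticate on Tor's control port and send SIGNAL NEWNYM.

    Returns True if Tor answered with a 250 status to both commands.
    The default port is an assumption matching the torrc example above.
    """
    # The control protocol takes the password as a quoted string,
    # so backslashes and double quotes have to be escaped.
    quoted = password.replace("\\", "\\\\").replace('"', '\\"')
    commands = ['AUTHENTICATE "%s"' % quoted, "SIGNAL NEWNYM"]

    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rb") as reader:
            for command in commands:
                conn.sendall(command.encode("utf-8") + b"\r\n")
                reply = reader.readline().decode("utf-8", "replace").strip()
                # Any status other than 250 means the command was rejected.
                if not reply.startswith("250"):
                    return False
    return True
```

Calling request_new_tor_identity("secretPassword") on the machine running Tor should return True if the password and port match torrc, and False if Tor rejects the authentication.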

Enable Privoxy forwarding by adding the following line to /etc/privoxy/config:

forward-socks5 / localhost:9050 .

Restart Privoxy:

sudo /etc/init.d/privoxy restart

Both Tor and Privoxy should now be up and running (check with netstat -l). If you used a different password or control port, update settings.py accordingly.
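
As an alternative to reading netstat output, the same check can be scripted. This is a minimal sketch, not part of the project: the names SERVICES, is_listening and check_services are hypothetical. It assumes Tor's SOCKS port 9050 and control port 9051 from the configuration above, and Privoxy's default listen port 8118, which is an assumption if /etc/privoxy/config sets a different listen-address.

```python
import socket

# Assumed ports: 9050 and 9051 come from the setup above,
# 8118 is Privoxy's default listen port.
SERVICES = {"Tor SOCKS": 9050, "Tor control": 9051, "Privoxy": 8118}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(port) for name, port in services.items()}
```

Calling check_services() returns a dictionary keyed by service name; a False value suggests that service is not running or listens on a different port.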

If you get errors related to pyOpenSSL (check this issue), try downgrading Twisted:

pip install Twisted==16.4.1

Usage

To see what it does, just run:

python run.py

The project contains three middleware classes in the middlewares directory.

ProxyMiddleware downloads proxy IP addresses and randomly picks one before every request is processed.

TorMiddleware has a similar purpose, but it relies on the Tor network.

RandomUserAgentMiddleware downloads user agent strings and saves them into the 'USER_AGENT_LIST' settings list. It also randomly selects one UA before every request is processed.

Middlewares are activated in the settings.py file.

The project also contains two spiders just for testing purposes, spiders/iptester.py and spiders/uatester.py. You can run them individually:

scrapy crawl UAtester
scrapy crawl IPtester
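
To make the per-request rotation described above more concrete, here is a minimal sketch of two downloader middlewares. It is not the project's actual code: the class names RandomUserAgentSketch and RandomProxySketch are hypothetical, and unlike the real middlewares these classes do not download anything, they only pick from lists they are given. The sketch relies on Scrapy's convention that a downloader middleware exposes process_request(request, spider), and on the fact that Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy'].

```python
import random


class RandomUserAgentSketch:
    """Pick a random User-Agent header for every outgoing request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Assumes USER_AGENT_LIST has already been filled in the settings.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None


class RandomProxySketch:
    """Pick a random proxy URL for every outgoing request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's HttpProxyMiddleware reads the proxy from request.meta.
            request.meta["proxy"] = random.choice(self.proxies)
        return None
```

Classes like these would be enabled through the DOWNLOADER_MIDDLEWARES dictionary in settings.py, keyed by their import path.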

The run.py file is also a good example of how to include and run your spiders sequentially from one script.
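
For readers who want to see the general pattern without opening run.py, the sketch below follows the approach from the Scrapy documentation for running spiders one after another with CrawlerRunner. It is an illustration under assumptions, not a copy of this project's run.py: the function name run_spiders_sequentially is hypothetical, and the script is assumed to be started from the project root so that get_project_settings() can find settings.py.

```python
def run_spiders_sequentially(spider_names):
    """Run the named spiders one after another in a single process."""
    # Imported inside the function because importing the Twisted reactor
    # installs it globally as a side effect.
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        try:
            for name in spider_names:
                # runner.crawl() returns a Deferred; yielding it waits
                # until that spider has finished before starting the next.
                yield runner.crawl(name)
        finally:
            reactor.stop()

    crawl()
    reactor.run()  # blocks until reactor.stop() is called
```

Calling run_spiders_sequentially(["UAtester", "IPtester"]) would run the two test spiders in that order.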

If you have any questions or problems, feel free to create a new issue. Scrape responsibly!
