AccordBox / Awesome Scrapy

A curated list of awesome packages, articles, and other cool resources from the Scrapy community.

Projects that are alternatives of or similar to Awesome Scrapy

ptt-web-crawler
Crawler for the web version of PTT (a Taiwanese bulletin board system)
Stars: ✭ 20 (-94.44%)
Mutual labels:  scrapy
python-Reptile
python-Reptile
Stars: ✭ 31 (-91.39%)
Mutual labels:  scrapy
Alltheplaces
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Stars: ✭ 277 (-23.06%)
Mutual labels:  scrapy
memes-api
API for scraping common meme sites
Stars: ✭ 17 (-95.28%)
Mutual labels:  scrapy
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-81.11%)
Mutual labels:  scrapy
Douban Crawler
A crawler for https://douban.com
Stars: ✭ 13 (-96.39%)
Mutual labels:  scrapy
dannyAVgleDownloader
Downloader for the well-known site Avgle
Stars: ✭ 27 (-92.5%)
Mutual labels:  scrapy
Elves
🎊 Design and implementation of a lightweight crawler framework.
Stars: ✭ 315 (-12.5%)
Mutual labels:  scrapy
PttImageSpider
PTT image downloader: scrapes every image from a board, using the article title as the folder name (built with Scrapy)
Stars: ✭ 16 (-95.56%)
Mutual labels:  scrapy
Happy Spiders
🔧 🔩 🔨 A curated collection of crawler-related tools, simulated-login techniques, proxy IPs, Scrapy template code, and more.
Stars: ✭ 261 (-27.5%)
Mutual labels:  scrapy
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-93.89%)
Mutual labels:  scrapy
scrapyr
A simple and tiny Scrapy clustering solution, intended as a drop-in replacement for scrapyd
Stars: ✭ 50 (-86.11%)
Mutual labels:  scrapy
tripadvisor-scraper
TripAdvisor scraper
Stars: ✭ 63 (-82.5%)
Mutual labels:  scrapy
douban-spider
A Douban movie crawler based on the Scrapy framework
Stars: ✭ 25 (-93.06%)
Mutual labels:  scrapy
Scrapy Crawlera
Crawlera middleware for Scrapy
Stars: ✭ 281 (-21.94%)
Mutual labels:  scrapy
scrapy-pipelines
A collection of pipelines for Scrapy
Stars: ✭ 16 (-95.56%)
Mutual labels:  scrapy
ip proxy pool
Dynamically generates spiders to crawl and validate free proxy IPs found on the internet, using Scrapy.
Stars: ✭ 39 (-89.17%)
Mutual labels:  scrapy
Vault
Swiss army knife for hackers
Stars: ✭ 346 (-3.89%)
Mutual labels:  scrapy
Linkedin
LinkedIn scraper using Selenium WebDriver, headless Chromium, Docker, and Scrapy
Stars: ✭ 309 (-14.17%)
Mutual labels:  scrapy
Tieba spider
Baidu Tieba crawler (based on Scrapy and MySQL)
Stars: ✭ 257 (-28.61%)
Mutual labels:  scrapy

Awesome Scrapy

A curated list of awesome packages, articles, and other cool resources from the Scrapy community. Scrapy is a fast high-level web crawling & scraping framework for Python.
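For readers new to the framework, here is a minimal, self-contained spider sketch. The target site (quotes.toscrape.com, the sandbox commonly used in Scrapy tutorials) and the CSS selectors are purely illustrative.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        """Minimal example: crawl a paginated site and yield structured items."""
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the pagination link, if any, with the same callback
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Save it as quotes_spider.py and run scrapy runspider quotes_spider.py -o quotes.json; no project scaffolding is required.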

Table of Contents

Apps

Visual Web Scraping

  • Portia Visual scraping for Scrapy

Distributed Spider

Scrapy Service

  • scrapyscript Run a Scrapy spider programmatically from a script or a Celery task - no project required.

  • scrapyd A service daemon to run Scrapy spiders

  • scrapyd-client Command line client for Scrapyd server

  • python-scrapyd-api A Python wrapper for working with Scrapyd's API (see the sketch after this list).

  • SpiderKeeper A scalable admin UI for spider services

  • scrapyrt An HTTP server which provides an API for scheduling Scrapy spiders and making requests with them.
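As a rough sketch of how these services fit together, the snippet below schedules and monitors a crawl on a running Scrapyd server through python-scrapyd-api. The server URL, project name ("myproject"), and spider name ("quotes") are placeholders; it assumes the project has already been deployed (for example with scrapyd-deploy from scrapyd-client), and the method names follow the wrapper's documented API.

    from scrapyd_api import ScrapydAPI

    # Connect to a Scrapyd daemon (default port is 6800)
    scrapyd = ScrapydAPI("http://localhost:6800")

    # Schedule a crawl; Scrapyd returns a job id for tracking
    job_id = scrapyd.schedule("myproject", "quotes")
    print("scheduled job:", job_id)

    # Inspect pending / running / finished jobs for the project
    jobs = scrapyd.list_jobs("myproject")
    print("running jobs:", jobs["running"])

    # Cancel the job again if necessary
    scrapyd.cancel("myproject", job_id)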

Monitor

Avoid Ban

  • HttpProxyMiddleware A Scrapy middleware used to change the HTTP proxy from time to time.

  • scrapy-proxies Processes Scrapy requests using a random proxy from a list, to avoid IP bans and improve crawling speed.

  • scrapy-rotating-proxies Use multiple proxies with Scrapy (see the settings sketch after this list).

  • scrapy-random-useragent Scrapy Middleware to set a random User-Agent for every Request.

  • scrapy-fake-useragent Random User-Agent middleware based on fake-useragent

  • scrapy-crawlera Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems.
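A typical way to wire the proxy- and User-Agent-rotation middlewares above into a project is a few lines in settings.py. The sketch below follows the scrapy-rotating-proxies and scrapy-fake-useragent READMEs; the proxy addresses are placeholders, and middleware paths and priority numbers may differ between versions.

    # settings.py (excerpt)

    # Proxies for scrapy-rotating-proxies to rotate through;
    # dead or banned proxies are detected and taken out of rotation.
    ROTATING_PROXY_LIST = [
        "http://proxy1.example.com:8000",
        "http://proxy2.example.com:8031",
    ]

    DOWNLOADER_MIDDLEWARES = {
        # scrapy-rotating-proxies
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        # scrapy-fake-useragent: pick a random User-Agent for every request
        "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
        # disable Scrapy's built-in User-Agent middleware so the random one takes effect
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    }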

Data Processing

Process JavaScript

Other Useful Extensions

  • scrapy-djangoitem Scrapy extension to write scraped items using Django models

  • scrapy-deltafetch Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls (see the settings sketch after this list).

  • scrapy-crawl-once This package provides a Scrapy middleware which allows you to avoid re-crawling pages that were already downloaded in previous crawls.

  • scrapy-magicfields Scrapy middleware to add extra fields to items, such as timestamps, response fields, spider attributes, etc.

  • scrapy-pagestorage A Scrapy extension to store request and response information in a storage service.
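For the incremental-crawling extensions above, configuration again lives in settings.py. The sketch below follows the scrapy-deltafetch and scrapy-crawl-once READMEs; you would normally enable only one of the two, and the exact paths and priorities may vary between versions.

    # settings.py (excerpt)

    # Option A: scrapy-deltafetch - skip requests for pages whose items
    # were already extracted in a previous crawl.
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True

    # Option B: scrapy-crawl-once - skip requests that were already downloaded.
    # It only acts on requests explicitly marked in the spider with
    # request.meta["crawl_once"] = True.
    # SPIDER_MIDDLEWARES = {"scrapy_crawl_once.CrawlOnceMiddleware": 100}
    # DOWNLOADER_MIDDLEWARES = {"scrapy_crawl_once.CrawlOnceMiddleware": 50}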

Resources

Articles

Exercises

Video

Book

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].