
aaldaber / Distributed Multi User Scrapy System With A Web Ui

Django based application that allows creating, deploying and running Scrapy spiders in a distributed manner

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Distributed Multi User Scrapy System With A Web Ui

Python Spider
Assorted Python crawler demos: Douban Top 250 movies, Douyu JSON data and image scraping, Taobao, Youyuan, CrawlSpider scraping of Hongniang matchmaking profiles (including distributed crawling with Redis storage), Selenium, Duodian full-site mall scraping, Django API endpoints, simulated logins to Zhihu, GitHub and Tuchong, WeChat official-account article history, articles shared in WeChat groups or by WeChat friends, and itchat monitoring of articles shared by a specified WeChat official account
Stars: ✭ 615 (+598.86%)
Mutual labels:  scrapy, mongodb, django
Crawl
Asynchronous scraping of web images with Selenium
Stars: ✭ 13 (-85.23%)
Mutual labels:  django, rabbitmq
Seeker
Seeker - another job board aggregator.
Stars: ✭ 16 (-81.82%)
Mutual labels:  scrapy, django
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+1063.64%)
Mutual labels:  scrapy, django
Mall Swarm
mall-swarm is a microservices e-commerce system built on core technologies including Spring Cloud Hoxton & Alibaba, Spring Boot 2.3, OAuth2, MyBatis, Docker, Elasticsearch and Kubernetes, with a Vue-based admin backend for quickly standing up the system. On top of the e-commerce business it integrates a service registry, configuration center, monitoring center, gateway and other system functions. Fully documented, with a complete Spring Cloud tutorial included.
Stars: ✭ 7,874 (+8847.73%)
Mutual labels:  mongodb, rabbitmq
Funpyspidersearchengine
Word2vec personalized search + Scrapy 2.3.0 (data crawling) + ElasticSearch 7.9.1 (data storage with an external RESTful API) + Django 3.1.1 search
Stars: ✭ 782 (+788.64%)
Mutual labels:  scrapy, django
Phalcon Vm
Vagrant configuration for PHP7, Phalcon 3.x and Zephir development.
Stars: ✭ 43 (-51.14%)
Mutual labels:  mongodb, rabbitmq
Webhubbot
Python + Scrapy + MongoDB. 5 million records per day!!!💥 The world's largest website.
Stars: ✭ 5,427 (+6067.05%)
Mutual labels:  scrapy, mongodb
Django Carrot
A lightweight task queue for Django using RabbitMQ
Stars: ✭ 58 (-34.09%)
Mutual labels:  django, rabbitmq
Mall Learning
A learning tutorial for the mall project, covering architecture, business logic and technical points. The mall project (40k+ stars) is an e-commerce system built with current mainstream technologies, including SpringBoot 2.3.0, MyBatis 3.4.6, Elasticsearch 7.6.2, RabbitMQ 3.7.15, Redis 5.0, MongoDB 4.2.5 and MySQL 5.7, deployed in Docker containers.
Stars: ✭ 10,236 (+11531.82%)
Mutual labels:  mongodb, rabbitmq
Spring Examples
SpringBoot Examples
Stars: ✭ 67 (-23.86%)
Mutual labels:  mongodb, rabbitmq
Jd spider
A distributed spider for JD.com, by two silly but adorable developers.
Stars: ✭ 738 (+738.64%)
Mutual labels:  scrapy, mongodb
Bifrost
Bifrost ---- production-oriented heterogeneous middleware for syncing MySQL to Redis, MongoDB, ClickHouse, MySQL and other services
Stars: ✭ 701 (+696.59%)
Mutual labels:  mongodb, rabbitmq
Goodskill
🐂 A mock flash-sale project built on Spring Cloud + Dubbo, with a modular design integrating database/table sharding, Elasticsearch 🔍, gateway, mybatis-plus, spring-session and other common open source components
Stars: ✭ 786 (+793.18%)
Mutual labels:  mongodb, rabbitmq
Spring Boot Examples
About learning Spring Boot via examples. Spring Boot tutorials and tech-stack sample code for getting started quickly and easily.
Stars: ✭ 26,812 (+30368.18%)
Mutual labels:  mongodb, rabbitmq
Ihealth site
Backend program for the iHealth project (a web backend based on Django and MongoDB)
Stars: ✭ 29 (-67.05%)
Mutual labels:  mongodb, django
Machinery
Machinery is an asynchronous task queue/job queue based on distributed message passing.
Stars: ✭ 5,821 (+6514.77%)
Mutual labels:  mongodb, rabbitmq
Djongo
Django and MongoDB database connector
Stars: ✭ 1,222 (+1288.64%)
Mutual labels:  mongodb, django
Django Celery Tutorial
Django Celery Tutorial
Stars: ✭ 48 (-45.45%)
Mutual labels:  django, rabbitmq
Transporter
Sync data between persistence engines, like ETL only not stodgy
Stars: ✭ 1,175 (+1235.23%)
Mutual labels:  mongodb, rabbitmq

Distributed Multi-User Scrapy System with a Web UI

This is a Django project that lets users create, configure, deploy and run Scrapy spiders through a web interface. The goal of the project is to build an application that allows multiple users to write their own scraping scripts and deploy them to a cluster of workers that scrape in a distributed fashion. Through the web interface, users can perform the following actions:

  • Create a Scrapy project
  • Add/Edit/Delete Scrapy Items
  • Add/Edit/Delete Scrapy Item Pipelines
  • Edit Link Generator function (more on this below)
  • Edit Scraper function (more on this below)
  • Deploy the projects to worker machines
  • Start/Stop projects on worker machines
  • Display online status of the worker machines, the database, and the link queue
  • Display the deployment status of projects
  • Display the number of items scraped
  • Display the number of errors that occurred in a project while scraping
  • Display start/stop date and time for projects

Architecture

The application comes bundled with a Scrapy pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). The code for these was adapted from https://github.com/sebdah/scrapy-mongodb and https://github.com/roycehaynes/scrapy-rabbitmq. Here is what you need to run the application:

  • MongoDB server (can be standalone or a sharded cluster, replica sets were not tested)
  • RabbitMQ server
  • One link generator worker server with Scrapy installed and running scrapyd daemon
  • At least one scraper worker server with Scrapy installed and running scrapyd daemon

After you have all of the above up and running, fill in sample_settings.py in the root folder and scrapyproject/scrapy_packages/sample_settings.py with the needed information, rename both files to settings.py, and run the Django server (don't forget to perform the migrations first). You can then go to http://localhost:8000/project/ to start creating your first project.
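As an illustration only (the actual key names and structure live in the two sample_settings.py files in the repository), a filled-in settings file might look something like this:

```python
# Hypothetical settings.py values -- consult sample_settings.py in the
# repository for the real key names; everything below is a placeholder.
MONGODB_URI = "mongodb://192.168.0.5:27017"   # standalone server or mongos router
RABBITMQ_HOST = "192.168.0.6"                 # RabbitMQ server holding the link queue
RABBITMQ_PORT = 5672
LINK_GENERATOR = "http://192.168.0.10:6800"   # scrapyd daemon on the link generator machine
SCRAPERS = [
    "http://192.168.0.11:6800",               # scrapyd daemons on the scraper machines
    "http://192.168.0.12:6800",
]
```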

Link Generator

The link generator function inserts all the links that need to be scraped into the RabbitMQ queue; scraper workers then dequeue those links, scrape the items and save them to MongoDB. The link generator itself is just a Scrapy spider, written inside a parse(self, response) function. The only difference from a regular spider is that the link generator does not scrape or save items: it only extracts the links that need to be scraped and inserts them into RabbitMQ for the scraper machines to consume.

Scrapers

The scraper function takes links from RabbitMQ, makes a request to each link, parses the response, and saves the items to the database. The scraper is also just a Scrapy spider, but without the functionality to add links to the queue.

This separation of roles allows the links to be distributed evenly among multiple scrapers. There can be only one link generator per project, but an unlimited number of scrapers.

RabbitMQ

When a project is deployed and run, the link generator creates a queue for the project in username_projectname:requests format and starts inserting links. Scrapers use the bundled RabbitMQ scheduler for Scrapy to fetch one link at a time and process it.
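The naming scheme can be expressed as a one-line helper (illustrative, not the project's actual code):

```python
def request_queue_name(username, project_name):
    """Build the per-project RabbitMQ queue name: username_projectname:requests."""
    return "{}_{}:requests".format(username, project_name)

# e.g. user "alice" running project "shop" gets the queue "alice_shop:requests"
```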

MongoDB

All of the items that get scraped are saved to MongoDB. There is no need to prepare the database or collections beforehand. When the first item is saved, the scraper creates a database in username_projectname format and inserts items into a collection named after the item name defined in Scrapy. If you are using a sharded MongoDB cluster, the scrapers will try to autoshard the database and the collections when saving items; a hashed _id key is used for sharding.
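The naming and autosharding behavior can be sketched with pymongo-style calls. Here `client` is assumed to be a pymongo MongoClient connected to a mongos router; the bundled pipeline's exact calls may differ:

```python
def scraped_db_name(username, project_name):
    """Database naming used for scraped items: username_projectname."""
    return "{}_{}".format(username, project_name)

def autoshard(client, username, project_name, item_name):
    """Sketch: enable sharding for the project database and shard the
    item's collection on a hashed _id key. `client` is assumed to be a
    pymongo MongoClient pointed at a mongos router (illustrative code,
    not the project's actual pipeline)."""
    db_name = scraped_db_name(username, project_name)
    client.admin.command("enableSharding", db_name)
    client.admin.command("shardCollection",
                         "{}.{}".format(db_name, item_name),
                         key={"_id": "hashed"})
```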

Here are the general steps that the application performs:

  1. You create a new project, define items and item pipelines, add the link generator and scraper functions, and adjust the settings
  2. You press "Deploy the project"
  3. The scripts and settings are laid out in a standard Scrapy project folder structure (two folders are created: one for the link generator, one for the scraper)
  4. The two folders are packaged into .egg files
  5. The link generator egg file is uploaded to the scrapyd server defined in the settings file
  6. The scraper egg file is uploaded to all scrapyd servers defined in the settings file
  7. You start the link generator
  8. You start the scrapers
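Steps 5 through 8 map onto scrapyd's standard HTTP JSON API (addversion.json to upload an egg, schedule.json to start a spider). A rough sketch using the requests library, with placeholder URLs and names:

```python
import requests

def deploy_egg(scrapyd_url, project, version, egg_path):
    """Upload a packaged .egg to a scrapyd daemon (steps 5-6)."""
    with open(egg_path, "rb") as egg:
        resp = requests.post(scrapyd_url + "/addversion.json",
                             data={"project": project, "version": version},
                             files={"egg": egg})
    return resp.json()

def start_spider(scrapyd_url, project, spider):
    """Schedule a spider run on a scrapyd daemon (steps 7-8)."""
    resp = requests.post(scrapyd_url + "/schedule.json",
                         data={"project": project, "spider": spider})
    return resp.json()
```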

Installation

The web application requires:

  • Django 1.8.13
  • django-crispy-forms
  • django-registration
  • pymongo
  • requests
  • python-dateutil

On the link generator and scraper machines you need:

  • Scrapy
  • scrapyd
  • pymongo
  • pika

The dashboard theme used for the UI was retrieved from https://github.com/VinceG/Bootstrap-Admin-Theme.

Examples

Link generator and scraper functions are given in the examples folder.

Screenshots

Screenshots of the web interface are included in the project repository.

License

This project is licensed under the terms of the MIT license.
