xiyouMc / Webhubbot

Licence: mit
Python + Scrapy + MongoDB. 5 million records per day! 💥 Crawls the world's largest website.

Projects that are alternatives to or similar to Webhubbot

Python Spider
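Douban Movie Top 250; crawling JSON data from Douyu and pictures of beautiful women; Taobao; Youyuan; a CrawlSpider crawl of basic profile information from the Hongniang matchmaking site, plus distributed crawling of Hongniang with Redis storage; small crawler demos; Selenium; crawling Dmall; developing APIs with Django; crawling Youyuan profile information; simulated logins to Zhihu, GitHub, and Tuchong; crawling the full Dmall store site; crawling the article history of WeChat official accounts; crawling articles shared in WeChat groups or by WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts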
Stars: ✭ 615 (-88.67%)
Mutual labels:  scrapy, mongodb
Docs
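Source code for "Data Collection: From Getting Started to Giving Up". Contents: introduction to crawlers, the job market, and interview questions for crawler engineers; introduction to HTTP; using Requests; introduction to the XPath parser; MongoDB and MySQL; multithreaded crawlers; introduction to Scrapy; introduction to Scrapy-redis; deploying with Docker; managing a Docker cluster with Nomad; querying Docker logs with EFK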
Stars: ✭ 118 (-97.83%)
Mutual labels:  scrapy, mongodb
Jd spider
Two goofy but cute distributed crawlers for JD.com.
Stars: ✭ 738 (-86.4%)
Mutual labels:  scrapy, mongodb
Scrapy demo
all kinds of scrapy demo
Stars: ✭ 128 (-97.64%)
Mutual labels:  scrapy, mongodb
Distributed Multi User Scrapy System With A Web Ui
Django based application that allows creating, deploying and running Scrapy spiders in a distributed manner
Stars: ✭ 88 (-98.38%)
Mutual labels:  scrapy, mongodb
Spider job
Crawler for job-listing site data
Stars: ✭ 234 (-95.69%)
Mutual labels:  scrapy, mongodb
Node Express Mongodb Jwt Rest Api Skeleton
This is a basic API REST skeleton written on JavaScript using async/await. Great for building a starter web API for your front-end (Android, iOS, Vue, react, angular, or anything that can consume an API). Demo of frontend in VueJS here: https://github.com/davellanedam/vue-skeleton-mvp
Stars: ✭ 603 (-88.89%)
Mutual labels:  mongodb
Mongodb exporter
A Prometheus exporter for MongoDB including sharding, replication and storage engines
Stars: ✭ 602 (-88.91%)
Mutual labels:  mongodb
Springboot Starterkit
Starter Kit for Spring Boot based (REST APIs and WebMVC) micro services.
Stars: ✭ 596 (-89.02%)
Mutual labels:  mongodb
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (-89.26%)
Mutual labels:  scrapy
Meteor Collection Hooks
Meteor Collection Hooks
Stars: ✭ 641 (-88.19%)
Mutual labels:  mongodb
Mongo Rust Driver
The official MongoDB Rust Driver
Stars: ✭ 633 (-88.34%)
Mutual labels:  mongodb
Dev Setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Stars: ✭ 5,590 (+3%)
Mutual labels:  mongodb
Mongokitten
Native MongoDB driver for Swift, written in Swift
Stars: ✭ 605 (-88.85%)
Mutual labels:  mongodb
Clean Ts Api
API in NodeJs using TypeScript, TDD, Clean Architecture, Design Patterns and SOLID principles
Stars: ✭ 619 (-88.59%)
Mutual labels:  mongodb
Scrapyrt
HTTP API for Scrapy spiders
Stars: ✭ 637 (-88.26%)
Mutual labels:  scrapy
Mongo Spark
The MongoDB Spark Connector
Stars: ✭ 588 (-89.17%)
Mutual labels:  mongodb
Pythonspidernotes
The essentials of getting started with web crawling in Python
Stars: ✭ 5,634 (+3.81%)
Mutual labels:  scrapy
Icrawler
A multi-thread crawler framework with many builtin image crawlers provided.
Stars: ✭ 629 (-88.41%)
Mutual labels:  scrapy
Injectify
Perform advanced MiTM attacks on websites with ease 💉
Stars: ✭ 612 (-88.72%)
Mutual labels:  mongodb


Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial purposes or for other personal ends. Any consequences of improper use are the responsibility of the individual user.

  • The project is mainly used for crawling a Website, the largest site in the world. For each video it retrieves the title, duration, mp4 link, cover URL, and the URL of the video's page on the Website.
  • This project crawls PornHub.com quickly, but with a simple structure.
  • This project can crawl up to 5 million of the Website's videos per day, depending on your network. My own bandwidth is slow, so my results are relatively slow.
  • The crawler issues 10 concurrent requests (threads) at a time, which is how it reaches the speed mentioned above. If your network is faster, you can raise the thread count and crawl more videos per day. For the specific configuration see [pre-boot configuration]
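To put the 5 million per day figure in perspective, the short calculation below converts it into a per-second rate and splits it across the 10 concurrent requests mentioned above. It is only arithmetic on the numbers quoted in this README; the function name `required_rate` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(items_per_day, concurrent_requests):
    """Return (items per second overall, items per second per concurrent request)."""
    per_second = items_per_day / float(SECONDS_PER_DAY)
    return per_second, per_second / concurrent_requests


# Figures quoted in this README: 5 million videos per day, 10 concurrent requests.
total, per_slot = required_rate(5000000, 10)
print("{:.1f} items/s overall, {:.2f} items/s per concurrent request".format(total, per_slot))
# prints "57.9 items/s overall, 5.79 items/s per concurrent request"
```

Sustaining that rate depends on how many video entries each response yields and on network latency, which is why the achievable total varies with bandwidth.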

Environment, Architecture

Language: Python 2.7

Environment: macOS, 4 GB RAM

Database: MongoDB

  • Built mainly on the Scrapy crawler framework.
  • Each request sent by the spider carries a cookie and a User-Agent (UA) picked at random from a cookie pool and a UA pool.
  • start_requests issues five Requests, one per category of the Website, so the five categories are crawled at the same time.
  • Supports crawling paginated data: each following page is added to the request queue.
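The random cookie and UA selection described above could be sketched as a class that follows Scrapy's downloader-middleware interface (`process_request(request, spider)`, returning `None` so processing continues). This is an illustrative sketch, not the project's actual code: the class name `RandomIdentityMiddleware`, the pool contents, and the `rng` parameter are all hypothetical.

```python
import random

# Hypothetical pools for illustration; a real crawler would load its own values.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) ExampleBrowser/2.0",
]
COOKIE_POOL = [
    {"session": "example-a"},
    {"session": "example-b"},
]


class RandomIdentityMiddleware(object):
    """Assign a random User-Agent and cookie set to every outgoing request."""

    def __init__(self, user_agents=None, cookie_pool=None, rng=random):
        self.user_agents = list(user_agents or USER_AGENTS)
        self.cookie_pool = list(cookie_pool or COOKIE_POOL)
        self.rng = rng

    def process_request(self, request, spider):
        # Pick one entry from each pool independently for this request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Copy the chosen cookies so the shared pool entry is never mutated.
        request.cookies = dict(self.rng.choice(self.cookie_pool))
        return None  # returning None lets the request continue down the chain
```

In a Scrapy project such a class would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, with a priority that runs it before the built-in cookie handling so the assigned cookies are picked up.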

Instructions for use

Pre-boot configuration

  • Install MongoDB and start it with the default configuration
  • Install the required Python modules: Scrapy, pymongo, requests, or run pip install -r requirements.txt
  • Modify the configuration as needed, such as the request interval, the number of threads, etc.
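As a rough guide to the last step, the fragment below shows how the request interval and the thread count are commonly expressed in a Scrapy `settings.py`. `CONCURRENT_REQUESTS` and `DOWNLOAD_DELAY` are standard Scrapy settings, but whether this project exposes its tuning through exactly these names is an assumption, and the delay value is a placeholder.

```python
# settings.py (sketch; the values are examples, not the project's defaults)

# Maximum number of requests Scrapy performs concurrently.
# This README mentions 10; Scrapy's own default is 16.
CONCURRENT_REQUESTS = 10

# Seconds to wait between consecutive requests to the same site.
# Placeholder value: raise it to crawl more gently, lower it to crawl faster.
DOWNLOAD_DELAY = 0.5
```

Raising `CONCURRENT_REQUESTS` only helps while bandwidth remains the spare resource; beyond that point extra requests just queue.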

Start up

  • cd WebHub
  • python quickstart.py

Database description

The table in the database that holds the data is PhRes. The following is a field description:

PhRes table:

video_title:     The title of the video; also used as the unique key
link_url:        Link to the video's page on the Website
image_url:       Link to the video's cover image
video_duration:  The length of the video, in seconds
quality_480p:    Download address of the 480p mp4 video
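Because video_title serves as the unique key, a record can be written with an upsert so that crawling the same video twice updates the existing document instead of adding a duplicate. The sketch below assumes a pymongo-style collection object exposing `update_one(filter, update, upsert=...)`; the helper name `upsert_video` and the sample values are hypothetical, and the project's own pipeline may store items differently.

```python
def upsert_video(collection, item):
    """Insert the item, or update the existing document with the same video_title."""
    collection.update_one(
        {"video_title": item["video_title"]},  # match on the unique key
        {"$set": dict(item)},                  # overwrite all fields from the item
        upsert=True,                           # create the document if none matches
    )


# Example item using the PhRes fields described above (values are placeholders).
example_item = {
    "video_title": "example title",
    "link_url": "https://example.com/view?id=1",
    "image_url": "https://example.com/cover.jpg",
    "video_duration": 754,
    "quality_480p": "https://example.com/video_480p.mp4",
}

# Usage with pymongo (requires a running MongoDB; the database name is hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient("localhost", 27017)["webhub"]["PhRes"]
#   upsert_video(collection, example_item)
```

A unique index on video_title would additionally make the database itself reject duplicates.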

For Chinese

  • Follow the WeChat official account to learn Python development