xiyouMc / Webhubbot

Licence: mit

Python + Scrapy + MongoDB. Crawls up to 5 million records per day. 💥 The world's largest website.

Projects that are alternatives of or similar to Webhubbot

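Douban Movie Top 250; Douyu JSON data scraping and scraping pictures of beautiful women; Taobao; Youyuan; CrawlSpider scraping of partial basic profile information from the Hongniang matchmaking site, plus distributed scraping of Hongniang with storage in Redis; small crawler demos; Selenium; scraping Duodian; developing APIs with Django; scraping Youyuan profile information; simulated Zhihu login; simulated GitHub login; simulated Tuchong login; scraping the entire Duodian store site; scraping the article history of WeChat official accounts; scraping articles shared by WeChat groups or WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts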

Stars: ✭ 615 (-88.67%)

Mutual labels: scrapy, mongodb

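Source code for the book "Data Collection: From Getting Started to Giving Up". Contents: introduction to crawlers, the job market, and crawler engineer interview questions; introduction to the HTTP protocol; using Requests; introduction to the XPath parser; MongoDB and MySQL; multithreaded crawlers; introduction to Scrapy; introduction to Scrapy-redis; deploying with Docker; managing a Docker cluster with Nomad; querying Docker logs with EFK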

Stars: ✭ 118 (-97.83%)

Mutual labels: scrapy, mongodb

Two goofy, cute distributed crawlers for JD.com.

Stars: ✭ 738 (-86.4%)

Mutual labels: scrapy, mongodb

all kinds of scrapy demo

Stars: ✭ 128 (-97.64%)

Mutual labels: scrapy, mongodb

Distributed Multi User Scrapy System With A Web Ui

Django based application that allows creating, deploying and running Scrapy spiders in a distributed manner

Stars: ✭ 88 (-98.38%)

Mutual labels: scrapy, mongodb

Crawler for job-recruitment website data

Stars: ✭ 234 (-95.69%)

Mutual labels: scrapy, mongodb

Node Express Mongodb Jwt Rest Api Skeleton

This is a basic REST API skeleton written in JavaScript using async/await. Great for building a starter web API for your front-end (Android, iOS, Vue, React, Angular, or anything that can consume an API). Demo of frontend in VueJS here: https://github.com/davellanedam/vue-skeleton-mvp

Stars: ✭ 603 (-88.89%)

Mutual labels: mongodb

Mongodb exporter

A Prometheus exporter for MongoDB including sharding, replication and storage engines

Stars: ✭ 602 (-88.91%)

Mutual labels: mongodb

Springboot Starterkit

Starter Kit for Spring Boot based (REST APIs and WebMVC) micro services.

Stars: ✭ 596 (-89.02%)

Mutual labels: mongodb

Easy Scraping Tutorial

Simple but useful Python web scraping tutorial code.

Stars: ✭ 583 (-89.26%)

Mutual labels: scrapy

Meteor Collection Hooks

Meteor Collection Hooks

Stars: ✭ 641 (-88.19%)

Mutual labels: mongodb

Mongo Rust Driver

The official MongoDB Rust Driver

Stars: ✭ 633 (-88.34%)

Mutual labels: mongodb

macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

Stars: ✭ 5,590 (+3%)

Mutual labels: mongodb

Native MongoDB driver for Swift, written in Swift

Stars: ✭ 605 (-88.85%)

Mutual labels: mongodb

API in NodeJs using Typescript, TDD, Clean Architecture, Design Patterns and SOLID principles

Stars: ✭ 619 (-88.59%)

Mutual labels: mongodb

HTTP API for Scrapy spiders

Stars: ✭ 637 (-88.26%)

Mutual labels: scrapy

The MongoDB Spark Connector

Stars: ✭ 588 (-89.17%)

Mutual labels: mongodb

Pythonspidernotes

The essentials of getting started with web crawlers in Python

Stars: ✭ 5,634 (+3.81%)

Mutual labels: scrapy

A multi-thread crawler framework with many builtin image crawlers provided.

Stars: ✭ 629 (-88.41%)

Mutual labels: scrapy

Perform advanced MiTM attacks on websites with ease 💉

Stars: ✭ 612 (-88.72%)

Mutual labels: mongodb

Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial or other personal purposes. Any consequences of improper use are borne by the individual user.

The project is mainly used for crawling a website, the largest site in the world. For each video it retrieves the title, duration, mp4 link, cover URL and the direct URL on the website.
This project crawls PornHub.com quickly while keeping a simple structure.
It can crawl up to 5 million of the website's videos per day, depending on your network. My own bandwidth is slow, so my results are relatively slow.
The crawler runs 10 threads at a time, which is how it reaches the speed mentioned above. If your network is faster, you can use more threads and crawl more videos per day. For the specific configuration, see [pre-boot configuration].

Environment, Architecture

Language: Python 2.7

Environment: macOS, 4 GB RAM

Database: MongoDB

Mainly uses the Scrapy crawler framework.
Each request made by the spider uses a cookie and a User-Agent (UA) drawn at random from a cookie pool and a UA pool.
start_requests issues five Requests based on the website's categories, so the five categories are crawled at the same time.
Supports paginated crawling; follow-up pages are added to the request queue.
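
The random cookie and UA selection described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name RandomIdentityMiddleware, the USER_AGENTS and COOKIES pools and their placeholder values are all assumptions. The class follows the shape of a Scrapy downloader middleware (a process_request(request, spider) method that returns None), but it does not import Scrapy, so it can be read and run on its own.

```python
import random

# Hypothetical pools; the project's real cookie and UA lists are not shown here.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
COOKIES = [
    {"session": "placeholder-a"},
    {"session": "placeholder-b"},
]


class RandomIdentityMiddleware(object):
    """Downloader-middleware-style class (hypothetical name).

    For every outgoing request it picks one User-Agent and one cookie
    dict at random from the pools above.
    """

    def __init__(self, user_agents=USER_AGENTS, cookies=COOKIES, rng=random):
        self.user_agents = list(user_agents)
        self.cookies = list(cookies)
        self.rng = rng

    def process_request(self, request, spider):
        # Overwrite the User-Agent header and the cookies on this request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        request.cookies = self.rng.choice(self.cookies)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

In a real Scrapy project a class like this would be registered under DOWNLOADER_MIDDLEWARES in the settings. Its position relative to the built-in cookie handling matters, so treat the ordering as something to verify rather than a given.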

Instructions for use

Pre-boot configuration

Install MongoDB and start it; no configuration is needed.
Install the Python dependencies: Scrapy, pymongo, requests, or run pip install -r requirements.txt
Modify the configuration as needed, such as the interval time, the number of threads, etc.
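
As a rough sketch of the last step, the interval and thread count would typically map onto standard Scrapy settings such as DOWNLOAD_DELAY and CONCURRENT_REQUESTS. That mapping is an assumption here, since the project's own settings file is not shown, and the delay value below is purely illustrative.

```python
# settings.py (sketch): illustrative values, not the project's defaults.

# Number of requests Scrapy keeps in flight at once; 10 matches the
# "10 threads" figure in the description above (assumed mapping).
CONCURRENT_REQUESTS = 10

# Seconds to wait between consecutive requests to the same site.
# The value 1 is an assumption; raise it to crawl more gently.
DOWNLOAD_DELAY = 1
```

Raising CONCURRENT_REQUESTS or lowering DOWNLOAD_DELAY increases throughput at the cost of more load on both your network and the target site.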

Start up

cd WebHub
python quickstart.py

Run screenshots

Database description

The data is stored in the PhRes table. Field descriptions:

PhRes table:

video_title:     The title of the video; also used as the unique key
link_url:        Link to the video's page on the website
image_url:       Video cover image link
video_duration:  The length of the video, in seconds
quality_480p:    Download address of the 480p mp4 video
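
To illustrate how a record with these fields might be written while keeping video_title unique, here is a minimal sketch. The helper names build_record and save_record are hypothetical and not taken from the project; the only library call assumed is pymongo's Collection.update_one with upsert=True, which inserts the document if no match exists and updates it otherwise.

```python
def build_record(title, link_url, image_url, duration_seconds, quality_480p):
    """Assemble one document using the field names from the PhRes table."""
    return {
        "video_title": title,
        "link_url": link_url,
        "image_url": image_url,
        "video_duration": duration_seconds,
        "quality_480p": quality_480p,
    }


def save_record(collection, record):
    """Insert or update a record, keyed on video_title.

    `collection` is expected to behave like a pymongo Collection.
    Upserting on video_title means a title crawled twice is stored once.
    """
    collection.update_one(
        {"video_title": record["video_title"]},
        {"$set": record},
        upsert=True,
    )
```

With pymongo, the collection argument would be obtained from a MongoClient, for example client["some_db"]["PhRes"], where the database name is a placeholder.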

For Chinese

Follow the WeChat official account to learn Python development

Stars: ✭ 5,427
