
kongtrio / hupu_spider

Licence: other
A crawler for Hupu's BXJ (步行街) forum

Programming Languages

python

Projects that are alternatives to or similar to hupu_spider

Scrapy-tripadvisor-reviews
Using scrapy to scrape tripadvisor in order to get users' reviews.
Stars: ✭ 24 (+9.09%)
Mutual labels:  scrapy
Scrape-Finance-Data
My code for scraping financial data in Vietnam
Stars: ✭ 13 (-40.91%)
Mutual labels:  scrapy
scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
Stars: ✭ 17 (-22.73%)
Mutual labels:  scrapy
scrapy helper
Dynamically configurable crawler (动态可配置化爬虫)
Stars: ✭ 84 (+281.82%)
Mutual labels:  scrapy
vietnam-ecommerce-crawler
Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs
Stars: ✭ 28 (+27.27%)
Mutual labels:  scrapy
ArticleSpider
Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).
Stars: ✭ 34 (+54.55%)
Mutual labels:  scrapy
lgcrawl
python + scrapy + splash crawler for all job postings on Lagou (拉勾)
Stars: ✭ 22 (+0%)
Mutual labels:  scrapy
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (+72.73%)
Mutual labels:  scrapy
scrapy-LBC
A LeBonCoin spider built with Scrapy and ElasticSearch
Stars: ✭ 14 (-36.36%)
Mutual labels:  scrapy
fernando-pessoa
A classifier of Fernando Pessoa's poems according to his heteronyms
Stars: ✭ 31 (+40.91%)
Mutual labels:  scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (+172.73%)
Mutual labels:  scrapy
crawler
A collection of Python crawler projects
Stars: ✭ 29 (+31.82%)
Mutual labels:  scrapy
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+318.18%)
Mutual labels:  scrapy
scrapy-rotated-proxy
A scrapy middleware to use rotated proxy ip list.
Stars: ✭ 22 (+0%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
scrapy mysql pipeline
Stars: ✭ 47 (+113.64%)
Mutual labels:  scrapy
arche
Analyze scraped data
Stars: ✭ 49 (+122.73%)
Mutual labels:  scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+459.09%)
Mutual labels:  scrapy
scrapy-kafka-redis
Distributed crawling/scraping, Kafka And Redis based components for Scrapy
Stars: ✭ 45 (+104.55%)
Mutual labels:  scrapy
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+263.64%)
Mutual labels:  scrapy
itemadapter
Common interface for data container classes
Stars: ✭ 47 (+113.64%)
Mutual labels:  scrapy

hupu_spider

A crawler built with the Scrapy framework that scrapes posts from Hupu's BXJ (步行街) forum. Forum address: https://bbs.hupu.com/bxj

Completed features

  1. Crawl posts and store the author, posting time, view count, reply count, and other metadata in the database
  2. Crawl post content, fetch the replies, and insert them into the database
  3. Download the images embedded in post content
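The scrape-and-store flow of feature 1 can be sketched roughly as follows. The item fields, table name, and column names here are illustrative assumptions; the real schema lives in the project's mysql_db/hupu_post.sql:

```python
from dataclasses import dataclass

# Hypothetical item for one scraped post; the project's real fields may differ.
@dataclass
class HupuPost:
    title: str
    author: str
    post_time: str
    views: int
    replies: int

def build_insert(post: HupuPost):
    """Build a parameterized INSERT suitable for pymysql's cursor.execute().

    Table and column names below are assumptions for illustration only.
    """
    sql = ("INSERT INTO hupu_post (title, author, post_time, views, replies) "
           "VALUES (%s, %s, %s, %s, %s)")
    return sql, (post.title, post.author, post.post_time, post.views, post.replies)
```

A pipeline would pass the SQL string and the parameter tuple to cursor.execute, letting the driver escape values instead of building the statement by string concatenation.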

To do

  1. Random user-agent switching. No user-agent is set at the moment; Hupu apparently does little anti-crawling, so nothing has broken so far. Still, to keep the crawler from suddenly failing one day, setting a user-agent is the safer option.
  2. Add proxies to keep the IP from being banned
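Random user-agent switching is usually done with a Scrapy downloader middleware. A minimal sketch follows; the class name and agent strings are placeholders, not part of this project:

```python
import random

# Placeholder pool; in practice use a longer list of real browser agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class RandomUserAgentMiddleware:
    """Scrapy calls process_request for every outgoing request;
    overwriting the header here rotates the user-agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue handling the request
```

To activate it, the middleware would be registered under DOWNLOADER_MIDDLEWARES in settings.py.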

Installation

Required software

  • python3
  • mysql

Required Python libraries:

  • scrapy
  • pillow (used to download images; you can skip it if you don't need image downloads)
  • DBUtils
  • pymysql
pip install scrapy
pip install pillow
pip install DBUtils
pip install pymysql

Once the required software and libraries are installed, follow these steps:

  1. Enter the mysql shell and create the database: create database hupu. The database does not have to be called hupu; if you choose another name, remember to update the project's config file accordingly.
  2. Run mysql_db/hupu_post.sql from this project to create the tables.
  3. Edit the database settings in db_config.py, including the user password and database name.
  4. To download images, set the IMAGES_STORE variable in settings.py and uncomment the HupuImgDownloadPipeline line in the ITEM_PIPELINES variable. The project does not download images by default.
  5. Run ./run.sh to start the crawler; on Windows, running scrapy crawl hupu_post in the hupu_spider directory does the same thing.
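For step 4, the settings.py changes would look roughly like this. The storage path and the pipeline module path are assumptions for illustration; match them to the project's actual settings.py:

```python
# settings.py fragment (illustrative) — enable image downloading.

# Directory where the image pipeline saves files (assumed path; pick your own).
IMAGES_STORE = "/data/hupu_images"

ITEM_PIPELINES = {
    # The module path below is an assumption; use the real path from settings.py.
    "hupu_spider.pipelines.HupuImgDownloadPipeline": 300,
}
```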

Troubleshooting

1. Running the crawler raises ModuleNotFoundError: No module named '_sqlite3'

This means sqlite3 is probably not installed on the server. You can run yum install sqlite* to install it. If your server has no yum, or yum fails to install sqlite (if so, try switching to another yum mirror first), download the sqlite source package and compile and install it yourself; there are many tutorials online. After installing sqlite, you need to recompile and reinstall Python for the change to take effect. A guide to compiling and installing Python: https://blog.csdn.net/u013332124/article/details/80643371
