
kongtrio / hupu_spider

Licence: other
A crawler for Hupu's BXJ (步行街) forum

Programming Languages

python

Projects that are alternatives to or similar to hupu_spider

Scrapy-tripadvisor-reviews
Using scrapy to scrape tripadvisor in order to get users' reviews.
Stars: ✭ 24 (+9.09%)
Mutual labels:  scrapy
Scrape-Finance-Data
My code for scraping financial data in Vietnam
Stars: ✭ 13 (-40.91%)
Mutual labels:  scrapy
scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
Stars: ✭ 17 (-22.73%)
Mutual labels:  scrapy
scrapy helper
Dynamically configurable crawler (动态可配置化爬虫)
Stars: ✭ 84 (+281.82%)
Mutual labels:  scrapy
vietnam-ecommerce-crawler
Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs
Stars: ✭ 28 (+27.27%)
Mutual labels:  scrapy
ArticleSpider
Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).
Stars: ✭ 34 (+54.55%)
Mutual labels:  scrapy
lgcrawl
python + scrapy + splash crawler for all job postings on Lagou (拉勾)
Stars: ✭ 22 (+0%)
Mutual labels:  scrapy
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (+72.73%)
Mutual labels:  scrapy
scrapy-LBC
A LeBonCoin spider built with Scrapy and ElasticSearch
Stars: ✭ 14 (-36.36%)
Mutual labels:  scrapy
fernando-pessoa
A classifier of Fernando Pessoa's poems according to his heteronyms
Stars: ✭ 31 (+40.91%)
Mutual labels:  scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (+172.73%)
Mutual labels:  scrapy
crawler
A collection of Python crawler projects
Stars: ✭ 29 (+31.82%)
Mutual labels:  scrapy
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (+318.18%)
Mutual labels:  scrapy
scrapy-rotated-proxy
A scrapy middleware to use rotated proxy ip list.
Stars: ✭ 22 (+0%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
scrapy mysql pipeline
Stars: ✭ 47 (+113.64%)
Mutual labels:  scrapy
arche
Analyze scraped data
Stars: ✭ 49 (+122.73%)
Mutual labels:  scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+459.09%)
Mutual labels:  scrapy
scrapy-kafka-redis
Distributed crawling/scraping, Kafka And Redis based components for Scrapy
Stars: ✭ 45 (+104.55%)
Mutual labels:  scrapy
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+263.64%)
Mutual labels:  scrapy
itemadapter
Common interface for data container classes
Stars: ✭ 47 (+113.64%)
Mutual labels:  scrapy

hupu_spider

A crawler built with the Scrapy framework that scrapes posts from Hupu's BXJ (步行街) forum. Forum address: https://bbs.hupu.com/bxj

Completed features

  1. Crawl posts and store the author, posting time, view count, reply count, and other metadata in the database
  2. Crawl post content, fetch the replies, and insert them into the database
  3. Download the images embedded in post content
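The scrape-and-store flow of feature 1 can be sketched roughly as follows. The item fields, table name, and column names here are illustrative assumptions; the real schema lives in the project's mysql_db/hupu_post.sql:

```python
from dataclasses import dataclass

# Hypothetical item for one scraped post; the project's real fields may differ.
@dataclass
class HupuPost:
    title: str
    author: str
    post_time: str
    views: int
    replies: int

def build_insert(post: HupuPost):
    """Build a parameterized INSERT suitable for pymysql's cursor.execute().

    Table and column names below are assumptions for illustration only.
    """
    sql = ("INSERT INTO hupu_post (title, author, post_time, views, replies) "
           "VALUES (%s, %s, %s, %s, %s)")
    return sql, (post.title, post.author, post.post_time, post.views, post.replies)
```

A pipeline would pass the SQL string and the parameter tuple to cursor.execute, letting the driver escape values instead of building the statement by string concatenation.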

To do

  1. Random user-agent switching. No user-agent is set at the moment; Hupu apparently does little anti-crawling, so nothing has broken so far. Still, to keep the crawler from suddenly failing one day, setting a user-agent is the safer option.
  2. Add proxies to keep the IP from being banned
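Random user-agent switching is usually done with a Scrapy downloader middleware. A minimal sketch follows; the class name and agent strings are placeholders, not part of this project:

```python
import random

# Placeholder pool; in practice use a longer list of real browser agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class RandomUserAgentMiddleware:
    """Scrapy calls process_request for every outgoing request;
    overwriting the header here rotates the user-agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue handling the request
```

To activate it, the middleware would be registered under DOWNLOADER_MIDDLEWARES in settings.py.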

Installation

Required software

  • python3
  • mysql

Required Python libraries:

  • scrapy
  • pillow (used to download images; you can skip it if you don't need image downloads)
  • DBUtils
  • pymysql
pip install scrapy
pip install pillow
pip install DBUtils
pip install pymysql

Once the required software and libraries are installed, follow these steps:

  1. Enter the mysql shell and create the database: create database hupu. The database does not have to be called hupu; if you choose another name, remember to update the project's config file accordingly.
  2. Run mysql_db/hupu_post.sql from this project to create the tables.
  3. Edit the database settings in db_config.py, including the user password and database name.
  4. To download images, set the IMAGES_STORE variable in settings.py and uncomment the HupuImgDownloadPipeline line in the ITEM_PIPELINES variable. The project does not download images by default.
  5. Run ./run.sh to start the crawler; on Windows, running scrapy crawl hupu_post in the hupu_spider directory does the same thing.
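For step 4, the settings.py changes would look roughly like this. The storage path and the pipeline module path are assumptions for illustration; match them to the project's actual settings.py:

```python
# settings.py fragment (illustrative) — enable image downloading.

# Directory where the image pipeline saves files (assumed path; pick your own).
IMAGES_STORE = "/data/hupu_images"

ITEM_PIPELINES = {
    # The module path below is an assumption; use the real path from settings.py.
    "hupu_spider.pipelines.HupuImgDownloadPipeline": 300,
}
```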

Troubleshooting

1. Running the crawler raises ModuleNotFoundError: No module named '_sqlite3'

This means sqlite3 is probably not installed on the server. You can run yum install sqlite* to install it. If your server has no yum, or yum fails to install sqlite (if so, try switching to another yum mirror first), download the sqlite source package and compile and install it yourself; there are many tutorials online. After installing sqlite, you need to recompile and reinstall Python for the change to take effect. A guide to compiling and installing Python: https://blog.csdn.net/u013332124/article/details/80643371
