
zhongjiajie / Autohome

Licence: other
Using Scrapy to crawl Autohome, storing the data in MongoDB; simple analysis and NLP coming soon


Projects that are alternatives of or similar to Autohome

Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (+247.83%)
Mutual labels:  scrapy
small-spider-project
Assorted everyday crawlers
Stars: ✭ 14 (-39.13%)
Mutual labels:  scrapy
JD Spider
👍 A JD.com spider (heavily commented, very friendly to crawler beginners)
Stars: ✭ 56 (+143.48%)
Mutual labels:  scrapy
scrapy-kafka-redis
Distributed crawling/scraping, Kafka And Redis based components for Scrapy
Stars: ✭ 45 (+95.65%)
Mutual labels:  scrapy
scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
Stars: ✭ 17 (-26.09%)
Mutual labels:  scrapy
web full stack application
show full stack technology applications : Scrapy + webservice[restful] + websocket + VueJS + MongoDB
Stars: ✭ 16 (-30.43%)
Mutual labels:  scrapy
scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
Stars: ✭ 17 (-26.09%)
Mutual labels:  scrapy
scrapy spider
No description or website provided.
Stars: ✭ 58 (+152.17%)
Mutual labels:  scrapy
Scrapy-SearchEngines
Bing, Google, and Baidu search-engine crawlers. Python 3.6 and Scrapy
Stars: ✭ 28 (+21.74%)
Mutual labels:  scrapy
ancient chinese
A Classical Chinese (文言文) dictionary: crawls a classical-Chinese dictionary site and builds a Kindle dictionary.
Stars: ✭ 48 (+108.7%)
Mutual labels:  scrapy
hupu spider
A spider for Hupu's Bxj (步行街) forum
Stars: ✭ 22 (-4.35%)
Mutual labels:  scrapy
easypoi
A simple, free, and efficient tool for collecting and analyzing Baidu Maps POI data.
Stars: ✭ 87 (+278.26%)
Mutual labels:  scrapy
ufc fight predictor
UFC bout winner prediction using neural nets.
Stars: ✭ 22 (-4.35%)
Mutual labels:  scrapy
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (+65.22%)
Mutual labels:  scrapy
aioScrapy
An asynchronous coroutine crawler framework based on asyncio and aiohttp. Stars welcome
Stars: ✭ 34 (+47.83%)
Mutual labels:  scrapy
scrapy-mysql-pipeline
scrapy mysql pipeline
Stars: ✭ 47 (+104.35%)
Mutual labels:  scrapy
scrapy-cloudflare-middleware
A Scrapy middleware to bypass the CloudFlare's anti-bot protection
Stars: ✭ 84 (+265.22%)
Mutual labels:  scrapy
ScrapyProject
A Scrapy project (MySQL + MongoDB, Douban Top 250 movies)
Stars: ✭ 18 (-21.74%)
Mutual labels:  scrapy
project pjx
Building a search engine with a distributed Python crawler
Stars: ✭ 42 (+82.61%)
Mutual labels:  scrapy
www job com
Crawls job postings from Lagou, BOSS Zhipin, Zhaopin, 51job, Ganji, 58.com, and other job boards
Stars: ✭ 47 (+104.35%)
Mutual labels:  scrapy

Autohome

Autohome is built on the Scrapy crawler framework. It performs targeted crawling of Autohome (汽车之家) articles and stores the scraped data in MongoDB. Simple analysis and NLP work on the scraped data will follow.

Runtime Environment

  • Python 2.7.10
  • MongoDB 3.2.10
  • Scrapy 1.3.2
  • pymongo 3.4.0
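
Pinned to the versions listed above, requirements.txt would look roughly like this (the exact pin style is an assumption; check the file in the repository):

```
Scrapy==1.3.2
pymongo==3.4.0
```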

Project Structure

│  readme.md
│  requirements.txt
│  scrapy.cfg
│
├─autohome
│  │  __init__.py
│  │  items.py
│  │  pipelines.py
│  │  settings.py
│  │
│  └─spiders
│          __init__.py
│          autohome_spider.py
│
└─support_file
    ├─architecture
    │      autohome_architecture.png
    │      autohome_architecture.vsdx
    │
    └─four_theme
            autohome_four_theme.png
            part1.png
            part2.png
            part3.png
            part4.png
  • autohome: the main source folder; most of Autohome's code lives here, and the spiders subfolder contains the spider itself
  • support_file: supporting files for Autohome, mainly the illustration images and their source files
  • scrapy.cfg: Autohome's Scrapy configuration file
  • requirements.txt: the third-party packages Autohome depends on

Usage

pip install -r requirements.txt

You may see the error "'pip' is not recognized as an internal or external command, operable program or batch file"; see here for how to fix it (typically by adding pip to your PATH)

  • Choose how the scraped data is stored. By default items are written both to MongoDB and to a local JSON file; select one by editing ITEM_PIPELINES in Autohome/autohome/settings.py (writing to both at once may cause high disk I/O)
  • From the Autohome root directory, run
scrapy crawl autohome_article

This runs the Autohome crawler; a log file, named after the time the crawl was started, is written to the Autohome root directory.
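The pipeline selection described above lives in Autohome/autohome/settings.py. A minimal sketch might look like the following; the priority numbers are illustrative assumptions, not the project's actual values:

```python
# Illustrative ITEM_PIPELINES fragment for Autohome/autohome/settings.py.
# Both pipelines are enabled here; comment one out to write only to
# MongoDB or only to the local JSON file (and reduce disk I/O).
ITEM_PIPELINES = {
    "autohome.pipelines.AutohomeJsonPipeline": 300,
    "autohome.pipelines.AutohomeMongodbPipeline": 400,
}
```

Lower numbers run first, so with both enabled the JSON pipeline sees each item before the MongoDB pipeline.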

Design Overview

Crawler Design Overview

  • Autohome crawls the Autohome (汽车之家) article pages. The crawler is divided into four themes: article summaries, article details, article comments, and the users who post the comments. Starting from the crawl's root node, the logic of the four parts is shown in support_file/four_theme/autohome_four_theme.png

  • Built on the Scrapy framework, Autohome crawls all four themes. The full flow is shown in support_file/architecture/autohome_architecture.png; the green parts are Scrapy's native framework logic, and the blue parts are the Autohome article crawler's logic

Features

  • Implemented entirely on the Scrapy framework
  • Two pipelines are defined: AutohomeJsonPipeline, which writes items to a local JSON file, and AutohomeMongodbPipeline, which stores them in MongoDB. Enable either (or both) via the ITEM_PIPELINES setting in settings.py
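A minimal sketch of what the JSON pipeline named above could look like (the class name comes from the README; the body, filename, and Python 3 syntax are illustrative assumptions, not the project's actual code):

```python
import json


class AutohomeJsonPipeline:
    """Writes each scraped item as one JSON line to a local file."""

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.file = open("autohome_items.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize the item and append it as a single JSON line;
        # ensure_ascii=False keeps Chinese text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

The `open_spider`/`process_item`/`close_spider` methods follow Scrapy's standard item-pipeline interface, so the class needs no Scrapy imports of its own.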

TODO

  • Write proxy and user-agent middleware
  • Improve the speed and completeness of crawling behind simulated login
  • Analyze the scraped structured data
  • Analyze the scraped unstructured data

Change Log

  • 20170531: migrated the crawler from a custom-module implementation to the Scrapy framework