
hackfengJam / ArticleSpider

Licence: other
Crawls Zhihu, Jobbole, and Lagou with Scrapy, and uses Elasticsearch and Django to build a search-engine website. See README_zh.md, which covers the implementation roadmap, the distributed crawler, and strategies for coping with anti-crawling measures.

Programming Languages

python

Projects that are alternatives of or similar to ArticleSpider

Filesensor
Crawler-based tool for dynamically detecting sensitive files
Stars: ✭ 227 (+567.65%)
Mutual labels:  scrapy
arche
Analyze scraped data
Stars: ✭ 49 (+44.12%)
Mutual labels:  scrapy
crawler
A collection of Python crawler projects
Stars: ✭ 29 (-14.71%)
Mutual labels:  scrapy
Spider job
Crawler for job-recruitment website data
Stars: ✭ 234 (+588.24%)
Mutual labels:  scrapy
pagser
Pagser is a simple, extensible, configurable library that parses and deserializes HTML pages into structs, based on goquery and struct tags, for Go crawlers
Stars: ✭ 82 (+141.18%)
Mutual labels:  scrapy
scrapy-rotated-proxy
A scrapy middleware to use rotated proxy ip list.
Stars: ✭ 22 (-35.29%)
Mutual labels:  scrapy
Spiderkeeper
admin ui for scrapy/open source scrapinghub
Stars: ✭ 2,562 (+7435.29%)
Mutual labels:  scrapy
Scrape-Finance-Data
My code for scraping financial data in Vietnam
Stars: ✭ 13 (-61.76%)
Mutual labels:  scrapy
lgcrawl
Crawls job listings across the entire Lagou site with Python, Scrapy, and Splash
Stars: ✭ 22 (-35.29%)
Mutual labels:  scrapy
asyncpy
A lightweight asynchronous, coroutine-based web crawler framework built with asyncio and aiohttp
Stars: ✭ 86 (+152.94%)
Mutual labels:  scrapy
Awesome crawl
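Tencent News, Zhihu topics, Weibo followers, a Tumblr crawler, Douyu bullet comments (danmaku), a Meizitu image crawler, distributed design, and more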
Stars: ✭ 246 (+623.53%)
Mutual labels:  scrapy
domains
World’s single largest Internet domains dataset
Stars: ✭ 461 (+1255.88%)
Mutual labels:  scrapy
scrapy helper
Dynamically configurable crawler
Stars: ✭ 84 (+147.06%)
Mutual labels:  scrapy
Ecommercecrawlers
Gitee repository: AJay13/ECommerceCrawlers. GitHub repository: DropsDevopsOrg/ECommerceCrawlers. Project showcase: http://wechat.doonsec.com
Stars: ✭ 3,073 (+8938.24%)
Mutual labels:  scrapy
vietnam-ecommerce-crawler
Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs
Stars: ✭ 28 (-17.65%)
Mutual labels:  scrapy
Scrapy Splash
Scrapy+Splash for JavaScript integration
Stars: ✭ 2,666 (+7741.18%)
Mutual labels:  scrapy
Scrapy-tripadvisor-reviews
Using scrapy to scrape tripadvisor in order to get users' reviews.
Stars: ✭ 24 (-29.41%)
Mutual labels:  scrapy
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+261.76%)
Mutual labels:  scrapy
scrapy-LBC
LeBonCoin spider built with Scrapy and ElasticSearch
Stars: ✭ 14 (-58.82%)
Mutual labels:  scrapy
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (+76.47%)
Mutual labels:  scrapy

ArticleSpider

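Crawl Zhihu, Jobbole (伯乐在线), and Lagou with Scrapy.

Note:

This is an intermediate-level project that assumes some knowledge of web crawling. If you are not familiar with the basic principles of crawlers, please study the fundamentals first. I have a companion repository, MyPythonForSpider, a single-threaded example that crawls Baidu Music data, which is better suited to beginners.

This project crawls Zhihu, Jobbole, and Lagou using the web scraping framework Scrapy.

Topics covered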

|-- Basics
|   |-- Regular expressions [jobbole.py](ArticleSpider/spiders/jobbole.py)
|   |-- XPath (ArticleSpider/spiders/jobbole.py)
|   |-- CSS selectors (ArticleSpider/spiders/*.py)
|   `-- ItemLoader
|-- Intermediate
|   |-- Handling image CAPTCHAs (ArticleSpider/spiders/lagou.login_after_captcha)
|   |-- IP request-rate limiting (ArticleSpider.middlewares.RandomProxyMiddleware)
|   `-- Random user-agent switching (ArticleSpider.middlewares.RandomUserAgentMiddleware)
|-- Advanced
|   |-- How Scrapy works
|   |   `-- Developing Scrapy-based middleware
|   |-- Crawling dynamic websites
|   |-- Integrating selenium into Scrapy
|   `-- Scrapy log configuration
`-- Follow-up (not reflected in this project; I will upload this part of the code later)
    |-- scrapy-redis
    |   |-- Principles of distributed crawlers
    |   |-- Analyzing the scrapy-redis source code
    |   `-- Integrating bloomfilter into scrapy-redis
    `-- Elasticsearch (ArticleSpider.pipelines.ElasticsearchPipeline;)(ArticleSpider.items.JobBoleArticleItem.save_to_es;)
        |-- Installing elasticsearch-rtf
        |-- Learning to use elasticsearch-head and kibana
        |-- Learning the Elasticsearch Python API: elasticsearch-dsl
        `-- Building a search website with Elasticsearch, the crawled data, and the Django framework (this part of the code will be uploaded later)
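
To illustrate the random user-agent switching item in the tree above: a Scrapy downloader middleware can implement a `process_request(request, spider)` method, and returning `None` from it lets Scrapy carry on processing the request. The following is a minimal sketch, not the repository's actual `RandomUserAgentMiddleware`. The class name `SketchRandomUserAgentMiddleware`, the `USER_AGENTS` pool, and the `rng` parameter are hypothetical names chosen for illustration, and the user-agent strings are placeholders.

```python
import random

# Hypothetical placeholder pool; a real project would list real browser strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "ExampleBrowser/3.0 (placeholder C)",
]


class SketchRandomUserAgentMiddleware:
    """Illustrative downloader middleware: picks a random User-Agent per request."""

    def __init__(self, user_agents=None, rng=None):
        # Both arguments are assumptions made so the sketch is easy to test.
        self.user_agents = list(user_agents or USER_AGENTS)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the User-Agent header so successive requests vary.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

In a Scrapy project, a class like this would be registered through the `DOWNLOADER_MIDDLEWARES` setting; how this repository wires up its own middleware is best checked in ArticleSpider/settings.py.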

PS: Before using this code, you need to create a MySQL database; see ArticleSpider/settings.py for details.
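
The follow-up topics listed in the tree above mention integrating a Bloom filter into scrapy-redis to deduplicate requests. The idea can be sketched in pure Python with an in-memory bit array. This is an illustration only, not scrapy-redis code: `SketchBloomFilter`, its parameters, and the index-salted SHA-256 hashing scheme are assumptions, and a Redis-backed variant would presumably keep the bits in Redis rather than in a `bytearray`.

```python
import hashlib


class SketchBloomFilter:
    """Illustrative in-memory Bloom filter for URL deduplication."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # A True result may be a false positive; a False result is definite.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


seen = SketchBloomFilter()
seen.add("https://example.com/page/1")
print("https://example.com/page/1" in seen)  # True: added items are always found
```

The trade-off is memory for certainty: the filter never misses a URL it has seen, but it can occasionally report an unseen URL as seen, so a crawler using it may skip a small fraction of pages.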
