hackfengJam / ArticleSpider

Licence: other

Crawls Zhihu, Jobbole, and Lagou with Scrapy, and uses Elasticsearch and Django to build a search-engine website. See README_zh.md for the implementation roadmap, the distributed crawler, and strategies for coping with anti-crawling measures.

Programming Languages

python


Projects that are alternatives of or similar to ArticleSpider

Filesensor

Crawler-based tool for dynamically detecting sensitive files

Stars: ✭ 227 (+567.65%)

Mutual labels: scrapy

arche

Analyze scraped data

Stars: ✭ 49 (+44.12%)

Mutual labels: scrapy

crawler

A collection of Python crawler projects

Stars: ✭ 29 (-14.71%)

Mutual labels: scrapy

Spider job

Crawler for job-site data

Stars: ✭ 234 (+588.24%)

Mutual labels: scrapy

pagser

Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler

Stars: ✭ 82 (+141.18%)

Mutual labels: scrapy

scrapy-rotated-proxy

A scrapy middleware to use rotated proxy ip list.

Stars: ✭ 22 (-35.29%)

Mutual labels: scrapy

Spiderkeeper

admin ui for scrapy/open source scrapinghub

Stars: ✭ 2,562 (+7435.29%)

Mutual labels: scrapy

Scrape-Finance-Data

My code for scraping financial data in Vietnam

Stars: ✭ 13 (-61.76%)

Mutual labels: scrapy

lgcrawl

python+scrapy+splash: crawls job listings across the whole Lagou site

Stars: ✭ 22 (-35.29%)

Mutual labels: scrapy

asyncpy

A lightweight asynchronous coroutine web crawler framework built with asyncio and aiohttp

Stars: ✭ 86 (+152.94%)

Mutual labels: scrapy

Awesome crawl

Crawlers for Tencent News, Zhihu topics, Weibo followers, Tumblr, Douyu bullet comments, and Meizitu, plus distributed design and more

Stars: ✭ 246 (+623.53%)

Mutual labels: scrapy

domains

World’s single largest Internet domains dataset

Stars: ✭ 461 (+1255.88%)

Mutual labels: scrapy

scrapy helper

Dynamically configurable crawler

Stars: ✭ 84 (+147.06%)

Mutual labels: scrapy

Ecommercecrawlers

Gitee repository: AJay13/ECommerceCrawlers. GitHub repository: DropsDevopsOrg/ECommerceCrawlers. Project showcase: http://wechat.doonsec.com

Stars: ✭ 3,073 (+8938.24%)

Mutual labels: scrapy

vietnam-ecommerce-crawler

Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs

Stars: ✭ 28 (-17.65%)

Mutual labels: scrapy

Scrapy Splash

Scrapy+Splash for JavaScript integration

Stars: ✭ 2,666 (+7741.18%)

Mutual labels: scrapy

Scrapy-tripadvisor-reviews

Using scrapy to scrape tripadvisor in order to get users' reviews.

Stars: ✭ 24 (-29.41%)

Mutual labels: scrapy

double-agent

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

Stars: ✭ 123 (+261.76%)

Mutual labels: scrapy

scrapy-LBC

LeBonCoin spider built with Scrapy and ElasticSearch

Stars: ✭ 14 (-58.82%)

Mutual labels: scrapy

Web-Iota

Iota is a web scraper which can find all of the images and links/suburls on a webpage

Stars: ✭ 60 (+76.47%)

Mutual labels: scrapy


ArticleSpider

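Crawl Zhihu, Jobbole, and Lagou with Scrapy

Note:

This is an intermediate-level project and assumes some knowledge of web crawling. If you are not comfortable with basic crawler principles, please learn the fundamentals first. I have a companion repository, MyPythonForSpider, a single-threaded example that crawls Baidu Music data, which is better suited to beginners.

This project crawls Zhihu, Jobbole, and Lagou using the web scraping framework Scrapy.

Topics covered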

|-- Basics
|   |-- Regular expressions (ArticleSpider/spiders/jobbole.py)
|   |-- XPath (ArticleSpider/spiders/jobbole.py)
|   |-- CSS selectors (ArticleSpider/spiders/*.py)
|   `-- ItemLoader
|-- Intermediate
|   |-- Handling image captchas (ArticleSpider/spiders/lagou.login_after_captcha)
|   |-- IP rate limiting (ArticleSpider.middlewares.RandomProxyMiddleware)
|   `-- Random User-Agent switching (ArticleSpider.middlewares.RandomUserAgentMiddleware)
|-- Advanced
|   |-- How Scrapy works
|   |   `-- Developing Scrapy middleware
|   |-- Crawling dynamic websites
|   |-- Integrating Selenium into Scrapy
|   `-- Scrapy log configuration
`-- Follow-up (not yet reflected in this project; I will upload this code later)
    |-- scrapy-redis
    |   |-- Principles of distributed crawling
    |   |-- Reading the scrapy-redis source code
    |   `-- Integrating a Bloom filter into scrapy-redis
    `-- Elasticsearch (ArticleSpider.pipelines.ElasticsearchPipeline; ArticleSpider.items.JobBoleArticleItem.save_to_es)
        |-- Installing elasticsearch-rtf
        |-- Learning elasticsearch-head and Kibana
        |-- Learning the Elasticsearch Python API: elasticsearch-dsl
        `-- Building a search website with Elasticsearch, the crawled data, and Django (this code will be uploaded later)
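
As an illustration of the random User-Agent switching topic in the outline, here is a minimal sketch of what such a downloader middleware can look like. It is not the project's actual RandomUserAgentMiddleware: the class name `RandomUserAgentSketch`, the `FakeRequest` stand-in, and the User-Agent strings are all hypothetical. The sketch only relies on the shape Scrapy expects from a downloader middleware, a `process_request(request, spider)` method that returns `None` to let processing continue, so it runs without Scrapy installed.

```python
import random


class RandomUserAgentSketch:
    """Downloader-middleware-shaped class that picks a User-Agent per request.

    Hypothetical illustration, not the project's RandomUserAgentMiddleware.
    """

    def __init__(self, user_agents, rng=None):
        if not user_agents:
            raise ValueError("user_agents must not be empty")
        self.user_agents = list(user_agents)
        # An injectable random source keeps the behaviour reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Overwrite the header so each outgoing request gets a freshly chosen value.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None tells Scrapy to continue processing this request.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


# Hypothetical User-Agent strings, for illustration only.
EXAMPLE_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]

middleware = RandomUserAgentSketch(EXAMPLE_AGENTS, rng=random.Random(0))
request = FakeRequest("https://example.com/")
middleware.process_request(request)
print(request.headers["User-Agent"])
```

In a real Scrapy project a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, and the list of User-Agent strings would typically come from settings or a dedicated library rather than a hard-coded list.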

PS: Before using this code, create a MySQL database; see ArticleSpider/settings.py for details.
