zhangslob / Docs

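Source for the book "Data Collection: From Getting Started to Giving Up" (数据采集从入门到放弃). Contents: introduction to web crawlers, the job market, and interview questions for crawler engineers; introduction to the HTTP protocol; using Requests; introduction to the XPath parser; MongoDB and MySQL; multithreaded crawlers; introduction to Scrapy; introduction to Scrapy-redis; deploying with Docker; managing a Docker cluster with Nomad; querying Docker logs with EFK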

Programming Languages

python

139335 projects - #7 most used programming language

Labels

docker mysql http mongodb crawler scrapy requests xpath

Projects that are alternatives of or similar to Docs

Python Spider

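Douban Movie Top 250; scraping JSON data from Douyu and scraping photos of women; Taobao; Youyuan; a CrawlSpider that scrapes basic profile information from the Hongniang matchmaking site, plus distributed crawling of Hongniang with storage in Redis; small crawler demos; Selenium; scraping Duodian; developing APIs with Django; scraping Youyuan profile data; simulated login to Zhihu, GitHub, and Tuchong; scraping the entire Duodian store; scraping the article history of WeChat official accounts; scraping articles shared in WeChat groups or by WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts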

Stars: ✭ 615 (+421.19%)

Mutual labels: scrapy, xpath, mysql, mongodb

python-crawler

A repository for learning web crawling, suitable for complete beginners and friendly to newcomers

Stars: ✭ 37 (-68.64%)

Mutual labels: requests, xpath, scrapy

Pythonstudy

Python related technologies used in work: crawler, data analysis, timing tasks, RPC, page parsing, decorator, built-in functions, Python objects, multi-threading, multi-process, asynchronous, redis, mongodb, mysql, openstack, etc.

Stars: ✭ 103 (-12.71%)

Mutual labels: xpath, mysql, mongodb

Spider python

Python crawlers

Stars: ✭ 557 (+372.03%)

Mutual labels: scrapy, xpath, requests

Easy Scraping Tutorial

Simple but useful Python web scraping tutorial code.

Stars: ✭ 583 (+394.07%)

Mutual labels: crawler, scrapy, requests

Scrapingoutsourcing

ScrapingOutsourcing focuses on sharing crawler code and tries to post one update per week

Stars: ✭ 164 (+38.98%)

Mutual labels: crawler, scrapy, requests

Bilibili member crawler

Bilibili user crawler. Yay, it's a crawler!

Stars: ✭ 115 (-2.54%)

Mutual labels: crawler, mysql, requests

Price Monitor

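JD.com product price monitoring: watches the prices of products chosen by the user and sends an email/WeChat alert when a price drops. Tech: Python crawler, IP proxy pool, scraping JS APIs, scraping pages with Selenium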

Stars: ✭ 634 (+437.29%)

Mutual labels: crawler, mysql, requests

Distributed Multi User Scrapy System With A Web Ui

Django based application that allows creating, deploying and running Scrapy spiders in a distributed manner

Stars: ✭ 88 (-25.42%)

Mutual labels: scrapy, mongodb

Scrapoxy

Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now, you can crawl without thinking about blacklisting!

Stars: ✭ 1,322 (+1020.34%)

Mutual labels: crawler, scrapy

Decryptlogin

APIs for logging in to some websites using requests.

Stars: ✭ 1,861 (+1477.12%)

Mutual labels: crawler, requests

Go Sniffer

🔎 Sniffs and parses MySQL, Redis, HTTP, MongoDB, and other protocols. Captures the database requests made by a project and parses them into the corresponding statements.

Stars: ✭ 1,281 (+985.59%)

Mutual labels: mysql, mongodb

Weibo Album Crawler

Multithreaded crawler for full-size images from Sina Weibo albums.

Stars: ✭ 83 (-29.66%)

Mutual labels: crawler, requests

Adminer Custom

Customizations for Adminer, the best database management tool written in PHP.

Stars: ✭ 99 (-16.1%)

Mutual labels: mysql, mongodb

Taiwan News Crawlers

Scrapy-based Crawlers for news of Taiwan

Stars: ✭ 83 (-29.66%)

Mutual labels: crawler, scrapy

Spring Boot 2.x Examples

Spring Boot 2.x code examples

Stars: ✭ 104 (-11.86%)

Mutual labels: mysql, mongodb

Laravel Log To Db

Custom Laravel and Lumen 5.6+ Log channel handler that can store log events to SQL or MongoDB databases. Uses Laravel/Monolog native logging functionality.

Stars: ✭ 76 (-35.59%)

Mutual labels: mysql, mongodb

Dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output based on dotnet core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but can be extended to fit your custom requirements. Medium link : https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

Stars: ✭ 100 (-15.25%)

Mutual labels: crawler, scrapy

Crawler

Crawlers, HTTP proxies, simulated login!

Stars: ✭ 106 (-10.17%)

Mutual labels: crawler, scrapy

Graphquery

GraphQuery is a query language and execution engine tied to any backend service.

Stars: ✭ 112 (-5.08%)

Mutual labels: crawler, xpath

Data Collection: From Getting Started to Giving Up

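Introduction

This book covers everything I currently know about web crawling. It is closer to a checklist of my own skills: I go through every item on it carefully, with the goal of spreading knowledge.

Read online: Data Collection: From Getting Started to Giving Up

It is roughly divided into the following main topics: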

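Introduction to web crawlers, the job market, and interview questions for crawler engineers
Introduction to the HTTP protocol
Using Requests
Introduction to the XPath parser
MongoDB and MySQL
Multithreaded crawlers
Introduction to Scrapy
Introduction to Scrapy-redis
Deploying with Docker
Managing a Docker cluster with Nomad
Querying Docker logs with EFK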

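I may add a few other topics, depending mostly on my mood. For example:

Simple CAPTCHA handling (I am still learning this myself)
iOS reverse engineering
Chrome breakpoint debugging and encryption analysis
Using Docker
Selenium, Appium, and pyppeteer
Bloom filters
Packet capture with Charles and mitmproxy
Approaches to whole-site crawling
Flask development
Spark-related topics
Crawlers in other languages such as Go and Java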

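Each of these points takes a lot of time to study. I hope we can make progress together.

I will not cover basic Python syntax; for that I recommend the BeginnersGuide and the documentation.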

Development environment

Python 3.x
macOS or Linux recommended
PyCharm for development

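About the title

First, let me explain the title: why "from getting started to giving up"?

To begin with, this is not a joke; it is how I feel right now. I have been doing crawler work for almost two years, after moving over from an operations role. I think my relationship with crawling has gone through three stages:

Liking it. At the beginning, before I had run into real business requirements, and thanks to the hype on Zhihu (you know what I mean), I was fascinated by crawlers. With every new website I opened, I wanted to try scraping it and see whether it had any anti-scraping measures. This stage lasted until I started working on real projects, and then it gradually turned into the next one. One thing I want to say here: there must be others who, like me, are very enthusiastic about crawlers and enjoy scraping data from websites. The key issue is the data itself. Often the data is incomplete or is not kept over time, and without ongoing data analysis the data you scrape has no value. That is what I took away from several long-term projects.
Indifference. Turning a hobby into a job is painful. When I worked in operations I really envied crawler engineers and thought they must be so happy. When I actually started doing the work myself, it was fine at first, but after a year my mindset changed. There are many reasons, which I will go into when I have time. The song for this stage is "Red Rose" (红玫瑰): what you cannot have keeps you restless, while those who are favored have nothing to fear. Make of that what you will.
Giving up. Stages two and three are felt at the same time: because I am less interested in crawlers than I used to be, I gradually want to start doing other things. The description on my blog is "data collection, data processing, machine learning". Data collection is only the first step; data processing and machine learning are the real focus (the well-paid jobs) and the direction with a future. That is why I am learning Spark and Scala, hoping that at some point I can change careers to work with and study "data" for real.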

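About me

My name is 小歪 (Xiao Wai). My WeChat official account is Python爬虫与算法进阶 (Python Crawlers and Advanced Algorithms), and I also go by 小歪 on Zhihu.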

Stars: ✭ 118
