
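Douban Movie Top 250; scraping JSON data from Douyu and scraping photos of beautiful women; Taobao; Youyuan; using CrawlSpider to scrape basic profile information from the Hongniang dating site, plus distributed scraping of Hongniang with Redis storage; small crawler demos; Selenium; scraping Dmall (多点); building APIs with Django; scraping Youyuan profile information; simulated Zhihu login; simulated GitHub login; simulated Tuchong login; scraping the entire Dmall store site; scraping the article history of WeChat official accounts; scraping articles shared in WeChat groups or by WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts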

Stars: ✭ 615 (+408.26%)

Mutual labels: spider, selenium

weibo topic

Collects Weibo topic keywords and personal Weibo posts, and deletes Weibo posts in one click. Selenium obtains the cookie; requests handles the rest.

Stars: ✭ 28 (-76.86%)

Mutual labels: spider, selenium

SchweizerMesser

🎯 A collection of hands-on Python 3 web crawlers and data analysis | Dangdang | NetEase Cloud Music | unsplash | Pizza Hut | Maoyan |

Stars: ✭ 89 (-26.45%)

Mutual labels: spider, selenium

Netdiscovery

NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.

Stars: ✭ 573 (+373.55%)

Mutual labels: spider, selenium

Infospider

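INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together. It aims to help users retrieve their own data safely and quickly; the code is open source and the process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, generating a WeChat Moments album, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.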

Stars: ✭ 5,984 (+4845.45%)

Mutual labels: spider, selenium

Alipayspider Scrapy

AlipaySpider on Scrapy (uses chrome driver); an Alipay crawler based on Scrapy

Stars: ✭ 70 (-42.15%)

Mutual labels: spider, selenium

Seleniumcrawler

An example using Selenium webdrivers for python and Scrapy framework to create a web scraper to crawl an ASP site

Stars: ✭ 117 (-3.31%)

Mutual labels: selenium

Csharp.webdriver

Browser test automation using Selenium WebDriver in C#

Stars: ✭ 115 (-4.96%)

Mutual labels: selenium

Dingdian

A novel-reading website built with a Python crawler and Flask

Stars: ✭ 115 (-4.96%)

Mutual labels: spider

Seleniumjavacourse

Selenium Java Code for all selenium sessions - WebDriver, TestNG, POI, etc...

Stars: ✭ 115 (-4.96%)

Mutual labels: selenium

30 Days Of Python

Learn Python for the next 30 (or so) Days.

Stars: ✭ 1,748 (+1344.63%)

Mutual labels: selenium

Decryptlogin

APIs for logging in to some websites using requests.

Stars: ✭ 1,861 (+1438.02%)

Mutual labels: spider

Query Selector Shadow Dom

querySelector that can pierce Shadow DOM roots without knowing the path through nested shadow roots. Useful for automated testing of Web Components. Production use is not advised, this is for test environments/tools such as Web Driver, Playwright, Puppeteer

Stars: ✭ 115 (-4.96%)

Mutual labels: selenium

Bilibili member crawler

A Bilibili user crawler. Yay, it's a crawler!

Stars: ✭ 115 (-4.96%)

Mutual labels: spider

Douban Movie

A Golang crawler that scrapes the Douban Movie Top 250

Stars: ✭ 114 (-5.79%)

Mutual labels: spider


Pinduoduo Crawler

Updates

Selenium crawling gets detected

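After I published this blog post, many people tried the code on my GitHub. I later found that Pinduoduo had added new anti-crawling measures and that my code was being filtered out by them. Being a curious person, I naturally had to dig into it.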

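First, selenium + geckodriver accesses the site by driving Firefox, so that the target site is fooled into treating the traffic as human clicks. But when I ran my code again, manual clicks and Selenium no longer behaved the same: under Selenium, an error page kept appearing. After some research, I learned that Selenium exposes a number of predefined JavaScript variables (fingerprint strings) while it runs, for example "window.navigator.webdriver". Outside Selenium its value is undefined (or false in current browsers), while under Selenium its value is true (the original post includes a screenshot of the value printed in the Chrome console under Selenium). There are many other such variables as well; the original post links to an article that lists them.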

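So let's restate the approach. We use Selenium to simulate clicks while connected through a proxy, and we capture the product data at the proxy. Pinduoduo, for its part, uses a JS file to decide whether we are running Selenium, sends the result to the server, and controls the returned content accordingly. It is hard to find out how that result is sent to the server. But we can intercept the JS file at the proxy and modify it, removing the part that checks the variables Selenium predefines in JS.

So I added the following to the new code: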

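# Only rewrite the verification script served by Pinduoduo.
if 'react_psnl_verification_' in response.request.path:
    js_body = str(response.get_body_data(), 'utf-8')
    # Rename the property so the script's Selenium check reads an undefined value.
    js_body = js_body.replace("navigator.webdriver", "navigator.qwerasdfzxcv")
    response.set_body_data(bytes(js_body, 'utf-8'))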
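The same rewrite can be factored into a pure function, which makes it possible to check the behavior without running the proxy. This is only a minimal sketch: the function name `neutralize_webdriver_check` and its parameters are illustrative and not part of the original project. Only the `react_psnl_verification_` path marker and the `navigator.webdriver` replacement come from the snippet above.

```python
def neutralize_webdriver_check(
    path: str,
    body: bytes,
    marker: str = "react_psnl_verification_",
    placeholder: str = "navigator.qwerasdfzxcv",
) -> bytes:
    """Rename navigator.webdriver in an intercepted JS body.

    Hypothetical helper: bodies whose request path does not contain
    `marker` are returned unchanged.
    """
    if marker not in path:
        return body
    text = body.decode("utf-8")
    # The renamed property does not exist, so the script's check sees undefined.
    text = text.replace("navigator.webdriver", placeholder)
    return text.encode("utf-8")
```

Inside the proxy hook, the result would be passed to `response.set_body_data(...)` as in the original snippet.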

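Not all reviews can be crawled

For products with many reviews, Pinduoduo displays only some of them, so this project can only crawl the displayable reviews of all known products.

I crawled Pinduoduo data for a recent project. So far I have collected more than 900,000 product records.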

Goals

All products.
All reviews.
The associated user information.
The information the project needs.

Completed

All products
Reviews

Dependencies

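Pinduoduo has no web version, so what I crawl is the categories in the mobile search bar. Because it is the mobile site, the API that returns products is visible, but I could not crack the anticontent field in the URL, so the URLs cannot be replayed. Given these constraints, I did not use a framework such as scrapy.

Products are crawled with selenium combined with a proxy: the product information in the API responses is captured at the proxy.

The proxy is @qiyeboy's open-source project BaseProxy.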
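As a rough illustration of the capture step, the proxy hook could hand each intercepted response body to a parser like the one below. This is a sketch under assumptions, not the project's actual code: the function name `extract_goods` and the `goods_list` key are hypothetical, since the real response schema is not described here.

```python
import json
from typing import Any


def extract_goods(body: bytes, list_key: str = "goods_list") -> list[dict[str, Any]]:
    """Return the product entries found in an intercepted JSON response body.

    `list_key` is an assumed field name. An empty list is returned when the
    body is not valid JSON or holds no list under that key.
    """
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return []
    if not isinstance(payload, dict):
        return []
    goods = payload.get(list_key)
    if not isinstance(goods, list):
        return []
    # Keep only well-formed entries; skip anything that is not a JSON object.
    return [item for item in goods if isinstance(item, dict)]
```

Returning an empty list for anything unexpected keeps the proxy hook from failing on non-product responses such as images or scripts.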

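Problems

CAPTCHA problem

Testing showed that a CAPTCHA appears once the number of visits reaches a certain level. Ordinary OCR recognition did not work well, so I chose to use an online CAPTCHA-solving platform. After optimizing the access pattern, a CAPTCHA appears about once every five or six minutes.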


Stars: ✭ 121
