dytttf / Antispider

antispider

A record of the anti-crawler measures I have run into and how I worked around them. Discussion is welcome!

No restrictions on second-level directories


The first visit goes through an intermediate JS redirect page, probably to verify the IP


Page load time is extremely long


Discuz forum board API


The Referer header must pass verification
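A minimal sketch of sending a Referer header, using only the standard library. The URLs, the `build_request` helper, and the User-Agent value are placeholders for illustration, not taken from this project; the request is only constructed here, not sent.

```python
import urllib.request


def build_request(url: str, referer: str) -> urllib.request.Request:
    """Build a GET request that carries a Referer header (nothing is sent)."""
    headers = {
        # The page the site expects the visitor to have come from.
        "Referer": referer,
        # Placeholder value; some sites check this header as well.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical URLs, for illustration only.
req = build_request(
    "https://example.com/forum/thread-1.html",
    "https://example.com/forum/",
)
print(req.get_header("Referer"))
```

Passing `req` to `urllib.request.urlopen` would send it. Which Referer value a given site accepts has to be found by inspecting a real browser session.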


JS redirect: changde.py


Encrypted cookie verification (Tianyancha): test_down_tianyancha.py


A ridiculous CAPTCHA, with verification failing 99% of the time

http://xygs.gsaic.gov.cn/gsxygs/pub!list.do


Douban FM and other Douban sites: HTTPS, loosely checked cookie parameters: test_down_douban.py

After the JS runs, a _dsign parameter is appended to the URL: get_dsign.py

The page shows "Security check in progress..." and, after 5 seconds, JS redirects to the normal page

Text is replaced with CSS styles

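Limits on request rate and proxy type

  • https://m.guazi.com/bj/dazhong/
  • The request rate must stay below 0.5 requests per second
  • If you use a proxy, HTTP URLs need an HTTP proxy and HTTPS URLs need an HTTPS proxy; mixing them is the same as using no proxy at all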
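The two constraints above can be sketched roughly as follows. This is an illustration, not the project's code: `Throttle`, `proxy_for`, `PROXIES`, and the proxy addresses are all hypothetical, and the sketch assumes the proxy rule means "pick the proxy entry whose key matches the URL scheme", which is how a scheme-keyed `proxies` mapping behaves in the `requests` library.

```python
import time
from urllib.parse import urlsplit

# At most 0.5 requests per second means at least 2 seconds between requests.
MIN_INTERVAL = 2.0

# Hypothetical proxy addresses; replace with real ones.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "https://127.0.0.1:8443",
}


def proxy_for(url: str) -> dict:
    """Return a proxies mapping whose key matches the URL's scheme."""
    scheme = urlsplit(url).scheme
    if scheme not in PROXIES:
        raise ValueError(f"no proxy configured for scheme {scheme!r}")
    return {scheme: PROXIES[scheme]}


class Throttle:
    """Block so that consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        waited = 0.0
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited


throttle = Throttle(MIN_INTERVAL)
print(proxy_for("https://m.guazi.com/bj/dazhong/"))
```

In a crawl loop, one would call `throttle.wait()` before each request and pass `proxy_for(url)` as the proxy setting of whatever HTTP client is in use. The `clock` and `sleep` parameters exist only so the throttle can be tested without real delays.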

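Cleverly exploiting how \r behaves across platforms to give crawler developers a headache

  • On Linux, \r is interpreted as a carriage return. If a page uses \r as its line break, it displays correctly in the browser and on Windows, but when the text is printed on Linux, whatever follows each \r overwrites the characters before it, so the output looks much shorter than what the web page shows. If you are not familiar with what \r means, expect a very, very long debugging session.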
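One way to sidestep the \r problem is to normalize line endings before printing or storing the text. The sketch below is a hedged illustration; `normalize_newlines` and the sample string are made up for this example.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF pairs and bare CR characters into LF."""
    # Replace CRLF first so a pair does not turn into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Sample scraped text that uses a bare \r as its line break.
raw = "first line\rsecond\rthird"

# repr() shows the hidden \r characters instead of letting the terminal act on them.
print(repr(raw))
print(normalize_newlines(raw))
```

When debugging output that looks truncated, printing `repr(text)` is a quick way to check whether carriage returns are hiding part of it.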

Crawler tip: obtaining the MP4 download URL for Xigua Video