
WuLC / Thesaurusspider

License: MIT
A Python crawler that downloads the lexicon files of the Sogou, Baidu, and QQ input methods; it can be used to build vocabularies for different industries.

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Thesaurusspider

Bilibili member crawler
A crawler for Bilibili user data. (Yay, it's a crawler!)
Stars: ✭ 115 (+17.35%)
Mutual labels:  multithreading, crawler
Examples Of Web Crawlers
Some very interesting Python crawler examples that are friendly to beginners, mainly crawling Taobao, Tmall, WeChat, Douban, QQ, and other sites.
Stars: ✭ 10,724 (+10842.86%)
Mutual labels:  multithreading, crawler
Pixivcrawleriii
A Python 3 crawler for Pixiv's top rankings and for all artworks of a given illustrator
Stars: ✭ 38 (-61.22%)
Mutual labels:  multithreading, crawler
Taiwan News Crawlers
Scrapy-based crawlers for Taiwanese news sites
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Ngx Papaparse
Papa Parse wrapper for Angular
Stars: ✭ 83 (-15.31%)
Mutual labels:  multithreading
Capture Thread
Lock-free framework for loggers, tracers, and mockers in multithreaded C++ programs.
Stars: ✭ 93 (-5.1%)
Mutual labels:  multithreading
Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (-1.02%)
Mutual labels:  crawler
Napajs
Napa.js: a multi-threaded JavaScript runtime
Stars: ✭ 8,945 (+9027.55%)
Mutual labels:  multithreading
Gf Secrets
Secret and/or credential patterns used for gf.
Stars: ✭ 96 (-2.04%)
Mutual labels:  crawler
Ktspeechcrawler
Automatically constructing corpus for automatic speech recognition from YouTube videos
Stars: ✭ 92 (-6.12%)
Mutual labels:  crawler
Proxy Pool
A proxy IP pool service for crawlers; other crawler programs can fetch proxies through a REST API.
Stars: ✭ 91 (-7.14%)
Mutual labels:  crawler
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+1171.43%)
Mutual labels:  crawler
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now, you can crawl without thinking about blacklisting!
Stars: ✭ 1,322 (+1248.98%)
Mutual labels:  crawler
Tumblr crawler
A Tumblr parsing site
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Lightcrawler
Crawl a website and run it through Google lighthouse
Stars: ✭ 1,339 (+1266.33%)
Mutual labels:  crawler
Acm Statistics
An online tool (crawler) to analyze users performance in online judges (coding competition websites). Supported OJ: POJ, HDU, ZOJ, HYSBZ, CodeForces, UVA, ICPC Live Archive, FZU, SPOJ, Timus (URAL), LeetCode_CN, CSU, LibreOJ, 洛谷, 牛客OJ, Lutece (UESTC), AtCoder, AIZU, CodeChef, El Judge, BNUOJ, Codewars, UOJ, NBUT, 51Nod, DMOJ, VJudge
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Ti.worker
Use Multi-Threading / Worker Threads in Appcelerator Titanium.
Stars: ✭ 95 (-3.06%)
Mutual labels:  multithreading
So 5 5
SObjectizer: it's all about in-process message dispatching!
Stars: ✭ 87 (-11.22%)
Mutual labels:  multithreading
Weibo Album Crawler
A multithreaded crawler for full-size images in Sina Weibo albums.
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Hotnewsanalysis
Analyzing trending news topics with text-mining techniques
Stars: ✭ 93 (-5.1%)
Mutual labels:  crawler

Sogou, Baidu, and QQ Input Method Lexicon Crawler

A crawler, implemented in Python, that downloads the lexicons of the Sogou, Baidu, and QQ input methods. The contents of the individual folders are as follows.

Each input method has both a single-threaded and a multithreaded implementation of the crawl. The multithreaded version is far faster than the single-threaded one; a thread count of 5-10 is recommended, or simply keep the default of 5.

The crawler is built entirely on Python's built-in modules (urllib2, Queue, re, threading, etc.), with no third-party dependencies. To run it, set baseDir in the main function of singleThreadDownload.py (single-threaded download) or multiThreadDownload.py (multithreaded download) to your own download path; note that baseDir must not end with a /. A sketch of the multithreaded variant follows.
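As a rough illustration of the approach described above, here is a minimal sketch of a queue-based multithreaded download (Python 2, standard library only). Everything except baseDir, the module names, and the recommended thread count is hypothetical and not taken from the repository's actual code:

```python
# -*- coding: utf-8 -*-
# Minimal sketch: workers pull (url, filename) pairs off a shared queue.
import os
import Queue
import threading
import urllib2

baseDir = 'D:/inputMethodDicts'  # your download path, no trailing '/'
THREAD_NUM = 5                   # 5-10 threads recommended

task_queue = Queue.Queue()       # filled with (url, filename) pairs

def worker():
    while True:
        try:
            url, filename = task_queue.get_nowait()
        except Queue.Empty:
            break                # queue drained, worker exits
        data = urllib2.urlopen(url, timeout=10).read()
        with open(os.path.join(baseDir, filename), 'wb') as f:
            f.write(data)
        task_queue.task_done()

# enqueue the (url, filename) pairs parsed from the category pages first,
# then start and join the workers
threads = [threading.Thread(target=worker) for _ in range(THREAD_NUM)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```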

If any file fails to download or any page fails to parse, a download log is written to the root of the download directory recording the URLs of those files and pages, which makes debugging easier.
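Continuing the sketch above, failure logging could look like the following; the log file name and helper names are assumptions, not the repository's actual identifiers:

```python
# Record every URL that fails to download, then dump them to a log file
# in the download root for later debugging.
failed_urls = []   # list.append is atomic under CPython's GIL

def safe_download(url, filename):
    try:
        data = urllib2.urlopen(url, timeout=10).read()
        with open(os.path.join(baseDir, filename), 'wb') as f:
            f.write(data)
    except Exception:
        failed_urls.append(url)

def write_download_log():
    if failed_urls:
        with open(os.path.join(baseDir, 'download_log.txt'), 'w') as log:
            log.write('\n'.join(failed_urls))
```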

For the specifics of the implementation, see this article.

The downloaded lexicon files are not plain text; each input method uses its own custom binary format. For decoding them into text format, see this repository.

Update 2017.01.13

The Baidu Input Method lexicon pages have been redesigned: download links are now produced by JavaScript, and anti-crawler measures have been added (the server returns 500 and 502 errors). Status codes 500 and 502 normally indicate internal server errors, but some sites also return them deliberately to crawler-like traffic, and the Baidu lexicon site does exactly that.

Workarounds:

1. Although the download link is obtained through JavaScript, inspecting the Request URL in the HTTP request headers sent when a download link is clicked shows that the actual download link is still a static one, https://shurufa.baidu.com/dict_innerid_download?innerid=, where the value after innerid= is the lexicon file's ID, which can be extracted from the page.

2. The 500/502 anti-crawler responses can be handled simply by re-issuing the request: after a few 500 or 502 responses, Baidu eventually returns 200, so the server is not actually failing; these status codes appear to be returned with some probability purely to deter crawlers. A sketch combining both workarounds follows this list.
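A minimal sketch of both workarounds together (Python 2). The regex used to extract the lexicon ID from the page is an assumption; inspect the live page to confirm the actual attribute name:

```python
import re
import time
import urllib2

DOWNLOAD_URL = 'https://shurufa.baidu.com/dict_innerid_download?innerid=%s'

def fetch(url, max_retries=5):
    """Retry on 500/502, which Baidu returns probabilistically."""
    for _ in range(max_retries):
        try:
            return urllib2.urlopen(url, timeout=10).read()
        except urllib2.HTTPError as e:
            if e.code in (500, 502):
                time.sleep(1)   # back off briefly, then retry
                continue
            raise
    return None

def download_from_page(page_html):
    # the attribute name in this regex is hypothetical
    for innerid in re.findall(r'innerid="(\d+)"', page_html):
        data = fetch(DOWNLOAD_URL % innerid)
        # ... write data to disk as in the sketches above ...
```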

Note: because Baidu Input Method employs these anti-crawler measures, the request's User-Agent is no longer fixed. To lower the chance of 500/502 responses, it is generated with the third-party user-agent library, which must be installed first with easy_install user-agent or pip install user-agent.
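A minimal sketch of attaching a generated User-Agent to each request; generate_user_agent is the user-agent package's public API, while the wrapper function is illustrative:

```python
# A fresh User-Agent per request via the third-party user-agent package
# (easy_install user-agent / pip install user-agent).
import urllib2
from user_agent import generate_user_agent

def open_with_random_ua(url):
    req = urllib2.Request(url, headers={'User-Agent': generate_user_agent()})
    return urllib2.urlopen(req, timeout=10)
```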
