
WuLC / Thesaurusspider

License: MIT
A Python crawler that downloads the lexicon files of the Sogou, Baidu, and QQ input methods; it can be used to build vocabularies for different industries.

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Thesaurusspider

Bilibili member crawler
A crawler for Bilibili user data. (Yay, it's a crawler!)
Stars: ✭ 115 (+17.35%)
Mutual labels:  multithreading, crawler
Examples Of Web Crawlers
Some very interesting Python crawler examples that are friendly to beginners, mainly crawling Taobao, Tmall, WeChat, Douban, QQ, and other sites.
Stars: ✭ 10,724 (+10842.86%)
Mutual labels:  multithreading, crawler
Pixivcrawleriii
A Python 3 crawler for Pixiv's top rankings and for all artworks of a given illustrator
Stars: ✭ 38 (-61.22%)
Mutual labels:  multithreading, crawler
Taiwan News Crawlers
Scrapy-based crawlers for Taiwanese news sites
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Ngx Papaparse
Papa Parse wrapper for Angular
Stars: ✭ 83 (-15.31%)
Mutual labels:  multithreading
Capture Thread
Lock-free framework for loggers, tracers, and mockers in multithreaded C++ programs.
Stars: ✭ 93 (-5.1%)
Mutual labels:  multithreading
Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (-1.02%)
Mutual labels:  crawler
Napajs
Napa.js: a multi-threaded JavaScript runtime
Stars: ✭ 8,945 (+9027.55%)
Mutual labels:  multithreading
Gf Secrets
Secret and/or credential patterns used for gf.
Stars: ✭ 96 (-2.04%)
Mutual labels:  crawler
Ktspeechcrawler
Automatically constructing corpus for automatic speech recognition from YouTube videos
Stars: ✭ 92 (-6.12%)
Mutual labels:  crawler
Proxy Pool
A proxy IP pool service for crawlers; other crawler programs can fetch proxies through a REST API.
Stars: ✭ 91 (-7.14%)
Mutual labels:  crawler
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+1171.43%)
Mutual labels:  crawler
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now, you can crawl without thinking about blacklisting!
Stars: ✭ 1,322 (+1248.98%)
Mutual labels:  crawler
Tumblr crawler
A Tumblr parsing site
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Lightcrawler
Crawl a website and run it through Google lighthouse
Stars: ✭ 1,339 (+1266.33%)
Mutual labels:  crawler
Acm Statistics
An online tool (crawler) to analyze users performance in online judges (coding competition websites). Supported OJ: POJ, HDU, ZOJ, HYSBZ, CodeForces, UVA, ICPC Live Archive, FZU, SPOJ, Timus (URAL), LeetCode_CN, CSU, LibreOJ, 洛谷, 牛客OJ, Lutece (UESTC), AtCoder, AIZU, CodeChef, El Judge, BNUOJ, Codewars, UOJ, NBUT, 51Nod, DMOJ, VJudge
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Ti.worker
Use Multi-Threading / Worker Threads in Appcelerator Titanium.
Stars: ✭ 95 (-3.06%)
Mutual labels:  multithreading
So 5 5
SObjectizer: it's all about in-process message dispatching!
Stars: ✭ 87 (-11.22%)
Mutual labels:  multithreading
Weibo Album Crawler
A multithreaded crawler for full-size images in Sina Weibo albums.
Stars: ✭ 83 (-15.31%)
Mutual labels:  crawler
Hotnewsanalysis
Analyzing trending news topics with text-mining techniques
Stars: ✭ 93 (-5.1%)
Mutual labels:  crawler

Sogou, Baidu, and QQ Input Method Lexicon Crawler

A crawler, implemented in Python, that downloads the lexicons of the Sogou, Baidu, and QQ input methods. The contents of the individual folders are as follows.

Each input method has both a single-threaded and a multithreaded implementation of the crawl. The multithreaded version is far faster than the single-threaded one; a thread count of 5-10 is recommended, or simply keep the default of 5.

The crawler is built entirely on Python's built-in modules (urllib2, Queue, re, threading, etc.), with no third-party dependencies. To run it, set baseDir in the main function of singleThreadDownload.py (single-threaded download) or multiThreadDownload.py (multithreaded download) to your own download path; note that baseDir must not end with a /. A sketch of the multithreaded variant follows.
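As a rough illustration of the approach described above, here is a minimal sketch of a queue-based multithreaded download (Python 2, standard library only). Everything except baseDir, the module names, and the recommended thread count is hypothetical and not taken from the repository's actual code:

```python
# -*- coding: utf-8 -*-
# Minimal sketch: workers pull (url, filename) pairs off a shared queue.
import os
import Queue
import threading
import urllib2

baseDir = 'D:/inputMethodDicts'  # your download path, no trailing '/'
THREAD_NUM = 5                   # 5-10 threads recommended

task_queue = Queue.Queue()       # filled with (url, filename) pairs

def worker():
    while True:
        try:
            url, filename = task_queue.get_nowait()
        except Queue.Empty:
            break                # queue drained, worker exits
        data = urllib2.urlopen(url, timeout=10).read()
        with open(os.path.join(baseDir, filename), 'wb') as f:
            f.write(data)
        task_queue.task_done()

# enqueue the (url, filename) pairs parsed from the category pages first,
# then start and join the workers
threads = [threading.Thread(target=worker) for _ in range(THREAD_NUM)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```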

If any file fails to download or any page fails to parse, a download log is written to the root of the download directory recording the URLs of those files and pages, which makes debugging easier.
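Continuing the sketch above, failure logging could look like the following; the log file name and helper names are assumptions, not the repository's actual identifiers:

```python
# Record every URL that fails to download, then dump them to a log file
# in the download root for later debugging.
failed_urls = []   # list.append is atomic under CPython's GIL

def safe_download(url, filename):
    try:
        data = urllib2.urlopen(url, timeout=10).read()
        with open(os.path.join(baseDir, filename), 'wb') as f:
            f.write(data)
    except Exception:
        failed_urls.append(url)

def write_download_log():
    if failed_urls:
        with open(os.path.join(baseDir, 'download_log.txt'), 'w') as log:
            log.write('\n'.join(failed_urls))
```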

For the specifics of the implementation, see this article.

The downloaded lexicon files are not plain text; each input method uses its own custom binary format. For decoding them into text format, see this repository.

Update 2017.01.13

The Baidu Input Method lexicon pages have been redesigned: download links are now produced by JavaScript, and anti-crawler measures have been added (the server returns 500 and 502 errors). Status codes 500 and 502 normally indicate internal server errors, but some sites also return them deliberately to crawler-like traffic, and the Baidu lexicon site does exactly that.

Workarounds:

1. Although the download link is obtained through JavaScript, inspecting the Request URL in the HTTP request headers sent when a download link is clicked shows that the actual download link is still a static one, https://shurufa.baidu.com/dict_innerid_download?innerid=, where the value after innerid= is the lexicon file's ID, which can be extracted from the page.

2. The 500/502 anti-crawler responses can be handled simply by re-issuing the request: after a few 500 or 502 responses, Baidu eventually returns 200, so the server is not actually failing; these status codes appear to be returned with some probability purely to deter crawlers. A sketch combining both workarounds follows this list.
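A minimal sketch of both workarounds together (Python 2). The regex used to extract the lexicon ID from the page is an assumption; inspect the live page to confirm the actual attribute name:

```python
import re
import time
import urllib2

DOWNLOAD_URL = 'https://shurufa.baidu.com/dict_innerid_download?innerid=%s'

def fetch(url, max_retries=5):
    """Retry on 500/502, which Baidu returns probabilistically."""
    for _ in range(max_retries):
        try:
            return urllib2.urlopen(url, timeout=10).read()
        except urllib2.HTTPError as e:
            if e.code in (500, 502):
                time.sleep(1)   # back off briefly, then retry
                continue
            raise
    return None

def download_from_page(page_html):
    # the attribute name in this regex is hypothetical
    for innerid in re.findall(r'innerid="(\d+)"', page_html):
        data = fetch(DOWNLOAD_URL % innerid)
        # ... write data to disk as in the sketches above ...
```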

Note: because Baidu Input Method employs these anti-crawler measures, the request's User-Agent is no longer fixed. To lower the chance of 500/502 responses, it is generated with the third-party user-agent library, which must be installed first with easy_install user-agent or pip install user-agent.
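A minimal sketch of attaching a generated User-Agent to each request; generate_user_agent is the user-agent package's public API, while the wrapper function is illustrative:

```python
# A fresh User-Agent per request via the third-party user-agent package
# (easy_install user-agent / pip install user-agent).
import urllib2
from user_agent import generate_user_agent

def open_with_random_ua(url):
    req = urllib2.Request(url, headers={'User-Agent': generate_user_agent()})
    return urllib2.urlopen(req, timeout=10)
```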
