
wen-fei / Cnkispider

A spider for CNKI patent content, for study and communication only; not for business use.

Projects that are alternatives of or similar to Cnkispider

Capturer
Capture pictures from websites such as Sina, Lofter, Huaban, and others
Stars: ✭ 76 (-35.04%)
Mutual labels:  scrapy
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, built on .NET Core. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-14.53%)
Mutual labels:  scrapy
Programer log
The latest updates are here [my programmer's log]
Stars: ✭ 112 (-4.27%)
Mutual labels:  scrapy
Email Extractor
The main functionality is to extract all the emails from one or several URLs
Stars: ✭ 81 (-30.77%)
Mutual labels:  scrapy
Proxy server crawler
An awesome public proxy server crawler based on the Scrapy framework
Stars: ✭ 94 (-19.66%)
Mutual labels:  scrapy
Scrapyd Cluster On Heroku
Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks. DEMO 👉
Stars: ✭ 106 (-9.4%)
Mutual labels:  scrapy
Image Downloader
Download images from Google, Bing, and Baidu.
Stars: ✭ 1,173 (+902.56%)
Mutual labels:  scrapy
Patentcrawler
A Scrapy patent crawler (no longer maintained)
Stars: ✭ 114 (-2.56%)
Mutual labels:  scrapy
Experiments
Some research experiments
Stars: ✭ 95 (-18.8%)
Mutual labels:  scrapy
Wswp
Code for the second edition Web Scraping with Python book by Packt Publications
Stars: ✭ 112 (-4.27%)
Mutual labels:  scrapy
Taiwan News Crawlers
Scrapy-based crawlers for Taiwanese news
Stars: ✭ 83 (-29.06%)
Mutual labels:  scrapy
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now you can crawl without worrying about blacklisting!
Stars: ✭ 1,322 (+1029.91%)
Mutual labels:  scrapy
Crawler
Crawler, HTTP proxy, simulated login!
Stars: ✭ 106 (-9.4%)
Mutual labels:  scrapy
Olxscraper
OLX Scraper in Python Scrapy
Stars: ✭ 76 (-35.04%)
Mutual labels:  scrapy
Scrala
Unmaintained 🐳 ☕️ 🕷 Scala crawler(spider) framework, inspired by scrapy, created by @gaocegege
Stars: ✭ 113 (-3.42%)
Mutual labels:  scrapy
Scrapy Examples
Some Scrapy and web.py examples
Stars: ✭ 71 (-39.32%)
Mutual labels:  scrapy
Decoration Design Crawler
Crawlers for the Tubatu and Guju home-decoration websites
Stars: ✭ 105 (-10.26%)
Mutual labels:  scrapy
Maria Quiteria
Backend for collecting and publishing public data 📜
Stars: ✭ 115 (-1.71%)
Mutual labels:  scrapy
Weibo hot search
A Weibo crawler: scrapes the Weibo trending-search list on a daily schedule, keeping a record of internet memory.
Stars: ✭ 113 (-3.42%)
Mutual labels:  scrapy
Hive
Lots of spiders
Stars: ✭ 110 (-5.98%)
Mutual labels:  scrapy

CNKISpider

A CNKI patent crawler, for study and communication only; not for commercial use.

Discovering a new crawl entry point

Today a classmate suddenly told me he had already crawled more than 1,000,000 patents (we need 2014's, about 1.9 million+ in total). On asking further, I learned that the URLs of CNKI patent detail pages follow a regular pattern.

For example:

http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx?DBName=SCPD2014&FileName=CN203968251U&QueryID=28&CurRec=2

For a given patent's URL, we only need to vary FileName=CN203968251U. The value after the = sign is the patent publication number, also called the patent document number, which is composed of "country code + classification number + serial number + kind code". For example, CN1340998A denotes China's invention patent No. 340998 (from Baidu Baike).
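Under that pattern, building a detail-page URL from a publication number is a simple string format. This is a sketch: the helper name is hypothetical, and the fixed QueryID/CurRec values are copied from the example URL above on the assumption that they do not affect the page returned.

```python
def detail_url(pub_no):
    # Build a CNKI patent detail-page URL from a publication number
    # such as "CN203968251U". DBName=SCPD2014 selects the 2014 patent
    # database; QueryID and CurRec are copied from the example URL
    # and are assumed not to matter.
    return ("http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"
            f"?DBName=SCPD2014&FileName={pub_no}&QueryID=28&CurRec=2")
```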

If we want to crawl all patents from 2014, we can search for a patent published on January 1, 2014 (a very early 2014 publication number) and one published on December 31, 2014 (a very late one), then iterate over the numbers in between; that covers the vast majority of the patents we need.

Here, CN is fixed, and the trailing letter is the patent kind code; China uses only the three codes A, S, and U.

Therefore, this avoids both crawling the URL list pages (which are heavily protected against crawlers) and the complex CAPTCHA problem: we simply build a loop and crawl the detail pages directly.

Tools used in the project

The framework is Scrapy 1.3, running on Python 3.6.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].