
wen-fei / Cnkispider

A spider for CNKI patent content, for study and communication only; not for business use.

Projects that are alternatives of or similar to Cnkispider

Capturer
Capture pictures from websites such as Sina, Lofter, Huaban, and others
Stars: ✭ 76 (-35.04%)
Mutual labels:  scrapy
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, built on .NET Core. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-14.53%)
Mutual labels:  scrapy
Programer log
The latest updates are here [my programmer's log]
Stars: ✭ 112 (-4.27%)
Mutual labels:  scrapy
Email Extractor
The main functionality is to extract all the emails from one or several URLs
Stars: ✭ 81 (-30.77%)
Mutual labels:  scrapy
Proxy server crawler
An awesome public proxy server crawler based on the Scrapy framework
Stars: ✭ 94 (-19.66%)
Mutual labels:  scrapy
Scrapyd Cluster On Heroku
Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks. DEMO 👉
Stars: ✭ 106 (-9.4%)
Mutual labels:  scrapy
Image Downloader
Download images from Google, Bing, and Baidu.
Stars: ✭ 1,173 (+902.56%)
Mutual labels:  scrapy
Patentcrawler
A Scrapy patent crawler (no longer maintained)
Stars: ✭ 114 (-2.56%)
Mutual labels:  scrapy
Experiments
Some research experiments
Stars: ✭ 95 (-18.8%)
Mutual labels:  scrapy
Wswp
Code for the second edition Web Scraping with Python book by Packt Publications
Stars: ✭ 112 (-4.27%)
Mutual labels:  scrapy
Taiwan News Crawlers
Scrapy-based crawlers for Taiwanese news
Stars: ✭ 83 (-29.06%)
Mutual labels:  scrapy
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now you can crawl without worrying about blacklisting!
Stars: ✭ 1,322 (+1029.91%)
Mutual labels:  scrapy
Crawler
Crawler, HTTP proxy, simulated login!
Stars: ✭ 106 (-9.4%)
Mutual labels:  scrapy
Olxscraper
OLX Scraper in Python Scrapy
Stars: ✭ 76 (-35.04%)
Mutual labels:  scrapy
Scrala
Unmaintained 🐳 ☕️ 🕷 Scala crawler(spider) framework, inspired by scrapy, created by @gaocegege
Stars: ✭ 113 (-3.42%)
Mutual labels:  scrapy
Scrapy Examples
Some Scrapy and web.py examples
Stars: ✭ 71 (-39.32%)
Mutual labels:  scrapy
Decoration Design Crawler
Crawlers for the Tubatu and Guju home-decoration websites
Stars: ✭ 105 (-10.26%)
Mutual labels:  scrapy
Maria Quiteria
Backend for collecting and publishing public data 📜
Stars: ✭ 115 (-1.71%)
Mutual labels:  scrapy
Weibo hot search
A Weibo crawler: scrapes the Weibo trending-search list on a daily schedule, keeping a record of internet memory.
Stars: ✭ 113 (-3.42%)
Mutual labels:  scrapy
Hive
Lots of spiders
Stars: ✭ 110 (-5.98%)
Mutual labels:  scrapy

CNKISpider

A CNKI patent crawler, for study and communication only; not for commercial use.

Discovering a new crawl entry point

Today a classmate suddenly told me he had already crawled more than 1,000,000 patents (we need 2014's, about 1.9 million+ in total). On asking further, I learned that the URLs of CNKI patent detail pages follow a regular pattern.

For example:

http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx?DBName=SCPD2014&FileName=CN203968251U&QueryID=28&CurRec=2

For a given patent's URL, we only need to vary FileName=CN203968251U. The value after the = sign is the patent publication number, also called the patent document number, which is composed of "country code + classification number + serial number + kind code". For example, CN1340998A denotes China's invention patent No. 340998 (from Baidu Baike).
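Under that pattern, building a detail-page URL from a publication number is a simple string format. This is a sketch: the helper name is hypothetical, and the fixed QueryID/CurRec values are copied from the example URL above on the assumption that they do not affect the page returned.

```python
def detail_url(pub_no):
    # Build a CNKI patent detail-page URL from a publication number
    # such as "CN203968251U". DBName=SCPD2014 selects the 2014 patent
    # database; QueryID and CurRec are copied from the example URL
    # and are assumed not to matter.
    return ("http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"
            f"?DBName=SCPD2014&FileName={pub_no}&QueryID=28&CurRec=2")
```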

If we want to crawl all patents from 2014, we can search for a patent published on January 1, 2014 (a very early 2014 publication number) and one published on December 31, 2014 (a very late one), then iterate over the numbers in between; that covers the vast majority of the patents we need.

Here, CN is fixed, and the trailing letter is the patent kind code; China uses only the three codes A, S, and U.

Therefore, this avoids both crawling the URL list pages (which are heavily protected against crawlers) and the complex CAPTCHA problem: we simply build a loop and crawl the detail pages directly.

Tools used in the project

The framework is Scrapy 1.3, running on Python 3.6.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].