vict-cn / crawlBaiduWenku

Licence: MIT license

This may be the most complete project for crawling Baidu Wenku

Programming Languages

python


Projects that are alternatives of or similar to crawlBaiduWenku

gospider

⚡ Lightweight Golang spider framework

Stars: ✭ 183 (+190.48%)

Mutual labels: spider

php-crawler

🕷️ A simple crawler (spider) written in PHP just for fun, with zero dependencies

Stars: ✭ 39 (-38.1%)

Mutual labels: spider

DeadPool

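This project is a crawler application built on celery as its core framework. Crawl tasks can be added flexibly, and crawlers for multiple sites can run at the same time. All components natively support large-scale concurrency and distribution, and combined with celery's native distributed invocation this enables large-scale concurrent crawling.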

Stars: ✭ 38 (-39.68%)

Mutual labels: spider

weixin article spiders

A spider program for weixin, built with Express & cheerio

Stars: ✭ 33 (-47.62%)

Mutual labels: spider

article-spider

Article collection tool

Stars: ✭ 130 (+106.35%)

Mutual labels: spider

dcard-spider

A spider on Dcard. Strong and speedy.

Stars: ✭ 91 (+44.44%)

Mutual labels: spider

Web-Iota

Iota is a web scraper which can find all of the images and links/suburls on a webpage

Stars: ✭ 60 (-4.76%)

Mutual labels: spider

Novel-crawler

A novel crawler written in Python

Stars: ✭ 75 (+19.05%)

Mutual labels: spider

gathertool

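gathertool is a scripting-style development library for golang, intended to make programs for its target scenarios faster to develop; it includes a lightweight crawler library, an API-testing and stress-testing library, a DB operations library, and more.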

Stars: ✭ 36 (-42.86%)

Mutual labels: spider

glyphhanger

Your web font utility belt. It can subset web fonts. It can find unicode-ranges for you automatically. It makes julienne fries.

Stars: ✭ 422 (+569.84%)

Mutual labels: spider

grapy

Grapy, a fast high-level web crawling framework for Python 3.3 or later, based on asyncio.

Stars: ✭ 18 (-71.43%)

Mutual labels: spider

ZSpider

A crawler program based on Electron

Stars: ✭ 37 (-41.27%)

Mutual labels: spider

spider-mzitu

Mzitu (妹子图)

Stars: ✭ 13 (-79.37%)

Mutual labels: spider

crawler-chrome-extensions

Chrome extensions commonly used by crawler developers

Stars: ✭ 53 (-15.87%)

Mutual labels: spider

SpiderCard

Spider Solitaire for Mac

Stars: ✭ 29 (-53.97%)

Mutual labels: spider

main project

Projects including a Node.js-based web chat room and crawler, a Vue music player, and an admin system with a PHP backend

Stars: ✭ 49 (-22.22%)

Mutual labels: spider

tuchong Spider

⭐ Tuchong crawler

Stars: ✭ 16 (-74.6%)

Mutual labels: spider

bet365-websocket-crawler

bet365 bot: real-time match scores and real-time odds from bet365

Stars: ✭ 67 (+6.35%)

Mutual labels: spider

ant

A web crawler for Go

Stars: ✭ 264 (+319.05%)

Mutual labels: spider

sede

Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

Stars: ✭ 83 (+31.75%)

Mutual labels: spider


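Crawling Baidu Wenku

Necessity is the mother of invention

You want to download a document, but you don't want to spend money or points.

If you feel the same way, read on. It only takes a few minutes, and afterwards you can get roughly 99% of Wenku documents for free.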

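Usage

1. Download this project (optional, of course)

git clone https://github.com/vict-cn/BaiduWenkuSpider

2. Install the dependencies (skip this step if you already have these libraries)

First use cmd to change to the directory that contains requirements.txt

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt

3. Download PhantomJS (bundled with this project)

Then add it to your environment variables.

Newer versions of selenium no longer support PhantomJS, so we install an older version of selenium here.

4. Run crawlBaiduWenku.py

At this point you should have what you wanted (it works in most cases). If the result is not good enough, keep reading.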
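Step 3 depends on two things being true at once: a `phantomjs` executable is reachable through the environment variables, and the installed selenium is old enough to still include the PhantomJS driver. The sketch below checks both before you run the crawler. It is not part of the project; the function names are hypothetical, and the version cutoff assumes the PhantomJS driver was dropped in selenium 4.0.

```python
import shutil


def selenium_supports_phantomjs(version: str) -> bool:
    """Return True if this selenium version should still ship the PhantomJS driver.

    Assumption: the driver was removed in selenium 4.0, so only releases
    with a major version below 4 are treated as compatible.
    """
    major = int(version.split(".")[0])
    return major < 4


def phantomjs_on_path() -> bool:
    """Return True if a `phantomjs` executable can be found on PATH."""
    return shutil.which("phantomjs") is not None


if __name__ == "__main__":
    try:
        import selenium
    except ImportError:
        print("selenium is not installed")
    else:
        ok = selenium_supports_phantomjs(selenium.__version__)
        print(f"selenium {selenium.__version__}: PhantomJS driver expected = {ok}")
    print(f"phantomjs found on PATH = {phantomjs_on_path()}")
```

If either check fails, fix the environment first; the crawler itself is unlikely to give a clearer error message.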

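Instructions (if you would rather not read text, look at the examples in the example folder, or go straight to "What each file does")

Crawling TXT files gives the best results. You can use either parse_to_txt.py or parse_to_doc.py; the latter sometimes works better than the former. Either one generates a txt/doc file. See the TXT example.

Crawling PPT files generates a folder containing all of the PPT's slide images. To build a PPT directly, run pic_to_ppt.py; to build a PDF, run pic_to_pdf.py. See the PPT example.

Crawling PDF files is slower. It generates a folder containing all of the PDF's page images plus a merged PDF file (the resolution is not very high and still needs improvement). See the PDF example.

Crawling XLS files: if the XLS contains tables (isn't an XLS a spreadsheet by definition, so isn't everything in it a table? Sometimes it really isn't), it generates a folder of table images (the images are sometimes fragmented). To produce an xls file you need Baidu's table-recognition API (a link will be added here). If the XLS is all text, run Screenshot_to_pdf.py to generate images (this is a bit slow), then use pic_to_txt.py to generate a txt file. See the XLS example.

Crawling DOC files is the hard one. If the document is plain text, run parse_to_doc.py directly (the result is quite good). If it has only a few images, run parse_to_doc.py to generate a doc file and then touch it up by hand. If it has many images, consider running Screenshot_to_pdf.py to generate screenshots. See the DOC example.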
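The per-type advice above can be summarized as a small lookup table. This is only a reading aid, not code from the project: the `RECOMMENDED` table and `recommended_scripts` helper are hypothetical, and listing crawlBaiduWenku.py as the first step for PPT, PDF, and XLS is an assumption, since the notes do not name the script that produces the image folder.

```python
from typing import Dict, List

# Scripts to try for each source type, in the order the notes suggest them.
# Assumption: crawlBaiduWenku.py is the script that produces the image folders.
RECOMMENDED: Dict[str, List[str]] = {
    "txt": ["parse_to_txt.py", "parse_to_doc.py"],
    "ppt": ["crawlBaiduWenku.py", "pic_to_ppt.py", "pic_to_pdf.py"],
    "pdf": ["crawlBaiduWenku.py"],
    "xls": ["crawlBaiduWenku.py", "pic_to_xls.py",
            "Screenshot_to_pdf.py", "pic_to_txt.py"],
    "doc": ["parse_to_doc.py", "Screenshot_to_pdf.py"],
}


def recommended_scripts(doc_type: str) -> List[str]:
    """Return the scripts worth trying for a document type such as 'ppt' or '.PPT'."""
    key = doc_type.lower().lstrip(".")
    if key not in RECOMMENDED:
        raise ValueError(f"unsupported document type: {doc_type!r}")
    # Return a copy so callers cannot mutate the table by accident.
    return list(RECOMMENDED[key])


if __name__ == "__main__":
    for kind in RECOMMENDED:
        print(kind, "->", ", ".join(recommended_scripts(kind)))
```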

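What each file does

crawlBaiduWenku.py

Crawls TXT / PDF / DOC / XLS / PPT files and generates the corresponding files.

parse_to_txt.py

Crawls TXT / PDF / DOC / XLS files and generates a txt file.

Works best on TXT files.

parse_to_doc.py

Crawls TXT / PDF / DOC / XLS files and generates a doc file.

Works best on DOC and TXT files (sometimes the TXT results are remarkably good).

Screenshot_to_pdf.py

Crawls TXT / PDF / DOC / XLS / PPT files and generates the corresponding screenshots plus a merged pdf file.

Works for every file type. The drawback is that the resolution is not great, but the result is still legible.

pic_to_pdf.py

Converts the images in a folder into a pdf file.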
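When a folder of page images is merged into one file, page order matters: a plain alphabetical sort puts `10.png` before `2.png`. The sketch below shows one way to order the images numerically before merging. It is an illustration only; the helper names and the assumption that pages are named with numbers (for example `1.png`, `2.png`) are not taken from pic_to_pdf.py.

```python
import re
from pathlib import Path
from typing import List

# Assumption: page images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def natural_key(name: str) -> list:
    """Split a name into text and integer chunks so '10.png' sorts after '2.png'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def ordered_images(folder: str) -> List[Path]:
    """Return the image files in `folder`, sorted in natural (numeric) page order."""
    paths = [p for p in Path(folder).iterdir()
             if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES]
    return sorted(paths, key=lambda p: natural_key(p.name))
```

The resulting list can then be handed to whatever library does the actual merge.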

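pic_to_xls.py

Converts the table images in a folder into an xls file. Requires baidu-aip.

parse_to_pic.py

Fetches all images on each page (some documents may fail to parse).

pic_to_txt.py

Writes the text found in images to a txt file (including positions). Requires baidu-aip.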
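To illustrate what "text including positions" can look like, the sketch below turns an OCR response into position-tagged lines. It makes no network calls. The response shape (a `words_result` list whose items carry `words` and a `location` with `left` and `top`) is an assumption modeled on Baidu's general OCR responses with position information, and the function names and output format are hypothetical rather than what pic_to_txt.py actually writes.

```python
from typing import Dict, List, Tuple


def words_with_positions(response: Dict) -> List[Tuple[int, int, str]]:
    """Extract (top, left, text) triples, ordered top-to-bottom then left-to-right."""
    rows = []
    for item in response.get("words_result", []):
        location = item.get("location", {})
        rows.append((location.get("top", 0), location.get("left", 0), item["words"]))
    return sorted(rows)


def to_txt(response: Dict) -> str:
    """Render one 'top,left<TAB>text' line per recognized text fragment."""
    return "\n".join(f"{top},{left}\t{text}"
                     for top, left, text in words_with_positions(response))


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape.
    sample = {
        "words_result": [
            {"words": "second line", "location": {"left": 10, "top": 50}},
            {"words": "first line", "location": {"left": 10, "top": 5}},
        ]
    }
    print(to_txt(sample))
```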

pic_to_ppt.py

Generates a ppt file containing all of the images in a folder (by default, one image fills one slide).

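Disclaimer

Unless you actually download the file, it is hard to get a copy identical to the original. There is also no fixed recipe for crawling: for example, crawling a TXT does not have to go through parse_to_txt.py, and plenty of other approaches work.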
