Algebra-FUN / WeReadScan

Licence: other
A crawler that scans books you have purchased on WeRead (微信读书) and downloads them as local PDFs

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to WeReadScan

Pulsar
Turn large Web sites into tables and charts using simple SQLs.
Stars: ✭ 100 (-63.37%)
Mutual labels:  web-crawler, selenium
SchweizerMesser
🎯 Python 3 web-crawler practice and data-analysis collection | Dangdang | NetEase Cloud Music | unsplash | Pizza Hut | Maoyan |
Stars: ✭ 89 (-67.4%)
Mutual labels:  web-crawler, selenium
ant
A web crawler for Go
Stars: ✭ 264 (-3.3%)
Mutual labels:  web-crawler
Sneakers Project
Using Selenium, Neha scraped data about 35 top selling sneakers of Nike and Adidas from stockx.com. She used this data to draw insights about sneaker resales.
Stars: ✭ 32 (-88.28%)
Mutual labels:  selenium
selenium-php
PHP Selenium data collection
Stars: ✭ 18 (-93.41%)
Mutual labels:  selenium
frontend testing
Repository containing sample code used in a Frontend Testing workshop
Stars: ✭ 14 (-94.87%)
Mutual labels:  selenium
Monocle
PowerShell Web Automation module, made to make automating websites easier
Stars: ✭ 47 (-82.78%)
Mutual labels:  selenium
phoenix.webui.framework
A WebDriver-based web UI automation testing framework
Stars: ✭ 118 (-56.78%)
Mutual labels:  selenium
TeslaPy
A Python module to use the Tesla Motors Owner API
Stars: ✭ 216 (-20.88%)
Mutual labels:  selenium
scrape-youtube-channel-videos-url
This Python script is used to scrape all the video links from a youtube channel.
Stars: ✭ 34 (-87.55%)
Mutual labels:  selenium
selenium-grid-docker-swarm
web scraping in parallel with Selenium Grid and Docker
Stars: ✭ 32 (-88.28%)
Mutual labels:  selenium
Raspagem-de-dados-para-iniciantes
Data scraping for beginners using Scrapy and other basic libraries
Stars: ✭ 113 (-58.61%)
Mutual labels:  web-crawler
justtestlah
Dynamic test framework for web and mobile applications
Stars: ✭ 43 (-84.25%)
Mutual labels:  selenium
carina
Carina automation framework: Web, Mobile, API, DB etc testing...
Stars: ✭ 652 (+138.83%)
Mutual labels:  selenium
RARBG-scraper
With Selenium headless browsing and CAPTCHA solving
Stars: ✭ 38 (-86.08%)
Mutual labels:  selenium
frameworkium-examples
Sample project which utilises frameworkium-core, a framework for writing maintainable Selenium and REST API tests and facilitates reporting and integration to JIRA.
Stars: ✭ 52 (-80.95%)
Mutual labels:  selenium
fBrowser
Helpful Selenium functions to make web-scraping easier and faster
Stars: ✭ 16 (-94.14%)
Mutual labels:  selenium
Python-Studies
All studies about python
Stars: ✭ 56 (-79.49%)
Mutual labels:  selenium
telenium
Automation for Kivy Application
Stars: ✭ 56 (-79.49%)
Mutual labels:  selenium
robotframework-seleniumtestability
Extension for SeleniumLibrary that provides manual and automatic waiting for asyncronous events like fetch, xhr, etc.
Stars: ✭ 34 (-87.55%)
Mutual labels:  selenium

WeReadScan

About

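A crawler library that scans books on WeRead (微信读书) and converts them into local PDFs.

Why it was developed

WeRead is, admittedly, a very good platform. It has one obvious shortcoming, though: after buying a book, users can only read it or add text annotations inside the WeRead application ╮(╯▽╰)╭, which falls well short of what a purchased paper book allows. For example, the author is used to taking notes in notebook-style apps on an iPad, but WeRead does not support handwritten notes with a pencil.

Since WeRead does not provide this, the only option was to solve it myself. After 2 days of development, this crawler script was the result, and I can now happily take handwritten notes o( ̄▽ ̄)ブ

Related versions

At Sec-ant's suggestion, and with reference to his solution weread-scraper, the most important part of that script (the part that retrieves #preRenderContent) was integrated to produce the WeReadScan-HTML version. It can automatically fetch the HTML of multiple books directly, which is more efficient.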

Get started

WeReadScan (original version)

pip install WeReadScan

WeReadScan-HTML(html-scrape version)

pip install WeReadScan-HTML

To use the WeReadScan-HTML version, see https://github.com/Algebra-FUN/WeReadScan/tree/html-variant

This project depends on selenium, so a basic understanding of selenium is required.

Demo

Without further ado, here is the code:

from selenium.webdriver import Chrome, ChromeOptions
from WeReadScan import WeRead

# Important! Configure the webdriver to run headless
chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument('disable-infobars')
chrome_options.add_argument('log-level=3')

# Start the webdriver (--headless)
headless_driver = Chrome(options=chrome_options)

# Start in debug mode, which keeps the PNG cache
with WeRead(headless_driver,debug=True) as weread:
    # Important! Log in
    weread.login()
    # Crawl the book at the given URL and save it to the current folder
    weread.scan2pdf('https://weread.qq.com/web/reader/2c632ef071a486a92c60226')

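Sample scan result: (image omitted)

Notes:

  1. The webdriver must be started in headless mode
  2. The complete book can only be scanned after logging in; without logging in, you can still scan the parts that do not need to be unlocked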

API Reference

WeRead

WeReadScan.WeRead(headless_driver)

A WeRead web agent used for scanning books

Args

  • headless_driver: a WebDriver instance configured to run headless

Returns

  • WeReadInstance

Usage

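from selenium.webdriver import Chrome, ChromeOptions
from WeReadScan import WeRead

chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')
# Pass the options via `options=`, as in the demo above
headless_driver = Chrome(options=chrome_options)
weread = WeRead(headless_driver)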

Login

WeReadScan.WeRead.login(wait_turns=15)

Displays a QR code for logging in to WeRead

Args

  • wait_turns: number of waiting rounds allowed for the login QR code to be scanned

Usage

weread.login()
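
The wait_turns argument suggests a bounded polling loop: check whether the QR code has been scanned, wait, and give up after a fixed number of rounds. The following is only a minimal sketch of that idea, not WeReadScan's actual implementation; wait_for, check, and interval are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_for(check: Callable[[], bool], wait_turns: int = 15, interval: float = 1.0) -> bool:
    """Poll `check` up to `wait_turns` times; return True as soon as it succeeds.

    Hypothetical helper: `check` stands in for "has the QR code been scanned?".
    """
    for turn in range(wait_turns):
        if check():
            return True
        # Sleep between rounds, but not after the final failed round.
        if turn < wait_turns - 1:
            time.sleep(interval)
    return False
```

Under this reading, a larger wait_turns simply gives the user more time to scan the code before the attempt is abandoned.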

Scan2pdf

WeReadScan.WeRead.scan2pdf(self, book_url, save_at='.', binary_threshold=95, quality=90, show_output=True,font_size_index=1)

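Scans a WeRead book, converts it to PDF, and saves it locally

Args

Parameter         Type  Default   Description
book_url          str   required  URL of the book to scan
save_at           str   '.'       Where to save the output
binary_threshold  int   200       Threshold used for binarization
quality           int   100       Quality of the scanned PDF
show_output       bool  True      Whether to open the generated PDF when the method finishes
font_size_index   int   1         Font size setting (corresponds to WeRead's font-size levels)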
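
To make binary_threshold more concrete, here is a small sketch of threshold binarization on grayscale values. It assumes a 0-255 scale on which values strictly above the threshold become white and everything else becomes black; the direction and strictness of the comparison that WeReadScan actually uses are not stated in this README, and binarize is a hypothetical name.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 200) -> List[List[int]]:
    """Map each 0-255 grayscale value to 255 (white) if it exceeds `threshold`, else 0 (black).

    Assumption: strict '>' comparison; the library may behave differently.
    """
    return [[255 if value > threshold else 0 for value in row] for row in gray]


page = [
    [250, 210, 40],
    [199, 200, 201],
]
print(binarize(page, threshold=200))  # → [[255, 255, 0], [0, 0, 255]]
```

Under this assumption, a lower threshold turns more light-gray pixels white, and a higher threshold keeps more of them as black ink.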

Usage

weread.scan2pdf('https://weread.qq.com/web/reader/a57325c05c8ed3a57224187kc81322c012c81e728d9d180')
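
To scan several books in one session, the call above can be wrapped in a loop. The helper below is a sketch under stated assumptions: scan_books is a hypothetical name, weread is assumed to be a logged-in WeRead instance, and it is assumed (not confirmed by this README) that scan2pdf raises an exception when a scan fails.

```python
from typing import Dict, Iterable


def scan_books(weread, book_urls: Iterable[str], save_at: str = '.') -> Dict[str, Exception]:
    """Scan each URL in turn and return a mapping of failed URL -> exception."""
    failures: Dict[str, Exception] = {}
    for url in book_urls:
        try:
            # Keyword names come from the scan2pdf signature documented above.
            # show_output=False avoids opening every PDF during a batch run.
            weread.scan2pdf(url, save_at=save_at, show_output=False)
        except Exception as exc:
            # Keep going so that one bad book does not stop the whole batch.
            failures[url] = exc
    return failures
```

A call might look like scan_books(weread, [url_a, url_b], save_at='.') inside the with WeRead(...) block, after weread.login().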

Disclaimer

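  • This script is intended only for crawling books you have purchased, for private study. Commercial use and online redistribution are prohibited; please respect WeRead's interests
  • If a user uses this script for improper purposes, the responsibility lies with that user, and the author accepts no liability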
