
yokonsan / Dingdian

License: MIT

A novel-reading website built with a Python crawler and Flask

Programming Languages

python

139335 projects - #7 most used programming language

python3

1442 projects

Labels

spider flask-application

Projects that are alternatives of or similar to Dingdian

Text Sherlock

Text (source code) search engine with indexer and a front end web interface to search. Uses Python 3.

Stars: ✭ 103 (-10.43%)

Mutual labels: flask-application

Hive

Lots of spiders (many crawlers)

Stars: ✭ 110 (-4.35%)

Mutual labels: spider

Scrala

Unmaintained 🐳 ☕️ 🕷 Scala crawler(spider) framework, inspired by scrapy, created by @gaocegege

Stars: ✭ 113 (-1.74%)

Mutual labels: spider

Nl2lf

Resources for "Natural Language to Logical Form": a collection of research materials.

Stars: ✭ 105 (-8.7%)

Mutual labels: spider

Not Your Average Web Crawler

A web crawler (for bug hunting) that gathers more than you can imagine.

Stars: ✭ 107 (-6.96%)

Mutual labels: spider

Pokr.kr

Pokr [ˈpō-kər] - Politics in Korea (Out of service since Dec 2018)

Stars: ✭ 111 (-3.48%)

Mutual labels: flask-application

Ruia

Async Python 3.6+ web scraping micro-framework based on asyncio

Stars: ✭ 1,366 (+1087.83%)

Mutual labels: spider

Douban Movie

A Golang crawler that scrapes the Douban Movie Top 250

Stars: ✭ 114 (-0.87%)

Mutual labels: spider

White

A Blog Cms Website backed by MySQL in Flask&Python

Stars: ✭ 108 (-6.09%)

Mutual labels: flask-application

Pkulaw spider

Scrapes the PKU Law (北大法宝) site http://www.pkulaw.cn/Case/

Stars: ✭ 113 (-1.74%)

Mutual labels: spider

Skycaiji

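Skycaiji (蓝天采集器) is free crawler software for collecting and publishing data, built with PHP and MySQL. It can be deployed on a cloud server, collects almost any type of web page, integrates with a wide range of CMS site builders, publishes data in real time without logging in, and runs fully automatically without manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.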

Stars: ✭ 1,514 (+1216.52%)

Mutual labels: spider

Crawler Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

Stars: ✭ 1,549 (+1246.96%)

Mutual labels: spider

Baiduspider

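BaiduSpider, a crawler that scrapes Baidu search results. It currently supports Baidu web search, image search, Zhidao search, video search, news search, Wenku search, Jingyan search, and Baike search.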

Stars: ✭ 105 (-8.7%)

Mutual labels: spider

Animesearcher

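Aggregates video and danmaku (bullet comment) resources from third-party sites to give freeloaders the best experience for watching anime and series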

Stars: ✭ 101 (-12.17%)

Mutual labels: spider

Douyin Api

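Douyin API, Douyin data, Douyin livestream data, Douyin livestream API, Douyin video API, Douyin crawler, Douyin watermark removal, Douyin video download, Douyin video parsing, Douyin livestream monitoring, Douyin data collection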

Stars: ✭ 112 (-2.61%)

Mutual labels: spider

Pspider

A simple distributed crawler framework

Stars: ✭ 102 (-11.3%)

Mutual labels: spider

Jobs Search

🕷 A collection of crawlers for job-recruitment sites; branches are updated from time to time

Stars: ✭ 111 (-3.48%)

Mutual labels: spider

Bilibili member crawler

A Bilibili user crawler. Yay, it's a crawler!

Stars: ✭ 115 (+0%)

Mutual labels: spider

Geetest

Slider CAPTCHA; hope it helps you ❤️

Stars: ✭ 114 (-0.87%)

Mutual labels: spider

Cockroach

Yet another Java content-fetching tool (that is, a crawler)

Stars: ✭ 112 (-2.61%)

Mutual labels: spider


dingdian

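Notes

Because the Dingdian site was updated, this project has also had a major update that essentially discards every implementation approach used in the previous version.

Still annoyed by ad pop-ups on online novel sites? Write your own site.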

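Crawler implementation

~~Uses regular expressions together with the requests library to scrape novel data from the Dingdian site.~~

Because matching data with re was too slow, the crawler now uses XPath with the requests library to scrape novel data from the Dingdian site.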
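As a rough illustration of the XPath approach, the sketch below pulls book titles and links out of a small, well-formed HTML fragment. The markup, the `result` class name, and the `parse_search_results` helper are assumptions made for this example and do not reflect the real Dingdian page structure. To stay dependency-free it uses the limited XPath subset in the standard library's `xml.etree.ElementTree`; a real crawler would more likely fetch pages with requests and query them with a full XPath engine such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a search results page.
SAMPLE_HTML = """
<html>
  <body>
    <div class="result">
      <a href="/book/1/">Book One</a>
    </div>
    <div class="result">
      <a href="/book/2/">Book Two</a>
    </div>
    <div class="ad">
      <a href="/ad/">Buy now</a>
    </div>
  </body>
</html>
"""


def parse_search_results(html):
    """Return (title, url) pairs for every link inside a div with class 'result'."""
    root = ET.fromstring(html.strip())
    # ElementTree supports a subset of XPath, including attribute predicates.
    links = root.findall(".//div[@class='result']/a")
    return [(link.text.strip(), link.get("href")) for link in links]


books = parse_search_results(SAMPLE_HTML)
```

A single path expression selects the wanted links and skips the ad block, which is the kind of structure-aware selection that is awkward to express with regular expressions.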

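Crawler API calls:

Search results page: DdSpider().get_index_result(search, page=0)
Novel chapter list page: DdSpider().get_chapter(book_url)
Chapter content: DdSpider().get_article(chapter_url)

In a normal search, the best-matching results appear on the first page, so the crawler fetches only the first page by default. The Jinja2 templates do include next-page and previous-page buttons, and the crawler fetches whichever page is requested rather than scraping many pages at once and slowing the site down.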
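The page-at-a-time behaviour can be sketched as follows. `PagedSpider`, `SEARCH_URL`, the query parameter names, and the injected `fetch` callable are all hypothetical stand-ins chosen for this example; only the `get_index_result(search, page=0)` signature comes from the project.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; the real site's URL scheme may differ.
SEARCH_URL = "https://example.com/search"


class PagedSpider:
    """Hypothetical stand-in for DdSpider that fetches one results page per call."""

    def __init__(self, fetch):
        # fetch is any callable that takes a URL and returns the page body,
        # for example a thin wrapper around requests.get in a real crawler.
        self.fetch = fetch

    def build_url(self, search, page=0):
        return SEARCH_URL + "?" + urlencode({"q": search, "p": page})

    def get_index_result(self, search, page=0):
        # Only the requested page is downloaded; nothing is prefetched.
        return self.fetch(self.build_url(search, page))


requested = []


def fake_fetch(url):
    """Record the URL instead of touching the network."""
    requested.append(url)
    return "<html>stub</html>"


spider = PagedSpider(fake_fetch)
spider.get_index_result("example")          # first page by default
spider.get_index_result("example", page=2)  # e.g. after a "next page" click
```

Each call results in exactly one request, so the cost of a search stays constant no matter how many result pages exist.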

The crawler is encapsulated in the DdSpider class, so if the site is updated again, only DdSpider needs to change.

Flask

Each time the crawler runs, the scraped data is saved to a database through SQLAlchemy, which speeds up subsequent visits.
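The caching idea amounts to "look in the database first, crawl only on a miss". The sketch below shows that pattern with the standard library's `sqlite3` standing in for SQLAlchemy so that it runs without extra packages; the `chapters` table, the `get_chapter_cached` helper, and the `fake_crawl` function are assumptions for illustration, not the project's actual models.

```python
import sqlite3


def get_chapter_cached(conn, chapter_url, crawl):
    """Return chapter text from the database, crawling only on a cache miss."""
    row = conn.execute(
        "SELECT content FROM chapters WHERE url = ?", (chapter_url,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Cache miss: crawl once, then store the result for later visits.
    content = crawl(chapter_url)
    conn.execute(
        "INSERT INTO chapters (url, content) VALUES (?, ?)",
        (chapter_url, content),
    )
    conn.commit()
    return content


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chapters (url TEXT PRIMARY KEY, content TEXT)")

crawl_calls = []


def fake_crawl(url):
    """Hypothetical crawler stub that records each call."""
    crawl_calls.append(url)
    return "chapter text for " + url


first = get_chapter_cached(conn, "/book/1/1.html", fake_crawl)
second = get_chapter_cached(conn, "/book/1/1.html", fake_crawl)
```

The second lookup is served from the database without calling the crawler again. The trade-off is that cached rows go stale when a novel gains new chapters.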

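After a while, novels that are still being serialized may have new chapters. When that happens, clear the database (uncomment the database-clearing code in manage.py) and start the app again.

Running locally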

$ pip install -r requirements.txt
$ python manage.py db upgrade
$ python manage.py runserver --host 0.0.0.0

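Deployment

Deployed to Heroku with Gunicorn; for details, see here

Remember to push your migrations/ directory to the repository. Also, I changed the deploy command in my manage.py (pushing to the remote server caused a database upgrade error), so you will need to change it back to code that upgrades the database:

@manager.command
def deploy():
    from flask_migrate import upgrade
    # Apply pending database migrations
    upgrade()

The rest of the deployment steps stay the same.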

Visit: Mynovels

View details

Link to the Zhihu post


Stars: ✭ 115
