itning / DouBanReptile

Licence: Apache-2.0 license
A multithreaded crawler for Douban rental groups. After crawling, it automatically generates a markdown file sorted by time.

Programming Languages

go
powershell

Projects that are alternatives of or similar to DouBanReptile

tts-deckconverter
Generate card decks for Tabletop Simulator.
Stars: ✭ 27 (-12.9%)
Mutual labels:  fyne
douban
Douban movie API based on ThinkPHP 5.1
Stars: ✭ 106 (+241.94%)
Mutual labels:  douban
Top15
[EOL] Use Top15 to show the movies, books, and music you recently watched, read, or listened to on your website!
Stars: ✭ 13 (-58.06%)
Mutual labels:  douban
doubanIMDb
IMDb + Rotten Tomatoes + Wikipedia on Douban Movie
Stars: ✭ 93 (+200%)
Mutual labels:  douban
auto-click-auto-fill
Auto Click Auto Fill on any web page
Stars: ✭ 111 (+258.06%)
Mutual labels:  xpath
douban-book-api
Third-party Douban Books API
Stars: ✭ 44 (+41.94%)
Mutual labels:  douban
brackit
Query processor with proven optimizations, ready to use for your document store to query semi-structured data with a JSONiq like extension of XQuery. Can also be used as an ad-hoc in-memory query processor.
Stars: ✭ 28 (-9.68%)
Mutual labels:  xpath
douban-movie
Get movie info from douban(豆瓣) and display in your terminal
Stars: ✭ 17 (-45.16%)
Mutual labels:  douban
ToolsCollection
No description or website provided.
Stars: ✭ 20 (-35.48%)
Mutual labels:  douban
DoubanMovieJSON
Douban movie JSON data
Stars: ✭ 60 (+93.55%)
Mutual labels:  douban
python-crawler
A repository for learning web crawling, suitable for people with no prior experience and friendly to beginners
Stars: ✭ 37 (+19.35%)
Mutual labels:  xpath
doubanrobot
A simple robot for Douban.com
Stars: ✭ 34 (+9.68%)
Mutual labels:  douban
shirokumacafe
Douban broadcast feed for Shirokuma Cafe (白熊咖啡馆)
Stars: ✭ 21 (-32.26%)
Mutual labels:  douban
gosquito
gosquito ("go" + "mosquito") is a pluggable tool for data gathering, data processing and data transmitting to various destinations.
Stars: ✭ 25 (-19.35%)
Mutual labels:  xpath
dotnet-security-unit-tests
A web application that contains several unit tests for the purpose of .NET security
Stars: ✭ 25 (-19.35%)
Mutual labels:  xpath
exml
Most simple Elixir wrapper for xmerl xpath
Stars: ✭ 23 (-25.81%)
Mutual labels:  xpath
fontoxpath
A minimalistic XPath 3.1 implementation in pure JavaScript
Stars: ✭ 97 (+212.9%)
Mutual labels:  xpath
go-xmldom
XML DOM processing for Golang, supports xpath query
Stars: ✭ 38 (+22.58%)
Mutual labels:  xpath
OpenScraper
An open source webapp for scraping: towards a public service for webscraping
Stars: ✭ 80 (+158.06%)
Mutual labels:  xpath
PopClip-Extensions
Extensions I made for PopClip.
Stars: ✭ 17 (-45.16%)
Mutual labels:  douban

Douban Rental Crawler

Download

https://github.com/itning/DouBanReptile/releases

Build

go build -ldflags="-s -w -H windowsgui" -o ..\bin\main.exe DouBanReptile/cmd

The crawl result file (markdown) is best opened with Typora.

Usage guide

Make sure the font file simsun.ttc exists in the C:\Windows\Fonts\ directory.

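  1. How do I set the Douban group link?

    1. First search for rentals in a region, for example: 北京租房 (Beijing rentals)

    2. Open the group you want to crawl, for example the first result: 北京租房

    3. Scroll to the bottom of the page, find the "> 更多小组讨论" (more group discussions) link, and click it

    4. Copy the address from the address bar (from /group through to the end) and paste it into the software's Douban group link setting

      The software sometimes crashes when the link is pasted, for reasons unknown; deleting the existing link in the software before pasting is recommended.

    5. Change the number 50 after start= to %d

    6. Done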
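The %d placeholder from step 5 suggests the link is used as a format string that receives a page offset. The following is a minimal sketch of that idea, not the project's actual code: the function name `pageURL` and the page size of 25 topics are assumptions made for illustration, and the host and group ID are taken from the example URL in this guide.

```go
package main

import "fmt"

// pageSize is an assumption: how many topics one listing page holds.
const pageSize = 25

// pageURL (hypothetical helper) fills the %d placeholder in the configured
// group link with the offset of the given zero-based page.
func pageURL(linkTemplate string, page int) string {
	return "https://www.douban.com" + fmt.Sprintf(linkTemplate, page*pageSize)
}

func main() {
	// The link as configured in the software: copied from /group onward,
	// with the number after start= replaced by %d.
	tmpl := "/group/554566/discussion?start=%d"
	for page := 0; page < 3; page++ {
		fmt.Println(pageURL(tmpl, page))
	}
}
```

With this template, consecutive pages map to start=0, start=25, start=50, and so on, which is why a literal number in the link would pin the crawler to a single page.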

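  2. How do I set exclude (include) keywords?

    With exclude keywords, a rental listing is dropped as soon as any keyword appears in its title or content.

    For example, the default keyword is 限女 ("women only"): any listing containing 限女, such as 限女生入住 or 只限女生, is not crawled.

    Separate multiple keywords with |, and note that it must be the ASCII (half-width) character.

    For example, with 限女|短租|整租 configured, the software skips any listing whose title or content contains one of these three keywords.

    Include keywords apply to the title only. For example, with include keyword A: a listing whose title contains A but whose content does not is crawled; a listing whose content contains A but whose title does not is not crawled.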
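The keyword rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the names `splitKeywords`, `containsAny`, and `keep` are hypothetical, an empty include setting is assumed to mean "no include filter", and several include keywords are assumed to match if any one of them appears in the title.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKeywords splits a "|"-separated setting and drops empty entries.
func splitKeywords(setting string) []string {
	var out []string
	for _, k := range strings.Split(setting, "|") {
		if k = strings.TrimSpace(k); k != "" {
			out = append(out, k)
		}
	}
	return out
}

// containsAny reports whether text contains at least one keyword.
func containsAny(text string, keywords []string) bool {
	for _, k := range keywords {
		if strings.Contains(text, k) {
			return true
		}
	}
	return false
}

// keep reports whether a listing passes both filters.
// Exclude keywords are checked against title and content;
// include keywords are checked against the title only.
func keep(title, content string, exclude, include []string) bool {
	if containsAny(title, exclude) || containsAny(content, exclude) {
		return false
	}
	// Assumption: an empty include list means no include filter.
	if len(include) > 0 && !containsAny(title, include) {
		return false
	}
	return true
}

func main() {
	exclude := splitKeywords("限女|短租|整租")
	include := splitKeywords("")
	fmt.Println(keep("朝阳次卧 2500", "只限女生入住", exclude, include))    // false
	fmt.Println(keep("朝阳次卧 2500", "近地铁,随时看房", exclude, include)) // true
}
```

Checking the exclude list first means an excluded listing is dropped even when its title also matches an include keyword; the original text does not say which rule wins, so that ordering is a design choice of this sketch.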

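  3. About recognizing the price in the title

    The regex \b\d{4}\b is used to recognize price information in the title. Because it only matches standalone four-digit numbers, listings priced below 1000 yuan cannot be crawled.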
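To make the behaviour of that pattern concrete, here is a small sketch using Go's standard regexp package. The function name `priceFromTitle` is hypothetical, and taking the first match in the title is an assumption; the original text only states which regex is used.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// priceRe is the pattern quoted in the guide: a standalone four-digit number.
var priceRe = regexp.MustCompile(`\b\d{4}\b`)

// priceFromTitle (hypothetical helper) returns the first four-digit number
// in the title. Assumption: the first match is treated as the price.
func priceFromTitle(title string) (int, bool) {
	m := priceRe.FindString(title)
	if m == "" {
		return 0, false
	}
	n, err := strconv.Atoi(m)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, title := range []string{"朝阳次卧 2500元/月", "次卧900元", "整租12000"} {
		price, ok := priceFromTitle(title)
		fmt.Println(title, price, ok)
	}
}
```

In Go's regexp syntax \b is an ASCII word boundary, so a Chinese character next to the digits counts as a boundary and "2500元" still matches. Three-digit and five-digit numbers do not match, which is the limitation described above.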

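  4. About the ordering of crawl results

    Results are sorted by price in ascending order first; listings with the same price are then sorted by post time.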
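A two-level ordering like this can be sketched with sort.SliceStable from the Go standard library. The `Listing` type and its field names are hypothetical, and the direction of the time tie-break (newest first) is an assumption, since the text does not state it.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Listing is a hypothetical record for one crawled post.
type Listing struct {
	Title    string
	Price    int
	PostedAt time.Time
}

// sortListings orders by price ascending, then by post time.
// Assumption: among equal prices, newer posts come first.
func sortListings(ls []Listing) {
	sort.SliceStable(ls, func(i, j int) bool {
		if ls[i].Price != ls[j].Price {
			return ls[i].Price < ls[j].Price
		}
		return ls[i].PostedAt.After(ls[j].PostedAt)
	})
}

func main() {
	t0 := time.Date(2020, 1, 1, 12, 0, 0, 0, time.UTC)
	ls := []Listing{
		{Title: "c", Price: 3000, PostedAt: t0},
		{Title: "a", Price: 2500, PostedAt: t0},
		{Title: "b", Price: 2500, PostedAt: t0.Add(time.Hour)},
	}
	sortListings(ls)
	for _, l := range ls {
		fmt.Println(l.Price, l.Title)
	}
}
```

Swapping After for Before in the comparison gives oldest-first ordering within a price group if that is what the tool actually does.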

  5. How to open the crawl result file (.md extension)

    Downloading Typora is recommended.

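  6. How do I set the cookie?

    1. Open a Douban group, for example: https://www.douban.com/group/554566/discussion?start=0

    2. Press F12 to open the developer tools and click the Console tab

    3. Type document.cookie and press Enter, then copy the output (do not copy the surrounding double quotes)

    4. Paste the copied content into the program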
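The string returned by document.cookie is already in the form an HTTP Cookie header expects, so a crawler can attach it to each request unchanged. The sketch below shows one way to do that with net/http; it only builds the request and does not send it. The function name `newRequest` and the cookie names and values are placeholders, not details from the project.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRequest (hypothetical helper) builds a GET request that carries the
// cookie string copied from the browser console.
func newRequest(url, cookie string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// The copied text goes into the Cookie header verbatim,
	// without the surrounding double quotes.
	req.Header.Set("Cookie", cookie)
	return req, nil
}

func main() {
	// Placeholder cookie; a real value comes from document.cookie.
	cookie := "name1=EXAMPLE; name2=EXAMPLE"
	req, err := newRequest("https://www.douban.com/group/554566/discussion?start=0", cookie)
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))
}
```

This also shows why the quotes must not be copied: they would become part of the header value rather than delimit it.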

Testing

Operating system    Test result
Windows 7 SP1       OK
Windows 10 1909     OK