duoan / codes-scratch-crawler

Licence: Apache-2.0 License
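Reading notes on the book 《自己动手写网络爬虫》 (roughly, "Write Your Own Web Crawler"), with code I typed out myself. The notes mainly cover the basic implementation of a web crawler, web page de-duplication algorithms, web page fingerprinting algorithms, and text mining.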

Programming Languages

Java

Crawler-related knowledge and code

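  • ConsistentHash: consistent hashing algorithm

  • HashAlgorithms: a collection of hash algorithms

  • MurmurHash: the MurmurHash algorithm, a non-cryptographic hash with high performance and a low collision rate

  • IPSeeker: wraps Tencent's IP database and provides utilities that read the QQwry.dat file to look up a friend's location from an IP address

  • HITS: implementation of the HITS algorithm

  • PageRank: implementation of the PageRank algorithm

  • WebGraph: Web graph modeling

  • WebGraphMemory: in-memory Web graph

  • SimpleBloomFilter: Bloom filter

  • BDBFrontier: uses Berkeley DB to store the crawler's frontier (the list of URLs waiting to be crawled)

  • Crawler: a crawler that traverses the web breadth-first and downloads pages with HttpClient 4.3

  • CrawlUrl: an object that wraps a URL to be crawled; its layer variable can be used to limit the crawl depth

  • DownLoadFile: a utility class that downloads page data to local storage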
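
To make the SimpleBloomFilter entry more concrete, here is a minimal sketch of how a Bloom filter can de-duplicate URLs. It is not the repository's code: the class name BloomFilterSketch, the seeded polynomial hash, and the seed values are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter for URL de-duplication (hypothetical, not the repository's SimpleBloomFilter). */
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;     // number of bits in the filter
    private final int[] seeds;  // one seed per hash function

    BloomFilterSketch(int size, int[] seeds) {
        this.size = size;
        this.seeds = seeds.clone();
        this.bits = new BitSet(size);
    }

    // Seeded polynomial hash over the UTF-8 bytes, mapped into [0, size).
    // This hash is an assumption; a stronger one such as MurmurHash could be swapped in.
    private int index(String value, int seed) {
        int h = 0;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h = seed * h + (b & 0xff);
        }
        return Math.floorMod(h, size);
    }

    /** Sets one bit per hash function. */
    void add(String value) {
        for (int seed : seeds) {
            bits.set(index(value, seed));
        }
    }

    /** Returns false if the value was definitely never added; true means "possibly added". */
    boolean mightContain(String value) {
        for (int seed : seeds) {
            if (!bits.get(index(value, seed))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch seen = new BloomFilterSketch(1 << 20, new int[] {7, 11, 13, 31, 37, 61});
        seen.add("http://example.com/a");
        System.out.println(seen.mightContain("http://example.com/a")); // added values always match
        System.out.println(seen.mightContain("http://example.com/b")); // a match here would be a false positive
    }
}
```

A crawler would call mightContain before enqueueing a URL and add after enqueueing it. The trade-off is that a small fraction of unseen URLs may be skipped as false positives, in exchange for a fixed memory footprint.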
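
The ConsistentHash entry can be illustrated with a small hash ring. The sketch below is an assumption about how such a ring is commonly built (a sorted map of virtual nodes), not a copy of the repository's class. The names ConsistentHashSketch, addNode, removeNode and nodeFor are hypothetical, and FNV-1a stands in for whichever hash the project actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative consistent-hash ring with virtual nodes (hypothetical names throughout). */
class ConsistentHashSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual nodes placed on the ring per physical node

    ConsistentHashSketch(int replicas) {
        this.replicas = replicas;
    }

    // 32-bit FNV-1a, used here only as a simple stand-in hash function.
    static int fnv1a(String key) {
        int h = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    void addNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.put(fnv1a(node + "#" + i), node);
        }
    }

    // Sketch-level simplification: a hash collision between two nodes' virtual points is not handled.
    void removeNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(fnv1a(node + "#" + i));
        }
    }

    /** Returns the node owning the first ring position at or after the key's hash, wrapping around. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(fnv1a(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashSketch ring = new ConsistentHashSketch(100);
        ring.addNode("crawler-a");
        ring.addNode("crawler-b");
        System.out.println(ring.nodeFor("http://example.com/index.html"));
    }
}
```

The point of the ring is that removing one node only reassigns the keys that node owned; keys owned by the remaining nodes stay where they were, which is useful when URLs are partitioned across several crawler workers.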
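
For the PageRank entry, a common damped formulation is PR(j) = (1 - d) / N + d * Σ PR(i) / out(i), summed over the pages i that link to j. The following sketch iterates that formula over a tiny adjacency list. The class name PageRankSketch, the array-based graph representation, and the choice to spread the rank of pages without out-links evenly over all pages are assumptions, not details taken from the repository.

```java
import java.util.Arrays;

/** Illustrative PageRank by power iteration (hypothetical, not the repository's implementation). */
class PageRankSketch {
    /**
     * links[i] lists the pages that page i links to.
     * Returns the rank vector after the given number of iterations.
     */
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages without out-links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int j : links[i]) {
                    next[j] += share;
                }
            }
            for (int j = 0; j < n; j++) {
                // Assumption: dangling rank is redistributed evenly so the ranks keep summing to 1.
                next[j] = (1 - damping) / n + damping * (next[j] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Pages 0 and 1 both link to page 2; page 2 links back to page 0.
        int[][] links = {{2}, {2}, {0}};
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In this example page 1 has no incoming links, so it ends up with only the baseline (1 - d) / N share, while page 2 collects rank from both other pages.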
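
The Crawler and CrawlUrl entries describe a breadth-first crawl whose depth is limited through a layer value. The sketch below shows that idea on an in-memory link graph, so it runs without network access. It is a simplified assumption of the approach: LayeredCrawlSketch, CrawlItem and the linkGraph parameter are hypothetical names, and a real crawler would download each page (for example with HttpClient) and extract its links instead of reading them from a map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Illustrative breadth-first crawl with a layer limit (hypothetical names, no network access). */
class LayeredCrawlSketch {
    /** A URL plus the layer (link distance from the seed) at which it was discovered. */
    static final class CrawlItem {
        final String url;
        final int layer;

        CrawlItem(String url, int layer) {
            this.url = url;
            this.layer = layer;
        }
    }

    /** Visits pages breadth-first from the seed and does not follow links beyond maxLayer. */
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph, int maxLayer) {
        Queue<CrawlItem> frontier = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        Set<String> visited = new HashSet<>();          // de-duplicates URLs already enqueued
        List<String> order = new ArrayList<>();
        frontier.add(new CrawlItem(seed, 0));
        visited.add(seed);
        while (!frontier.isEmpty()) {
            CrawlItem item = frontier.remove();
            order.add(item.url); // a real crawler would download the page here
            if (item.layer >= maxLayer) {
                continue; // layer limit reached: do not expand this page's links
            }
            for (String next : linkGraph.getOrDefault(item.url, Collections.<String>emptyList())) {
                if (visited.add(next)) {
                    frontier.add(new CrawlItem(next, item.layer + 1));
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("d"));
        graph.put("c", Arrays.asList("d", "a"));
        graph.put("d", Arrays.asList("e"));
        System.out.println(crawl("a", graph, 2)); // prints [a, b, c, d]
    }
}
```

The in-memory HashSet and ArrayDeque play the roles that a Bloom filter and a Berkeley DB backed frontier would take over once the crawl no longer fits in memory.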
