myvyang / Chromium_for_spider

dynamic crawler for web vulnerability scanner

中文版本(Chinese version)

Core functions

This project modifies the Chromium source code to implement the two features that matter most to a dynamic crawler:

  1. Navigation away from the current page is blocked, and the target URL is collected for later use. This is done by patching a relatively low-level function, so there is no need to hook each navigation scenario separately.
  2. All non-default events bound by the current page are hooked, and their context is kept so they can be triggered later. This way the crawler does not need to traverse every DOM node.

There are also some smaller features:

  1. File downloads are disabled.
  2. The "X-Frame-Options" header is ignored, so arbitrary pages can be loaded in an iframe.
  3. The alert, print, confirm, and prompt popups are suppressed.
  4. The page cannot open a new window on its own; the URL of the window it tried to open is recorded.
  5. The navigator.webdriver property is hidden.

Download

Compiled binary: https://github.com/myvyang/chromium_for_spider/releases

Example

Because navigation is hooked, once a page is open it should not navigate away no matter what you click (or what JS is executed). The URLs the page tried to navigate to are recorded in window.info.

[Image: eventNodes]

Instructions for use

With the core features implemented in the Chromium source code, three properties are added to the browser's window object: window.info, window.eventNames, and window.eventNodes.

window.info records the navigation URLs triggered in the page, among other things; entries are separated by "_-_".
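As a rough illustration of reading this value, the sketch below splits a window.info string into its entries. It assumes window.info is a single string whose entries are joined by "_-_"; the function name parseInfo, the de-duplication step, and the sample URLs are hypothetical and not part of the project.

```javascript
// Assumption: window.info is one string, with entries joined by "_-_".
const INFO_SEPARATOR = "_-_";

// Hypothetical helper: split the recorded string into unique, non-empty entries.
function parseInfo(info) {
  if (typeof info !== "string" || info.length === 0) {
    return [];
  }
  const seen = new Set();
  const entries = [];
  for (const raw of info.split(INFO_SEPARATOR)) {
    const entry = raw.trim();
    if (entry !== "" && !seen.has(entry)) {
      seen.add(entry);
      entries.push(entry);
    }
  }
  return entries;
}

// Made-up sample value; in the patched browser you would pass window.info.
const sample = "http://example.com/a_-_http://example.com/b_-_http://example.com/a";
console.log(parseInfo(sample));
```

An entry that itself contains "_-_" would be split incorrectly by this approach, so treat it as a starting point only.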

window.eventNames and window.eventNodes are used together. eventNames holds the event names, such as click, onmouseover, etc. eventNodes holds the DOM nodes the events are bound to, which can be read from JS. See ch_test/fireevent.html for usage examples.
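The following sketch shows one way the two properties might be combined to trigger the recorded events. It assumes that eventNames and eventNodes are parallel lists (the name at index i belongs to the node at index i), which the text above does not confirm; ch_test/fireevent.html remains the authoritative example. The function fireRecordedEvents and its makeEvent parameter are hypothetical. Names are passed through unchanged, so if the recorded names carry an "on" prefix, the makeEvent callback would need to strip it.

```javascript
// Assumption: eventNames[i] is the event recorded for eventNodes[i].
// Hypothetical helper: dispatch every recorded event on its node.
function fireRecordedEvents(eventNames, eventNodes, makeEvent) {
  const count = Math.min(eventNames.length, eventNodes.length);
  const fired = [];
  for (let i = 0; i < count; i++) {
    const name = String(eventNames[i]);
    const node = eventNodes[i];
    // Skip entries that are not dispatch-capable nodes.
    if (!node || typeof node.dispatchEvent !== "function") {
      continue;
    }
    try {
      node.dispatchEvent(makeEvent(name));
      fired.push(name);
    } catch (err) {
      // Keep going if a single dispatch fails.
    }
  }
  return fired;
}

// Stand-in nodes so the sketch also runs outside a browser.
const log = [];
const fakeNode = (id) => ({
  dispatchEvent(event) {
    log.push(id + ":" + event.type);
  },
});
fireRecordedEvents(
  ["click", "mouseover"],
  [fakeNode("a"), fakeNode("b")],
  (type) => ({ type })
);
console.log(log);
```

Inside the patched browser, the call would look like fireRecordedEvents(window.eventNames, window.eventNodes, (type) => new Event(type, { bubbles: true })).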

Compile

As of 2019-05-17, the Chromium revision used is dbc6c805b7430f401875d50b8566d9f743ca402b, which has been tested and compiles without difficulty. Some of Chromium's dependencies may become unavailable over time. If the build fails, please open an issue as a reminder to update the Chromium version.

Compiling Chromium is simple these days. Follow the official steps and choose the correct development revision (such as dbc6c805b7430f401875d50b8566d9f743ca402b, currently used), and the whole process can complete without warnings.

See the official documentation: https://www.chromium.org/developers/how-tos/get-the-code.

  1. Follow the official steps first: download the source code and prepare the environment.
  2. Run git checkout dbc6c805b7430f401875d50b8566d9f743ca402b to switch to the specified revision.
  3. Run gclient sync. This step may report an error. If the error concerns one of Chromium's modules, delete that module and run the command again.
  4. Run git apply path/to/dbc6c805b7430f401875d50b8566d9f743ca402b.diff to apply the patch.
  5. Run gn args out/Release to add parameters to args.gn (optional; does not affect usability).
  6. Run gn gen out/Release to generate the build files.
  7. Run autoninja -C out/Release chrome to start compiling.

The executable file on Mac is src/out/Release/Chromium.app/Contents/MacOS/Chromium, and the executable file on Ubuntu is src/out/Release/chrome.
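A crawler script that launches the build needs the right path for the current platform. The sketch below is one way to pick it in Node.js. The function chromiumBinary and the checkout directory /opt/chromium are hypothetical, and mapping Ubuntu to Node's "linux" platform string is an assumption.

```javascript
// Build output paths from the text above, relative to the directory containing src/.
// Keys are Node.js process.platform values; "linux" standing for Ubuntu is an assumption.
const RELATIVE_BINARIES = {
  darwin: "src/out/Release/Chromium.app/Contents/MacOS/Chromium",
  linux: "src/out/Release/chrome",
};

// Hypothetical helper: return the full path of the compiled browser binary.
function chromiumBinary(checkoutDir, platform = process.platform) {
  const relative = RELATIVE_BINARIES[platform];
  if (!relative) {
    throw new Error("No known build output path for platform: " + platform);
  }
  // Drop trailing slashes so the joined path has exactly one separator.
  return checkoutDir.replace(/\/+$/, "") + "/" + relative;
}

console.log(chromiumBinary("/opt/chromium", "linux"));
console.log(chromiumBinary("/opt/chromium/", "darwin"));
```

The returned path can then be handed to whatever drives the browser; Puppeteer's launch() function, for instance, accepts it through the executablePath option.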
