
dantleech / Fink

Licence: MIT
PHP Link Checker

Projects that are alternatives to or similar to Fink

Digger
Digger is a powerful and flexible web crawler implemented in pure Go
Stars: ✭ 130 (-17.2%)
Mutual labels:  spider
Venom
All Terrain Autonomous Quadruped
Stars: ✭ 145 (-7.64%)
Mutual labels:  spider
Zhihuquestionsspider
😊😊😊 A spider for Zhihu questions
Stars: ✭ 152 (-3.18%)
Mutual labels:  spider
Bilibili User Information Spider
A spider for the account information of 300 million Bilibili users (mid, nickname, gender, followees, followers, level)
Stars: ✭ 136 (-13.38%)
Mutual labels:  spider
Qiandao
🌟⏳🌟 Automated check-ins for various websites (no longer maintained)
Stars: ✭ 141 (-10.19%)
Mutual labels:  spider
Taobaoscrapy
😩 A tool for Taobao/Tmall | childhood toys are already out of date
Stars: ✭ 146 (-7.01%)
Mutual labels:  spider
Weibo Topic Spider
A spider for Weibo super-topics, with word-frequency statistics, sentiment analysis and simple classification; newly added: data crawled from the pneumonia (COVID-19) super-topic
Stars: ✭ 128 (-18.47%)
Mutual labels:  spider
Scriptspider
A distributed, general-purpose crawler written in Java with pluggable components (defaults provided)
Stars: ✭ 155 (-1.27%)
Mutual labels:  spider
Crawler China Mainland Universities
A spider for the list of mainland Chinese universities
Stars: ✭ 143 (-8.92%)
Mutual labels:  spider
Jlitespider
A lite distributed Java spider framework :-)
Stars: ✭ 151 (-3.82%)
Mutual labels:  spider
Ipproxy
IP proxies for crawlers: scrapes proxy IPs from nine websites, with checking/cleaning/storage/updating and an API for retrieval
Stars: ✭ 136 (-13.38%)
Mutual labels:  spider
Amazonbigspider
😱 Fully automatic distributed Amazon spider | distributed product-selection scraping across four international Amazon sites | account: admin, password: adminadmin
Stars: ✭ 140 (-10.83%)
Mutual labels:  spider
Netease Music Spider
netease-music-spider is a spider with which you can find a beautiful girlfriend or handsome boyfriend.
Stars: ✭ 147 (-6.37%)
Mutual labels:  spider
Mm131
An image scraper for the MM131 website 🚨
Stars: ✭ 129 (-17.83%)
Mutual labels:  spider
Python3 Spider
Practical Python spiders - simulated logins for major websites, including but not limited to: slider CAPTCHA verification, Pinduoduo, Meituan, Baidu, Bilibili, Dianping and Taobao. Please star if you like it ❤️
Stars: ✭ 2,129 (+1256.05%)
Mutual labels:  spider
Guwen Spider
A complete serial Node.js spider that crawls more than 30,000 pages.
Stars: ✭ 129 (-17.83%)
Mutual labels:  spider
Papa
A browser-based data spider: a data assistant for everyone
Stars: ✭ 145 (-7.64%)
Mutual labels:  spider
Abot
Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+1149.04%)
Mutual labels:  spider
Fp Server
Free proxy server, continuously crawling and providing proxies, based on Tornado and Scrapy; build your own local proxy pool.
Stars: ✭ 154 (-1.91%)
Mutual labels:  spider
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (-6.37%)
Mutual labels:  spider

Fink

Fink (pronounced "Phpink") is a command line tool, written in PHP, for checking HTTP links.

  • Check websites for broken links or error pages.
  • Asynchronous HTTP requests.

Installation

Install as a stand-alone tool or as a project dependency:

Installing as a project dependency

$ composer require dantleech/fink --dev

Installing from a PHAR

Download the PHAR from the Releases page.

Building your own PHAR with Box

You can build your own PHAR by cloning this repository, installing its dependencies with Composer, and running:

$ ./vendor/bin/box compile

Usage

Run the command with a single URL to start crawling:

$ ./vendor/bin/fink https://www.example.com

Use --output=somefile to log verbose information for each URL in JSON format (see the example record after this list), including:

  • url: The tested URL.
  • status: The HTTP status code.
  • referrer: The page which linked to the URL.
  • referrer_title: The value (e.g. link title) of the referring element.
  • referrer_xpath: The path to the node in the referring document.
  • distance: The number of links away from the start document.
  • request_time: Number of microseconds taken to make the request.
  • timestamp: The time that the request was made.
  • exception: Any runtime exception encountered (e.g. malformed URL, etc).
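
For illustration, a single record in the output might look like the following (a sketch with invented values, assuming fink writes one JSON object per URL; the exact field formatting may differ):

{"url": "https://www.example.com/about", "status": 200, "referrer": "https://www.example.com/", "referrer_title": "About us", "referrer_xpath": "//body/nav/ul/li[2]/a", "distance": 1, "request_time": 124000, "timestamp": "2021-01-01T12:00:00+00:00", "exception": null}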

Arguments

  • url (multiple) Specify one or more base URLs to crawl (mandatory).
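
For example, to check two sites in a single run (URLs are illustrative):

$ fink https://www.example.com https://blog.example.com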

Options

  • --client-max-body-size: Max body size for HTTP client (in bytes).
  • --client-max-header-size: Max header size for HTTP client (in bytes).
  • --client-redirects=5: Set the maximum number of times the client should redirect (0 to never redirect).
  • --client-security-level=1: Set the default SSL security level.
  • --client-timeout=15000: Set the maximum time (in milliseconds) the client should wait for a response (default: 15,000, i.e. 15 seconds).
  • --concurrency: Number of simultaneous HTTP requests to use.
  • --display-bufsize=10: Set the number of URLs to consider when showing the display.
  • --display=+memory: Set, add or remove elements of the runtime display (prefix with - or + to modify the default set).
  • --exclude-url=logout: (multiple) Exclude URLs matching the given PCRE pattern.
  • --header="Foo: Bar": (multiple) Specify custom header(s).
  • --help: Display available options.
  • --include-link=foobar.html: Include given link as if it were linked from the base URL.
  • --insecure: Do not verify SSL certificates.
  • --load-cookies: Load cookies from the specified cookies.txt file.
  • --max-distance: Maximum allowed distance from base URL (if not specified then there is no limitation).
  • --max-external-distance: Limit the external (disjoint) distance from the base URL.
  • --no-dedupe: Do not filter duplicate URLs (can result in a non-terminating process).
  • --output=out.json: Output JSON report for each URL to given file (truncates existing content).
  • --publisher=csv: Set the publisher format; either json (the default) or csv.
  • --rate: Set the maximum number of requests to make per second.
  • --stdout: Stream to STDOUT directly, disables display and any specified outfile.
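
As a sketch of how these options combine (values are illustrative), the following crawls with ten concurrent requests, skips logout URLs, sends a custom header and writes a JSON report:

$ fink https://www.example.com \
    --concurrency=10 \
    --exclude-url="logout" \
    --header="User-Agent: fink-crawler" \
    --output=report.json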

Examples

Crawl a single website

$ fink http://www.example.com --max-external-distance=0

Crawl a single website and check the status of external links

$ fink http://www.example.com --max-external-distance=1

Use jq to analyse results

jq is a tool which can be used to query and manipulate JSON data.

$ fink http://www.example.com -x0 -oreport.json
$ cat report.json | jq -c '. | select(.status==404) | {url: .url, referrer: .referrer}' | jq
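
Since each record includes request_time (in microseconds), jq can also be used to find slow pages; for example, listing the ten slowest URLs (a sketch, assuming the report is a stream of JSON objects as described above):

$ jq -sc 'sort_by(-.request_time) | .[:10] | .[] | {url, request_time}' report.json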

Crawl pages behind a login

# create a cookies file for later re-use (here simulating a login via HTTP POST)
$ curl -L --cookie-jar mycookies.txt -d username=myLogin -d password=MyP4ssw0rd https://www.example.org/my/login/url

# re-use the cookies file with your fink crawl command
$ fink https://www.example.org/myaccount --load-cookies=mycookies.txt

Note: it is not possible to create the cookie jar on one computer, store it, and read it in again on another (e.g. a Linux server). The cookie file must be created from the very same IP address, because server-side session handling may otherwise refuse to continue the HTTP session due to an IP mismatch.

Exit Codes

  • 0: All URLs were successful.
  • 1: Unexpected runtime error.
  • 2: At least one URL failed to resolve successfully.
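
These exit codes make fink usable as a gate in a CI pipeline; a minimal shell sketch (URL and options are illustrative):

$ fink https://www.example.com --max-external-distance=0 --output=report.json
$ status=$?
$ if [ $status -ne 0 ]; then echo "Link check failed (exit code $status)"; exit $status; fi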