IAmStoxe / Urlgrab

A golang utility to spider through a website searching for additional links.

Projects that are alternatives of or similar to Urlgrab

ZhengFang System Spider

🐛 A small crawler that logs in to the ZhengFang academic administration system and scrapes data

Stars: ✭ 21 (-92.63%)

Mutual labels: spider

Zsky

DHT magnet-link BT search engine, developed in pure Python

Stars: ✭ 256 (-10.18%)

Mutual labels: spider

Java Spider

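A hands-on Java crawler framework built on top of the webmagic framework. It can crawl news content from Tencent, Sohu, Toutiao (integrated as a separate feature), and others; combined with Elasticsearch it implements automated crawling, and it is already used in production.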

Stars: ✭ 276 (-3.16%)

Mutual labels: spider

ip proxy pool

Generating spiders dynamically to crawl and check those free proxy ip on the internet with scrapy.

Stars: ✭ 39 (-86.32%)

Mutual labels: spider

ShuWo Spider

Automation scripts for the ShuWo library app (seat grabbing & loan renewal)

Stars: ✭ 14 (-95.09%)

Mutual labels: spider

Dumpall

An information-leak exploitation tool for .git/.svn source code leaks and .DS_Store leaks

Stars: ✭ 250 (-12.28%)

Mutual labels: spider

toutiao

Crawler for the Toutiao technology news API

Stars: ✭ 17 (-94.04%)

Mutual labels: spider

Hacker News Digest

📰 A responsive interface of Hacker News with summaries and thumbnails.

Stars: ✭ 278 (-2.46%)

Mutual labels: spider

galer

A fast tool to fetch URLs from HTML attributes by crawl-in.

Stars: ✭ 138 (-51.58%)

Mutual labels: spider

Bt Btt

Introduction to the magnet-link site U3C3, plus domain updates

Stars: ✭ 261 (-8.42%)

Mutual labels: spider

Douban Crawler

A crawler for https://douban.com

Stars: ✭ 13 (-95.44%)

Mutual labels: spider

bocfx

Bank of China foreign exchange rate spider / API

Stars: ✭ 30 (-89.47%)

Mutual labels: spider

Dpspider

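Dianping crawler and API. Supports crawling and searching by individual city, district, or shop, multi-type district search, and information retrieval; provides MongoDB storage support; can crawl and store decrypted review text.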

Stars: ✭ 259 (-9.12%)

Mutual labels: spider

PttImageSpider

PTT image downloader (fetches the images of an entire board and uses article titles as folder names) (uses Scrapy)

Stars: ✭ 16 (-94.39%)

Mutual labels: spider

Gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Stars: ✭ 277 (-2.81%)

Mutual labels: spider

TwEater

A Python Bot for Scraping Conversations from Twitter

Stars: ✭ 16 (-94.39%)

Mutual labels: spider

Tieba spider

Baidu Tieba crawler (based on Scrapy and MySQL)

Stars: ✭ 257 (-9.82%)

Mutual labels: spider

Crawlertutorial

A minimal crawler tutorial (fetch, parse, search, multiprocessing, API), using PTT as the example

Stars: ✭ 282 (-1.05%)

Mutual labels: spider

Alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.

Stars: ✭ 277 (-2.81%)

Mutual labels: spider

Happy Spiders

🔧 🔩 🔨 A curated collection of crawler-related tools, simulated-login techniques, proxy IPs, Scrapy template code, and more.

Stars: ✭ 261 (-8.42%)

Mutual labels: spider

Welcome to urlgrab 👋

A golang utility to spider through a website searching for additional links with support for JavaScript rendering.

Install

go get -u github.com/iamstoxe/urlgrab

Features

Customizable Parallelism
Ability to Render JavaScript (including Single Page Applications such as Angular and React)

Usage

Usage of urlgrab:
  -cache-dir string
        Specify a directory to utilize caching. Works between sessions as well.
  -debug
        Extremely verbose debugging output. Useful mainly for development.
  -delay int
        Milliseconds to randomly apply as a delay between requests. (default 2000)
  -depth int
        The maximum limit on the recursion depth of visited URLs.  (default 2)
  -headless
        If true the browser will be displayed while crawling.
        Note: Requires render-js flag
        Note: Usage to show browser: --headless=false (default true)
  -ignore-query
        Strip the query portion of the URL before determining if we've visited it yet.
  -ignore-ssl
        Scrape pages with invalid SSL certificates
  -js-timeout int
        The amount of seconds before a request to render javascript should timeout. (default 10)
  -json string
        The filename where we should store the output JSON file.
  -max-body int
        The limit of the retrieved response body in kilobytes.
        0 means unlimited.
        Supply this value in kilobytes. (i.e. 10 * 1024kb = 10MB) (default 10240)
  -no-head
        Do not send HEAD requests prior to GET for pre-validation.
  -output-all string
        The directory where we should store the output files.
  -proxy string
        The SOCKS5 proxy to utilize (format: socks5://127.0.0.1:8080 OR http://127.0.0.1:8080).
        Supply multiple proxies by separating them with a comma.
  -random-agent
        Utilize a random user agent string.
  -render-js
        Determines if we utilize a headless chrome instance to render javascript.
  -root-domain string
        The root domain we should match links against.
        If not specified it will default to the host of --url.
        Example: --root-domain google.com
  -threads int
        The number of threads to utilize. (default 5)
  -timeout int
        The amount of seconds before a request should timeout. (default 10)
  -url string
        The URL where we should start crawling.
  -urls string
        A file path that contains a list of urls to supply as starting urls.
        Requires --root-domain flag.
  -user-agent string
        A user agent such as (Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0).
  -verbose
        Verbose output
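
Two of the flags above, -ignore-query and -root-domain, describe filtering rules that are easy to picture in code. The following is only a minimal sketch of how such rules could work, not urlgrab's actual implementation: the function names visitKey and inRootDomain are hypothetical, and treating subdomains as part of the root domain is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// visitKey (hypothetical helper) returns the string used to decide whether
// a URL has already been visited. With ignoreQuery set, the query portion
// is stripped first, mirroring the description of -ignore-query.
func visitKey(raw string, ignoreQuery bool) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if ignoreQuery {
		u.RawQuery = ""
		u.ForceQuery = false
	}
	return u.String(), nil
}

// inRootDomain (hypothetical helper) reports whether the URL's host is the
// root domain itself or one of its subdomains. Accepting subdomains is an
// assumption made for this sketch.
func inRootDomain(raw, root string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	root = strings.ToLower(root)
	return host == root || strings.HasSuffix(host, "."+root)
}

func main() {
	seen := map[string]bool{}
	links := []string{
		"https://example.com/page?id=1",
		"https://example.com/page?id=2", // same page once the query is stripped
		"https://blog.example.com/post",
		"https://other.test/page", // outside the root domain
	}
	for _, link := range links {
		if !inRootDomain(link, "example.com") {
			continue
		}
		key, err := visitKey(link, true)
		if err != nil || seen[key] {
			continue
		}
		seen[key] = true
		fmt.Println("visit:", key)
	}
}
```

In this sketch the second link is skipped because it maps to the same key as the first, and the last link is skipped because its host does not match the root domain.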

Build

You can build a binary specific to your platform into the bin directory with the following command:

make build

If you want to build binaries for Windows, Linux and macOS to distribute the CLI, run this command:

make cross

All the binaries will be available in the dist directory.

Author

👤 Devin Stokes

Twitter: @DevinStokes
GitHub: @IAmStoxe

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page.

Show your support

Give a ⭐ if this project helped you!

Stars: ✭ 285