IAmStoxe / Urlgrab

A golang utility to spider through a website searching for additional links.

Projects that are alternatives to or similar to Urlgrab

ZhengFang System Spider
🐛 A small spider that logs into the ZhengFang (正方) educational administration system and scrapes its data.
Stars: ✭ 21 (-92.63%)
Mutual labels:  spider
Zsky
A DHT magnet-link BT search engine, written in pure Python.
Stars: ✭ 256 (-10.18%)
Mutual labels:  spider
Java Spider
A Java crawler framework built on top of webmagic. It crawls news content from Tencent, Sohu, Toutiao (as a separately integrated feature), and other sources, integrates with Elasticsearch for automated crawling, and is in production use.
Stars: ✭ 276 (-3.16%)
Mutual labels:  spider
ip proxy pool
Dynamically generates Scrapy spiders to crawl and validate free proxy IPs found on the internet.
Stars: ✭ 39 (-86.32%)
Mutual labels:  spider
ShuWo Spider
Automation scripts for the ShuWo (书蜗) library app (seat grabbing & loan renewal).
Stars: ✭ 14 (-95.09%)
Mutual labels:  spider
Dumpall
An information-disclosure exploitation tool for leaked .git/.svn source repositories and .DS_Store files.
Stars: ✭ 250 (-12.28%)
Mutual labels:  spider
toutiao
A spider for the Toutiao (今日头条) technology news API.
Stars: ✭ 17 (-94.04%)
Mutual labels:  spider
Hacker News Digest
📰 A responsive interface for Hacker News with summaries and thumbnails.
Stars: ✭ 278 (-2.46%)
Mutual labels:  spider
galer
A fast tool to fetch URLs from HTML attributes by crawling.
Stars: ✭ 138 (-51.58%)
Mutual labels:  spider
Bt Btt
An introduction to the magnet-link site U3C3 and its domain-name updates.
Stars: ✭ 261 (-8.42%)
Mutual labels:  spider
Douban Crawler
A crawler for https://douban.com.
Stars: ✭ 13 (-95.44%)
Mutual labels:  spider
bocfx
Bank of China foreign-exchange rate spider / API.
Stars: ✭ 30 (-89.47%)
Mutual labels:  spider
Dpspider
A Dianping (大众点评) spider and API: crawl or search by city, district, or individual shop, with multi-category district search, data extraction, MongoDB storage support, and decryption of obfuscated review text.
Stars: ✭ 259 (-9.12%)
Mutual labels:  spider
PttImageSpider
A PTT image downloader that scrapes every image on a board, using post titles as folder names (built with Scrapy).
Stars: ✭ 16 (-94.39%)
Mutual labels:  spider
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-2.81%)
Mutual labels:  spider
TwEater
A Python Bot for Scraping Conversations from Twitter
Stars: ✭ 16 (-94.39%)
Mutual labels:  spider
Tieba spider
A Baidu Tieba spider (based on Scrapy and MySQL).
Stars: ✭ 257 (-9.82%)
Mutual labels:  spider
Crawlertutorial
A minimal crawler tutorial (fetch, parse, search, multiprocessing, API), using PTT as the example.
Stars: ✭ 282 (-1.05%)
Mutual labels:  spider
Alltheplaces
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Stars: ✭ 277 (-2.81%)
Mutual labels:  spider
Happy Spiders
🔧 🔩 🔨 A curated collection of crawler tools, simulated-login techniques, proxy IPs, Scrapy template code, and more.
Stars: ✭ 261 (-8.42%)
Mutual labels:  spider

Welcome to urlgrab 👋

Twitter: DevinStokes

A golang utility to spider through a website searching for additional links with support for JavaScript rendering.

Install

go get -u github.com/iamstoxe/urlgrab
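
On Go 1.17 and newer, go get no longer builds and installs binaries; assuming the same module path as above, the equivalent go install form is:

go install github.com/iamstoxe/urlgrab@latest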

Features

  • Customizable Parallelism
  • Ability to Render JavaScript (including Single Page Applications built with frameworks such as Angular and React)

Usage

Usage of urlgrab:
  -cache-dir string
        Specify a directory to utilize caching. Works between sessions as well.
  -debug
        Extremely verbose debugging output. Useful mainly for development.
  -delay int
        Milliseconds to randomly apply as a delay between requests. (default 2000)
  -depth int
        The maximum limit on the recursion depth of visited URLs.  (default 2)
  -headless
        Run the browser in headless mode (no visible window) while crawling.
        Note: Requires the render-js flag
        Note: To display the browser, use --headless=false (default true)
  -ignore-query
        Strip the query portion of the URL before determining if we've visited it yet.
  -ignore-ssl
        Scrape pages with invalid SSL certificates
  -js-timeout int
        The amount of seconds before a request to render javascript should timeout. (default 10)
  -json string
        The filename where we should store the output JSON file.
  -max-body int
        The limit of the retrieved response body in kilobytes.
        0 means unlimited.
        Supply this value in kilobytes (e.g. 10240 KB = 10 MB). (default 10240)
  -no-head
        Do not send HEAD requests prior to GET for pre-validation.
  -output-all string
        The directory where we should store the output files.
  -proxy string
        The proxy to utilize (format: socks5://127.0.0.1:8080 or http://127.0.0.1:8080).
        Supply multiple proxies by separating them with a comma.
  -random-agent
        Utilize a random user agent string.
  -render-js
        Determines if we utilize a headless chrome instance to render javascript.
  -root-domain string
        The root domain we should match links against.
        If not specified it will default to the host of --url.
        Example: --root-domain google.com
  -threads int
        The number of threads to utilize. (default 5)
  -timeout int
        The amount of seconds before a request should timeout. (default 10)
  -url string
        The URL where we should start crawling.
  -urls string
        A file path that contains a list of URLs to supply as starting URLs.
        Requires --root-domain flag.
  -user-agent string
        A user agent such as (Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0).
  -verbose
        Verbose output
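
For example, an illustrative invocation (example.com is a placeholder target) that crawls with JavaScript rendering enabled and writes results to a JSON file:

urlgrab --url https://example.com --depth 3 --threads 10 --render-js --json results.json

And one that routes requests through a local SOCKS5 proxy with a randomized user agent, ignoring invalid SSL certificates:

urlgrab --url https://example.com --proxy socks5://127.0.0.1:9050 --random-agent --ignore-ssl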

Build

You can easily build a binary specific to your platform into the bin directory with the following command:

make build

If you want to build binaries for Windows, Linux, and macOS to distribute the CLI, just run this command:

make cross

All the binaries will be available in the dist directory.
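
If make is unavailable, the Go toolchain can produce cross-platform binaries directly; a rough equivalent of the cross target (the output names are illustrative, not taken from the project's Makefile):

GOOS=linux GOARCH=amd64 go build -o dist/urlgrab-linux-amd64
GOOS=darwin GOARCH=amd64 go build -o dist/urlgrab-darwin-amd64
GOOS=windows GOARCH=amd64 go build -o dist/urlgrab-windows-amd64.exe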

Author

👤 Devin Stokes

🤝 Contributing

Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.

Show your support

Give a ⭐ if this project helped you!

Buy Me A Coffee
