IAmStoxe / Urlgrab
A golang utility to spider through a website searching for additional links.
Stars: ✭ 285
Programming Languages
go
31211 projects - #10 most used programming language
Labels
Projects that are alternatives of or similar to Urlgrab
Java Spider
一个基于webmagic框架二次开发的java爬虫框架实战,已实现能爬取腾讯,搜狐,今日头条(单独集成功能)等资讯内容,配合elasticsearch框架用法,实现了自动爬虫,已投入线上生产使用。
Stars: ✭ 276 (-3.16%)
Mutual labels: spider
ip proxy pool
Generating spiders dynamically to crawl and check those free proxy ip on the internet with scrapy.
Stars: ✭ 39 (-86.32%)
Mutual labels: spider
Hacker News Digest
📰 A responsive interface of Hacker News with summaries and thumbnails.
Stars: ✭ 278 (-2.46%)
Mutual labels: spider
galer
A fast tool to fetch URLs from HTML attributes by crawl-in.
Stars: ✭ 138 (-51.58%)
Mutual labels: spider
bocfx
中国银行外汇牌价爬虫 / API (Bank of China - Foreign Exchange - Spider/ API)
Stars: ✭ 30 (-89.47%)
Mutual labels: spider
Dpspider
大众点评爬虫、API,可以进行单独城市、单独地区、单独商铺的爬取、搜索、多类型地区搜索、信息获取、提供MongoDB数据库存储支持,可以进行点评文本解密的爬取、存储
Stars: ✭ 259 (-9.12%)
Mutual labels: spider
PttImageSpider
PTT 圖片下載器 (抓取整個看板的圖片,並用文章標題作為資料夾的名稱 ) (使用Scrapy)
Stars: ✭ 16 (-94.39%)
Mutual labels: spider
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-2.81%)
Mutual labels: spider
TwEater
A Python Bot for Scraping Conversations from Twitter
Stars: ✭ 16 (-94.39%)
Mutual labels: spider
Crawlertutorial
爬蟲極簡教學(fetch, parse, search, multiprocessing, API)- PTT 為例
Stars: ✭ 282 (-1.05%)
Mutual labels: spider
Alltheplaces
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Stars: ✭ 277 (-2.81%)
Mutual labels: spider
Happy Spiders
🔧 🔩 🔨 收集整理了爬虫相关的工具、模拟登陆技术、代理IP、scrapy模板代码等内容。
Stars: ✭ 261 (-8.42%)
Mutual labels: spider
Welcome to urlgrab 👋
A golang utility to spider through a website searching for additional links with support for JavaScript rendering.
Install
go get -u github.com/iamstoxe/urlgrab
Features
- Customizable Parallelism
- Ability to Render JavaScript (including Single Page Applications such as Angular and React)
Usage
Usage of urlgrab:
-cache-dir string
Specify a directory to utilize caching. Works between sessions as well.
-debug
Extremely verbose debugging output. Useful mainly for development.
-delay int
Milliseconds to randomly apply as a delay between requests. (default 2000)
-depth int
The maximum limit on the recursion depth of visited URLs. (default 2)
-headless
If true the browser will be displayed while crawling.
Note: Requires render-js flag
Note: Usage to show browser: --headless=false (default true)
-ignore-query
Strip the query portion of the URL before determining if we've visited it yet.
-ignore-ssl
Scrape pages with invalid SSL certificates
-js-timeout int
The amount of seconds before a request to render javascript should timeout. (default 10)
-json string
The filename where we should store the output JSON file.
-max-body int
The limit of the retrieved response body in kilobytes.
0 means unlimited.
Supply this value in kilobytes. (i.e. 10 * 1024kb = 10MB) (default 10240)
-no-head
Do not send HEAD requests prior to GET for pre-validation.
-output-all string
The directory where we should store the output files.
-proxy string
The SOCKS5 proxy to utilize (format: socks5://127.0.0.1:8080 OR http://127.0.0.1:8080).
Supply multiple proxies by separating them with a comma.
-random-agent
Utilize a random user agent string.
-render-js
Determines if we utilize a headless chrome instance to render javascript.
-root-domain string
The root domain we should match links against.
If not specified it will default to the host of --url.
Example: --root-domain google.com
-threads int
The number of threads to utilize. (default 5)
-timeout int
The amount of seconds before a request should timeout. (default 10)
-url string
The URL where we should start crawling.
-urls string
A file path that contains a list of urls to supply as starting urls.
Requires --root-domain flag.
-user-agent string
A user agent such as (Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0).
-verbose
Verbose output
Build
You can easily build a binary specific to your platform into the bin
directory with th following command:
make build
if you want to make binaries for Windows, Linux and MacOS to distribute the CLI, just run this command:
make cross
All the binaries will be available in the dist
directory.
Author
👤 Devin Stokes
- Twitter: @DevinStokes
- Github: @IAmStoxe
🤝 Contributing
Contributions, issues and feature requests are welcome!
Feel free to check issues page.
Show your support
Give a ⭐ if this project helped you!
Note that the project description data, including the texts, logos, images, and/or trademarks,
for each open source project belongs to its rightful owner.
If you wish to add or remove any projects, please contact us at [email protected].