
adamdehaven / fetchurls

License: MIT license
A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

Programming Languages

shell
77523 projects

Projects that are alternatives to or similar to fetchurls

wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-46.39%)
Mutual labels:  spider, wget, crawl
Grab Site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Stars: ✭ 680 (+601.03%)
Mutual labels:  spider, crawl
Infospider
INFO-SPIDER is an all-in-one crawler toolbox 🧰 covering many data sources, designed to help users safely and quickly take back their own data; the code is open source and the workflow is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Aliyun Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, generated WeChat Moments albums, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.
Stars: ✭ 5,984 (+6069.07%)
Mutual labels:  spider, crawl
Pspider
A simple distributed spider framework.
Stars: ✭ 102 (+5.15%)
Mutual labels:  spider, crawl
Geetest
geetest, slider CAPTCHA.
Stars: ✭ 293 (+202.06%)
Mutual labels:  spider, crawl
Zhihu Login
Simulated Zhihu login, with support for extracting the CAPTCHA and saving cookies.
Stars: ✭ 340 (+250.52%)
Mutual labels:  spider, crawl
Novel Plus
Novel Plus is a full-featured, multi-platform (PC, WAP) original-fiction CMS made up of several subsystems, including a front-end portal, an author back end, a platform back end, and a crawler management system. It supports multiple templates, member top-ups, subscription mode, news publishing, and real-time statistical reports; new books are added automatically and existing books are updated automatically.
Stars: ✭ 1,122 (+1056.7%)
Mutual labels:  spider, crawl
Nodespider
[DEPRECATED] Simple, flexible, delightful web crawler/spider package
Stars: ✭ 33 (-65.98%)
Mutual labels:  spider, crawl
Python3 Spider
Hands-on Python crawlers: simulated logins to major sites, including but not limited to slider CAPTCHAs, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star it ❤️
Stars: ✭ 2,129 (+2094.85%)
Mutual labels:  spider, crawl
Proxy pool
Python crawler proxy IP pool.
Stars: ✭ 13,964 (+14295.88%)
Mutual labels:  spider, crawl
Crack Js Spider
Cracks JavaScript anti-crawler encryption parameters. Already covered: China Judgements Online (updated 2020-06-30), Taobao passwords, Tian An Insurance login, Bilibili login, Fang.com login, WPS login, Weibo login, Youdao Translate, NetEase login, WeChat Official Accounts login, Kongzhong login, Jinmubiao login, a student information management system login, Gongying Finance login, the Chongqing Science and Technology Resource Sharing Platform login, NetEase Cloud Music downloads, one-click video link parsing, and Cailian Press login.
Stars: ✭ 175 (+80.41%)
Mutual labels:  spider, crawl
Scrapy IPProxyPool
Free IP proxy pool; a plugin for the Scrapy crawler framework.
Stars: ✭ 100 (+3.09%)
Mutual labels:  spider, crawl
crawler-chrome-extensions
Chrome extensions commonly used by crawler developers.
Stars: ✭ 53 (-45.36%)
Mutual labels:  spider, crawl
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+452.58%)
Mutual labels:  spider, crawl
Bitextor
Bitextor generates translation memories from multilingual websites.
Stars: ✭ 168 (+73.2%)
Mutual labels:  wget, crawl
Geetest
Slider CAPTCHA; hope this helps you ❤️
Stars: ✭ 114 (+17.53%)
Mutual labels:  spider, crawl
gospider
⚡ Lightweight Golang spider framework.
Stars: ✭ 183 (+88.66%)
Mutual labels:  spider, crawl
gathertool
gathertool is a Golang library for script-style development, aimed at making development for common scenarios faster: a lightweight crawler library, API testing and load testing library, database helpers, and more.
Stars: ✭ 36 (-62.89%)
Mutual labels:  spider, crawl
crawlBaiduWenku
Possibly the most complete project for crawling Baidu Wenku.
Stars: ✭ 63 (-35.05%)
Mutual labels:  spider
TikTokDownloader PyWebIO
🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous Douyin/TikTok data scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+847.42%)
Mutual labels:  spider

fetchurls

A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

Usage

  1. Download the script and save it to the desired location on your machine.

  2. You'll need wget installed on your machine.

    To check if it is already installed, try running the command wget by itself.

    If you are on a Mac or running Linux, chances are you already have wget installed; however, if the wget command is not working, it may not be properly added to your PATH variable.

    If you are running Windows:

    1. Download the latest wget binary for Windows from https://eternallybored.org/misc/wget/

      The download is available as a zip with documentation, or just an exe. I'd recommend just the exe.

    2. If you downloaded the zip, extract all of the files (if the Windows built-in zip utility gives an error, use 7-Zip). In addition, if you downloaded the 64-bit version, rename the wget64.exe file to wget.exe

    3. Move wget.exe to C:\Windows\System32\

  3. Ensure the version of grep on your computer supports -E, --extended-regexp. To check for support, run grep --help and look for the flag. To check the installed version, run grep -V. A combined pre-flight check for both wget and grep -E is shown after this list.

  4. Open Git Bash, Terminal, etc. and set execute permissions for the fetchurls.sh script:

    chmod +x /path/to/script/fetchurls.sh
  5. Enter the following to run the script:

    ./fetchurls.sh [OPTIONS]...

    Alternatively, you may execute with either of the following:

    sh ./fetchurls.sh [OPTIONS]...
    
    # -- OR -- #
    
    bash ./fetchurls.sh [OPTIONS]...
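
    As referenced in steps 2 and 3, the following optional pre-flight check confirms that wget is installed and that grep supports extended regular expressions. It is a convenience sketch only, not part of fetchurls itself:

    # Verify wget is available on the PATH.
    command -v wget >/dev/null 2>&1 && echo "wget found" || echo "wget not found - see step 2"

    # Verify grep supports -E (extended regular expressions).
    echo "test" | grep -E "t(e|a)st" >/dev/null && echo "grep -E supported" || echo "grep -E not supported"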

If you do not pass any options, the script will run in interactive mode.

If the domain URL requires authentication, you must pass the username and password as flags; you are not prompted for these values in interactive mode.
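
For example, a minimal non-interactive run looks like the sketch below (the domain value is a placeholder); the remaining options fall back to the defaults described in the Options section:

    ./fetchurls.sh --domain https://example.com --non-interactive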

Options

You may pass options (as flags) directly to the script, or pass nothing to run the script in interactive mode.

domain

  • Usage: -d, --domain
  • Example: https://example.com

The fully qualified domain URL (with protocol) you would like to crawl.

Ensure that you enter the correct protocol (e.g. https) and subdomain for the URL, or the generated file may be empty or incomplete. The script will automatically attempt to follow the first HTTP redirect, if found. For example, if you enter the incorrect protocol (http://...) for https://www.adamdehaven.com, the script will follow the redirect and fetch all URLs for the correct HTTPS protocol.

The domain's URLs will be successfully spidered as long as the target URL (or the first redirect) returns a status of HTTP 200 OK.
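
If you are unsure whether a URL returns HTTP 200 OK (or a single redirect), you can check it with wget before running the script; this is an optional sanity check, not something fetchurls requires:

    # Print the server response headers without downloading the page.
    # wget writes this output to stderr, hence the 2>&1 redirect.
    wget --spider --server-response https://www.adamdehaven.com 2>&1 | grep "HTTP/"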

location

  • Usage: -l, --location
  • Default: ~/Desktop
  • Example: /c/Users/username/Desktop

The location (directory) where you would like to save the generated results.

If the directory does not exist, it will be created automatically, as long as the rest of the path is valid.

filename

  • Usage: -f, --filename
  • Default: domain-topleveldomain
  • Example: example-com

The desired name of the generated file, without spaces or file extension.

exclude

  • Usage: -e, --exclude
  • Default: bmp|css|doc|docx|gif|jpeg|jpg|JPG|js|map|pdf|PDF|png|ppt|pptx|svg|ts|txt|xls|xlsx|xml
  • Example: pdf|doc|docx

A pipe-delimited list of file extensions to exclude from the results.

To prevent excluding any files, pass an empty string ("") in place of the default list.
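
For example (the values are illustrative), you can supply a shorter exclusion list or disable the filtering entirely:

    # Exclude only PDF and Word documents from the results.
    ./fetchurls.sh --domain https://example.com --exclude "pdf|doc|docx"

    # Keep every file type by passing an empty string.
    ./fetchurls.sh --domain https://example.com --exclude ""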

sleep

  • Usage: -s, --sleep
  • Default: 0
  • Example: 2

The number of seconds to wait between retrievals.

username

  • Usage: -u, --username
  • Example: marty_mcfly

If the domain URL requires authentication, the username to pass to the wget command.

If the username contains spaces, wrap it in quotes. This value can only be set with a flag; there is no prompt for it in interactive mode.

password

  • Usage: -p, --password
  • Example: thats_heavy

If the domain URL requires authentication, the password to pass to the wget command.

If the password contains spaces, wrap it in quotes. This value can only be set with a flag; there is no prompt for it in interactive mode.
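
For example (the credentials here are placeholders), quote any values that contain spaces:

    ./fetchurls.sh \
      --domain https://example.com \
      --username "marty_mcfly" \
      --password "thats heavy"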

non-interactive

  • Usage: -n, --non-interactive

Allows the script to run successfully in a non-interactive shell.

The script will use the default --location and --filename settings unless the respective flags are explicitly set.

ignore-robots

  • Usage: -i, --ignore-robots

Ignore robots.txt for the domain.

wget

  • Usage: -w, --wget

Show wget install instructions. The installation instructions may vary depending on your computer's configuration.

version

  • Usage: -v, -V, --version

Show version information.

troubleshooting

  • Usage: -t, --troubleshooting

Outputs received option flags with their associated values at runtime for troubleshooting.

help

  • Usage: -h, -?, --help

Show the help content.
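
Putting several of the options above together, a scripted run (for example, from a scheduled job) might look like the following sketch; every value shown is a placeholder:

    ./fetchurls.sh \
      --domain https://example.com \
      --location /c/Users/username/Desktop \
      --filename example-com \
      --exclude "pdf|doc|docx" \
      --sleep 2 \
      --ignore-robots \
      --non-interactive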

Interactive Mode

If you do not pass the --domain flag, the script will run in interactive mode and you will be prompted for the unset options.

First, you will be prompted to enter the full URL (including HTTPS/HTTP protocol) of the site you would like to crawl:

Fetch a list of unique URLs for a domain.

Enter the full domain URL ( http://example.com )
Domain URL:

You will then be prompted to enter the location (directory) where you would like the generated results to be saved (defaults to ~/Desktop):

Save file to directory
Directory: /c/Users/username/Desktop

Next, you will be prompted to change/accept the name of the generated file (simply press enter to accept the default filename):

Save file as
Filename (no file extension, and no spaces): example-com

Finally, you will be prompted to change/accept the default list of excluded file extensions (press enter to accept the default list):

Exclude files with matching extensions
Excluded extensions: bmp|css|doc|docx|gif|jpeg|jpg|JPG|js|map|pdf|PDF|png|ppt|pptx|svg|ts|txt|xls|xlsx|xml

The script will crawl the site and compile a list of valid URLs into a new text file. When complete, the script will show a message and the location of the generated file:

Fetching URLs for example.com

Finished with 1 result!

File Location:
/c/Users/username/Desktop/example-com.txt

If a file of the same name already exists at the location (e.g. if you previously ran the script for the same URL), the original file will be overwritten.
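
The generated file is a plain text list of URLs, so you can inspect it with standard command-line tools once the script finishes; the examples below assume one URL per line and use a placeholder path:

    # Count the number of fetched URLs (assumes one URL per line).
    wc -l < /c/Users/username/Desktop/example-com.txt

    # Show only URLs that contain a given path segment.
    grep "/blog/" /c/Users/username/Desktop/example-com.txt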

Excluded Files and Directories

The script, by default, filters out many file extensions that are commonly not needed.

The list of file extensions can be passed via the --exclude flag, or provided via the interactive mode.

Excluded Files

  • .bmp
  • .css
  • .doc
  • .docx
  • .gif
  • .jpeg
  • .jpg
  • .JPG
  • .js
  • .map
  • .pdf
  • .PDF
  • .png
  • .ppt
  • .pptx
  • .svg
  • .ts
  • .txt
  • .xls
  • .xlsx
  • .xml

Excluded Directories

In addition, certain site-specific files and directories (primarily WordPress-related) are also ignored.

  • /wp-content/uploads/
  • /feed/
  • /category/
  • /tag/
  • /page/
  • /widgets.php/
  • /wp-json/
  • xmlrpc

Advanced Usage

The script should filter out most unwanted file types and directories; however, you can adjust the regular expressions that filter out certain pages, directories, and file types by editing the fetchUrlsForDomain() function in the fetchurls.sh file.

Warning: If you're not familiar with grep or regular expressions, you can easily break the script.
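
For orientation only, and not the script's actual code, the kind of extended-regexp filtering involved looks roughly like the sketch below; the file names urls.txt and filtered-urls.txt are hypothetical:

    # Rough illustration of extension and directory filtering with grep -E.
    grep -vE "\.(bmp|css|docx?|gif|jpe?g|JPG|js|map|pdf|PDF|png|pptx?|svg|ts|txt|xlsx?|xml)$" urls.txt \
      | grep -vE "/wp-content/uploads/|/(feed|category|tag|page|wp-json)/|/widgets\.php|xmlrpc" \
      | sort -u > filtered-urls.txt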
