
sangaline / Wayback Machine Scraper

License: ISC
A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Wayback Machine Scraper

Cascadia
Go cascadia package command line CSS selector
Stars: ✭ 67 (-70.87%)
Mutual labels:  command-line-tool, web-scraping
Kepubify
Fast, standalone EPUB to KEPUB converter CLI app / library (and a few other utilities).
Stars: ✭ 225 (-2.17%)
Mutual labels:  command-line-tool
Reset Windows Update Tool
Troubleshooting Tool with Windows Updates (Developed in Dev-C++).
Stars: ✭ 208 (-9.57%)
Mutual labels:  command-line-tool
Klog
A plain-text file format and command line tool for time tracking
Stars: ✭ 222 (-3.48%)
Mutual labels:  command-line-tool
Geek Life
The Todo List / Task Manager for Geeks in command line
Stars: ✭ 212 (-7.83%)
Mutual labels:  command-line-tool
Xcparse
Command line tool & Swift framework for parsing Xcode 11+ xcresult
Stars: ✭ 221 (-3.91%)
Mutual labels:  command-line-tool
Wwdchelper
⏬ Help you get WWDC info easily, especially for subtitles.
Stars: ✭ 208 (-9.57%)
Mutual labels:  command-line-tool
Wpk
a friendly, intuitive & intelligent CLI for webpack
Stars: ✭ 232 (+0.87%)
Mutual labels:  command-line-tool
Vocabs
📚 A lightweight online dictionary integration to the command line. No browsers. No paperbacks.
Stars: ✭ 226 (-1.74%)
Mutual labels:  command-line-tool
City Scrapers
Scrape, standardize and share public meetings from local government websites
Stars: ✭ 220 (-4.35%)
Mutual labels:  web-scraping
Licensed
⚖️ ✔️ licensed is an interactive command line tool to help you choose and add licenses to your projects
Stars: ✭ 220 (-4.35%)
Mutual labels:  command-line-tool
Short Jokes Dataset
Python scripts for building 'Short Jokes' dataset, featured on Kaggle
Stars: ✭ 215 (-6.52%)
Mutual labels:  web-scraping
Gitlab Cli
Create a merge request from command line in gitlab
Stars: ✭ 224 (-2.61%)
Mutual labels:  command-line-tool
Dry Cli
General purpose Command Line Interface (CLI) framework for Ruby
Stars: ✭ 210 (-8.7%)
Mutual labels:  command-line-tool
Docbao
A tool for scraping and analyzing keywords from Vietnamese online news sites
Stars: ✭ 230 (+0%)
Mutual labels:  web-scraping
R Web Scraping Cheat Sheet
Guide, reference and cheatsheet on web scraping using rvest, httr and Rselenium.
Stars: ✭ 207 (-10%)
Mutual labels:  web-scraping
Innodb Java Reader
A library and command-line tool to access MySQL InnoDB data file directly in Java
Stars: ✭ 217 (-5.65%)
Mutual labels:  command-line-tool
Bolter
Command-line app for viewing BoltDB file in your terminal
Stars: ✭ 222 (-3.48%)
Mutual labels:  command-line-tool
Termchat
Terminal chat through the LAN.
Stars: ✭ 229 (-0.43%)
Mutual labels:  command-line-tool
Amber
A code search / replace tool
Stars: ✭ 230 (+0%)
Mutual labels:  command-line-tool

The Wayback Machine Scraper Logo

The Wayback Machine Scraper

The repository consists of a command-line utility wayback-machine-scraper that can be used to scrape or download website data as it appears in archive.org's Wayback Machine. It crawls through historical snapshots of a website and saves the snapshots to disk. This can be useful when you're trying to scrape a site that has anti-scraping measures that make direct scraping impossible or prohibitively slow. It's also useful if you want to scrape a website as it appeared at some point in the past or to scrape information that changes over time.

The command-line utility is highly configurable in terms of what it scrapes but it only saves the unparsed content of the pages on the site. If you're interested in parsing data from the pages that are crawled then you might want to check out scrapy-wayback-machine instead. It's a downloader middleware that handles all of the tricky parts and passes normal response objects to your Scrapy spiders with archive timestamp information attached. The middleware is very unobtrusive and should work seamlessly with existing Scrapy middlewares, extensions, and spiders. It's what wayback-machine-scraper uses behind the scenes and it offers more flexibility for advanced use cases.
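
If you do go the middleware route, the setup is small. Below is a minimal sketch of a spider using scrapy-wayback-machine; it assumes the middleware path, the WAYBACK_MACHINE_TIME_RANGE setting, and the wayback_machine_time response meta key described in that project's documentation, so double-check them against the version you install.

import scrapy

class HackerNewsSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, used only for illustration.
    name = 'hackernews_history'
    start_urls = ['https://news.ycombinator.com']

    custom_settings = {
        # Enable the scrapy-wayback-machine downloader middleware.
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
        },
        # Only crawl snapshots captured within this time range.
        'WAYBACK_MACHINE_TIME_RANGE': (20070101, 20080101),
    }

    def parse(self, response):
        # The middleware attaches each snapshot's capture time to the response.
        snapshot_time = response.meta['wayback_machine_time']
        yield {
            'url': response.url,
            'captured_at': snapshot_time.isoformat(),
            'title': response.css('title::text').get(),
        }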

Installation

The package can be installed using pip.

pip install wayback-machine-scraper

Command-Line Interface

Writing a custom Scrapy spider and using the WaybackMachine middleware is the preferred way to use this project, but a command line interface for basic mirroring is also included. The usage information can be printed by running wayback-machine-scraper -h.

usage: wayback-machine-scraper [-h] [-o DIRECTORY] [-f TIMESTAMP]
                               [-t TIMESTAMP] [-a REGEX] [-d REGEX]
                               [-c CONCURRENCY] [-u] [-v]
                               DOMAIN [DOMAIN ...]

Mirror all Wayback Machine snapshots of one or more domains within a specified
time range.

positional arguments:
  DOMAIN                Specify the domain(s) to scrape. Can also be a full
                        URL to specify starting points for the crawler.

optional arguments:
  -h, --help            show this help message and exit
  -o DIRECTORY, --output DIRECTORY
                        Specify the output directory for the saved snapshots.
                        (default: website)
  -f TIMESTAMP, --from TIMESTAMP
                        The timestamp for the beginning of the range to
                        scrape. Can either be YYYYmmdd, YYYYmmddHHMMSS, or a
                        Unix timestamp. (default: 10000101)
  -t TIMESTAMP, --to TIMESTAMP
                        The timestamp for the end of the range to scrape. Use
                        the same timestamp as `--from` to specify a single
                        point in time. (default: 30000101)
  -a REGEX, --allow REGEX
                        A regular expression that all scraped URLs must match.
                        (default: ())
  -d REGEX, --deny REGEX
                        A regular expression to exclude matched URLs.
                        (default: ())
  -c CONCURRENCY, --concurrency CONCURRENCY
                        Target concurrency for crawl requests. The crawl rate
                        will be automatically adjusted to match this target.
                        Use values less than 1 to be polite and higher values
                        to scrape more quickly. (default: 10.0)
  -u, --unix            Save snapshots as `UNIX_TIMESTAMP.snapshot` instead of
                        the default `YYYYmmddHHMMSS.snapshot`. (default:
                        False)
  -v, --verbose         Turn on debug logging. (default: False)

Examples

The usage can perhaps be made clearer with a couple of concrete examples.

A Single Page Over Time

One of the key advantages of wayback-machine-scraper over other projects, such as wayback-machine-downloader, is that it offers the capability to download all available archive.org snapshots. This can be extremely useful if you're interested in analyzing how pages change over time.

For example, say that you would like to analyze many snapshots of the Hacker News front page, as I did when writing Reverse Engineering the Hacker News Algorithm. This can be done by running

wayback-machine-scraper -a 'news.ycombinator.com$' news.ycombinator.com

where the --allow regular expression news.ycombinator.com$ limits the crawl to the front page. This produces a file structure of

website/
└── news.ycombinator.com
    ├── 20070221033032.snapshot
    ├── 20070226001637.snapshot
    ├── 20070405032412.snapshot
    ├── 20070405175109.snapshot
    ├── 20070406195336.snapshot
    ├── 20070601184317.snapshot
    ├── 20070629033202.snapshot
    ├── 20070630222527.snapshot
    ├── 20070630222818.snapshot
    └── etc.

with each snapshot file containing the full HTML body of the front page.
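
Once the snapshots are on disk, turning them into a time series is mostly a matter of walking the directory and parsing the capture timestamps out of the filenames. Here is a minimal sketch using only the Python standard library; the directory layout matches the tree above, and you can swap in whatever HTML parser you prefer for the actual analysis.

from datetime import datetime
from pathlib import Path

# Directory produced by the crawl above.
snapshot_dir = Path('website/news.ycombinator.com')

snapshots = []
for path in sorted(snapshot_dir.glob('*.snapshot')):
    # Filenames are YYYYmmddHHMMSS.snapshot by default.
    captured_at = datetime.strptime(path.stem, '%Y%m%d%H%M%S')
    html = path.read_text(errors='replace')
    snapshots.append((captured_at, html))

if snapshots:
    print(f'{len(snapshots)} snapshots spanning '
          f'{snapshots[0][0]:%Y-%m-%d} to {snapshots[-1][0]:%Y-%m-%d}')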

A series of snapshots for any page can be obtained in this way as long as suitable regular expressions and start URLs are constructed. If we are interested in a page other than the homepage then we should use it as the start URL instead. To get all of the snapshots for a specific story we could run

wayback-machine-scraper -a 'id=13857086$' 'news.ycombinator.com/item?id=13857086'

which produces

website/
└── news.ycombinator.com
    └── item?id=13857086
        ├── 20170313225853.snapshot
        ├── 20170313231755.snapshot
        ├── 20170314043150.snapshot
        ├── 20170314165633.snapshot
        └── 20170320205604.snapshot

A Full Site Crawl at One Point In Time

If the goal is to take a snapshot of an entire site at a single point in time, this can also be easily achieved. Specifying the same timestamp for both the --from and --to options ensures that only one snapshot is saved for each URL. Running

wayback-machine-scraper -f 20080623 -t 20080623 news.ycombinator.com

produces a file structure of

website
└── news.ycombinator.com
    ├── 20080621143814.snapshot
    ├── item?id=221868
    │   └── 20080622151531.snapshot
    ├── item?id=222157
    │   └── 20080622151822.snapshot
    ├── item?id=222341
    │   └── 20080620221102.snapshot
    └── etc.

with a single snapshot for each page in the crawl as it appeared on June 23, 2008.
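
The remaining flags compose as you would expect. For example, a more polite crawl of the same date that saves Unix-timestamped snapshots could be run with

wayback-machine-scraper -f 20080623 -t 20080623 -c 0.5 -u news.ycombinator.com

using the --concurrency and --unix options described above.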
