
gingerbeardman / iwata-asks-downloader

Licence: MIT License
Tool to download Iwata Asks interviews (none of which are stored in this repo)

Programming Languages

Python, CSS, HTML, Shell

Projects that are alternatives of or similar to iwata-asks-downloader

Git History
Quickly browse the history of a file from any git repository
Stars: ✭ 12,676 (+74464.71%)
Mutual labels:  text, history
desafios-iddog
iddog challenge for frontend and mobile
Stars: ✭ 21 (+23.53%)
Mutual labels:  interviews
sinonimo
🇧🇷 Sinonimo is a Node package that provides synonyms for Portuguese words
Stars: ✭ 14 (-17.65%)
Mutual labels:  text
pytextcodifier
📦 Turn your text files into codified images or your codified images into text files.
Stars: ✭ 14 (-17.65%)
Mutual labels:  text
SuperPhotoStudio
Take pictures of your favorite characters, in glorious Hori-HD (800px mode)!
Stars: ✭ 16 (-5.88%)
Mutual labels:  nintendo
TextInputLayout
The objective of this code is to guide you to create login screen with TextInputLayout in iOS app.
Stars: ✭ 30 (+76.47%)
Mutual labels:  text
UndoRedo.js
A powerful and simple JavaScript library that provides a history for undo/redo functionality. Just like a time machine! 🕐
Stars: ✭ 19 (+11.76%)
Mutual labels:  history
Java-Rule-Book
Basic concepts of Java to answer any question about how Java works
Stars: ✭ 36 (+111.76%)
Mutual labels:  interviews
awesome-web-styling
Awesome Web Styling with CSS Animation Effects ⭐️
Stars: ✭ 109 (+541.18%)
Mutual labels:  text
NostalgiaLite
Three game emulators: FC(Nes), GG, GBC for Android
Stars: ✭ 85 (+400%)
Mutual labels:  nintendo
programming-history
Inspired by Cajori’s A History of Mathematical Notations, and/or TV Tropes.
Stars: ✭ 44 (+158.82%)
Mutual labels:  history
valorant.js
This is an unofficial NodeJS library for interacting with the VALORANT API used in game.
Stars: ✭ 48 (+182.35%)
Mutual labels:  history
markdown-utils
Convert plain text into snippets of markdown.
Stars: ✭ 28 (+64.71%)
Mutual labels:  text
WorldSim
2D tile-based sandbox RPG with procedurally generated fantasy world simulator 🌏
Stars: ✭ 19 (+11.76%)
Mutual labels:  history
ss-search
The most basic, yet powerful text search.
Stars: ✭ 41 (+141.18%)
Mutual labels:  text
ultra-router
Router for component-based web apps. Pair with React or <BYOF />.
Stars: ✭ 35 (+105.88%)
Mutual labels:  history
texthighlighter
a no dependency typescript npm package for highlighting user selected text
Stars: ✭ 17 (+0%)
Mutual labels:  text
bitcoin-development-history
Data and an example for an open source timeline of the history of Bitcoin development
Stars: ✭ 27 (+58.82%)
Mutual labels:  history
eBookReaderNX
A Nintendo Switch eBook Reader
Stars: ✭ 15 (-11.76%)
Mutual labels:  nintendo
split
A string split function and iterator for Lua
Stars: ✭ 15 (-11.76%)
Mutual labels:  text

Iwata Asks Downloader

This tool downloads the Iwata Asks series of interviews, saving as Markdown and HTML with images.

I created this tool in Spring/Summer 2019 so that I could more easily read and search the Iwata Asks interviews.

Note: This tool was developed and tested on macOS and works on Linux; I'm not sure whether it works on Windows.

Fund Development

You can fund development of this tool, or just say thanks, through one of the following:

Your support is appreciated!

Copyright Notice

  • None of the Iwata Asks interview content is stored here!
  • The Iwata Asks interview content remains copyright of its creators.
  • This tool and its output are meant for personal use only.
  • Don't do anything you shouldn't do with the content.
  • Watch out for the Ninjas!

Prerequisites

  • Python 3, with:
    • pip, which may require the Xcode Command Line Tools ($ xcode-select --install)
    • markdown ($ python -m pip install markdown)
    • jinja2 ($ python -m pip install jinja2)
    • Pillow ($ python -m pip install Pillow)
  • Scrapy ($ pip install scrapy)
  • Pandoc ($ brew install pandoc)

Note: macOS Catalina users will need to use pip3 and add --user to the end of each such command
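
As a quick sanity check before running, the prerequisites above can be verified with a short Python snippet (a hypothetical helper, not part of the repo):

```python
# Hypothetical helper: report which of the prerequisites listed above
# are missing from the current environment.
import importlib.util
import shutil

def missing_prereqs():
    # Pillow is imported as "PIL"; the others import under their own names.
    missing = [mod for mod in ("markdown", "jinja2", "PIL", "scrapy")
               if importlib.util.find_spec(mod) is None]
    if shutil.which("pandoc") is None:
        missing.append("pandoc")
    return missing

print(missing_prereqs())  # an empty list means you're ready to go
```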

Usage

  1. Make sure you're running Python 3 ($ python -V)
  2. Run the scraper using the script as follows: ./get_all.sh iwata-eu.csv
  3. Watch the progress bar as the process completes (approx. 25 minutes on first run)
  4. Output is placed in the _md, _html and _images folders

Optional (requires pandoc)

  • Run to_epub.sh to convert the HTML files to EPUB

How does this work?

Scrapy is a framework for creating web spiders.

A web spider loads a web page and extracts content from it according to rules you define.

This tool uses a list of URLs for the first page of each interview (iwata-eu.csv) to feed the scraper, whose web spider (iwata-eu.py) extracts the content and automatically includes subsequent pages by following the original page navigation links. The main loop process is controlled by a shell script (get_all.sh).
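
The follow-the-navigation-links pattern described above can be sketched in plain Python (the in-memory pages here are hypothetical stand-ins for real HTTP responses; the actual spider uses Scrapy):

```python
# Sketch of the pattern: start at a seed URL, collect content,
# then follow each page's "next" link until there are no more pages.
pages = {
    "/interview/p1": {"text": "Part 1", "next": "/interview/p2"},
    "/interview/p2": {"text": "Part 2", "next": "/interview/p3"},
    "/interview/p3": {"text": "Part 3", "next": None},
}

def crawl(start):
    """Collect text from each page, following 'next' links until none remain."""
    collected, url = [], start
    while url is not None:
        page = pages[url]
        collected.append(page["text"])
        url = page["next"]
    return collected

print(crawl("/interview/p1"))  # → ['Part 1', 'Part 2', 'Part 3']
```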

Currently the scraper only works on the EU series of interviews due to their static page structure being more suitable (the USA interviews use AJAX to load content). The EU list has 178 seed URLs, most of which have multiple pages, so download and processing of over 30,000 files takes quite a while the first time (approx. 25 minutes). Subsequent runs will use cached data and be much quicker (approx. 13 minutes). The final resulting output should be 178 files each of Markdown/HTML, along with 3,416 images.

The scraper parses out the following content:

  • Page Title (title)
  • Section Heading (heading)
  • Interviewer Name (name)
  • Interviewer Text (text)
  • Related Image (image)

The content from multiple pages is processed, reformatted as Markdown and HTML, and finally saved to disk as a single file.
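
To illustrate, a scraped item with the fields listed above might be rendered as Markdown like this (a hypothetical sketch; the field names match the list, but the item shape and formatter are illustrative, not the repo's actual pipeline code):

```python
# Hypothetical item matching the parsed fields listed above.
item = {
    "title": "Iwata Asks: Example",
    "heading": "1. Getting Started",
    "name": "Iwata:",
    "text": "Thank you for your time today.",
    "image": "img/example.jpg",
}

def to_markdown(item):
    """Render one scraped item as a Markdown fragment."""
    return "\n\n".join([
        f"# {item['title']}",
        f"## {item['heading']}",
        f"**{item['name']}** {item['text']}",
        f"![]({item['image']})",
    ])

print(to_markdown(item))
```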

Note: HTML generation accounts for approx. 3 minutes of processing time.

Generating ePub

Single ePub versions of each HTML file can be generated using the script to_epub.sh.

Finally, you can combine the ePub files into one book using script: (TO DO)

Content Status

| Output   | Generates | Validates | Notes                          |
|----------|-----------|-----------|--------------------------------|
| Markdown |           |           | Needs linting/tidying          |
| HTML     |           |           | Needs linting/tidying          |
| ePub     |           |           | Links need to be internalised  |

Development Setup

You'll need to familiarise yourself with Scrapy and go through their tutorial before diving in.

Important files:

  • /iwata-eu.csv (list of seed URLs)
  • /iwata/ (folder)
  • /iwata/pipelines.py (pipeline definitions)
  • /iwata/settings.py (settings, including debug pipelines)
  • /iwata/spiders/ (folder)
  • /iwata/spiders/iwata-eu.py (the most important file, the spider itself!)

Notes

  • You'll see notes about the command lines I use to test the spider in the CodeRunner app; you should be able to run them on the command line too.
  • Scrapy caches content in /.scrapy/httpcache so you can develop using a cache of the pages rather than wait for downloading each time.
  • I recommend developing against a subset of pages and only using the full list (iwata-eu.csv) for your final output.
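
If the cache isn't already enabled, the relevant Scrapy settings in settings.py look like this (a sketch to check against the repo's own settings.py):

```python
# Scrapy HTTP cache settings: repeat runs read pages from disk
# instead of re-downloading them.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy/ folder
HTTPCACHE_EXPIRATION_SECS = 0    # 0 = cached pages never expire
```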

Contributions

I will happily accept and merge any PR that improves this tool. I wrote it while learning Scrapy, so there is undoubtedly room for improvement. Contributions are very welcome, for example:

  • Optimisations that speed up any part of the processing
  • Improvements to readable output
  • Improvements to format conversion
  • Adding missing interviews (each source will require a new spider)
  • Improvements to README.md

Changelog

  • 2020-01-10: Now uses accurate progress bar
  • 2020-01-06: Added EPUB generation
  • 2020-01-05: Public Release
  • 2019-07-03: Support for multiple URLs
  • 2019-06-22: Saves as Markdown and HTML
  • 2019-04-15: Initial scraper and spider

Licence

MIT

Screenshots

Online

Local

ePub
