All Projects → gjtorikian → Html Proofer

gjtorikian / Html Proofer

Licence: mit
Test your rendered HTML files to make sure they're accurate.

Programming Languages

ruby
36898 projects - #4 most used programming language

Labels

HTMLProofer

If you generate HTML files, then this tool might be for you.

Project scope

HTMLProofer is a set of tests to validate your HTML output. These tests check if your image references are legitimate, if they have alt tags, if your internal links are working, and so on. It's intended to be an all-in-one checker for your output.

In scope for this project is any well-known and widely-used test for HTML document quality. A major use for this project is continuous integration -- so we must have reliable results. We usually balance correctness over performance. And, if necessary, we should be able to trace this program's detection of HTML errors back to documented best practices or standards, such as W3 specifications.

Third-party modules. We want this product to be useful for continuous integration so we prefer to avoid subjective tests which are prone to false positive results, such as spell checkers, indentation checkers, etc. If you want to work on these items, please see the section on custom tests and consider adding an implementation as a third-party module.

Advanced configuration. Most front-end developers can test their HTML using our command line program. Advanced configuration will require using Ruby.

Installation

Add this line to your application's Gemfile:

gem 'html-proofer'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install html-proofer

NOTE: When installation speed matters, set NOKOGIRI_USE_SYSTEM_LIBRARIES to true in your environment. This is useful for increasing the speed of your Continuous Integration builds.

What's tested?

Below is mostly comprehensive list of checks that HTMLProofer can perform.

Images

img elements:

  • Whether all your images have alt tags
  • Whether your internal image references are not broken
  • Whether external images are showing
  • Whether your images are HTTP

Links

a, link elements:

  • Whether your internal links are working
  • Whether your internal hash references (#linkToMe) are working
  • Whether external links are working
  • Whether your links are HTTPS
  • Whether CORS/SRI is enabled

Scripts

script elements:

  • Whether your internal script references are working
  • Whether external scripts are loading
  • Whether CORS/SRI is enabled

Favicon

  • Whether your favicons are valid.

OpenGraph

  • Whether the images and URLs in the OpenGraph metadata are valid.

HTML

  • Whether your HTML markup is valid. This is done via Nokogumbo to validate well-formed HTML5 markup.

Usage

You can configure HTMLProofer to run on:

  • a file
  • a directory
  • an array of directories
  • an array of links

It can also run through the command-line, Docker, or as Rack middleware.

Using in a script

  1. Require the gem.
  2. Generate some HTML.
  3. Create a new instance of the HTMLProofer on your output folder.
  4. run that instance.

Here's an example:

require 'html-proofer'
require 'html/pipeline'
require 'find'

# make an out dir
Dir.mkdir("out") unless File.exist?("out")

pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::TableOfContentsFilter
], :gfm => true

# iterate over files, and generate HTML from Markdown
Find.find("./docs") do |path|
  if File.extname(path) == ".md"
    contents = File.read(path)
    result = pipeline.call(contents)

    File.open("out/#{path.split("/").pop.sub('.md', '.html')}", 'w') { |file| file.write(result[:output].to_s) }
  end
end

# test your out dir!
HTMLProofer.check_directory("./out").run

Checking a single file

If you simply want to check a single file, use the check_file method:

HTMLProofer.check_file('/path/to/a/file.html').run

Checking directories

If you want to check a directory, use check_directory:

HTMLProofer.check_directory('./out').run

If you want to check multiple directories, use check_directories:

HTMLProofer.check_directories(['./one', './two']).run

Checking an array of links

With check_links, you can also pass in an array of links:

HTMLProofer.check_links(['http://github.com', 'http://jekyllrb.com']).run

This configures Proofer to just test those links to ensure they are valid. Note that for the command-line, you'll need to pass a special --as-links argument:

Note: flags are different from the default ones provided above. The underscores are replaced with dashes.

allow_hash_href will be --allow-hash-href

htmlproofer www.google.com,www.github.com --as-links

Using on the command-line

You'll also get a new program called htmlproofer with this gem. Terrific!

Pass in options through the command-line as flags, like this:

htmlproofer --extension .html.erb ./out

Use htmlproofer --help to see all command line options, or take a peek here.

Special cases for the command-line

For options which require an array of input, surround the value with quotes, and don't use any spaces. For example, to exclude an array of HTTP status code, you might do:

htmlproofer --http-status-ignore "999,401,404" ./out

For something like url-ignore, and other options that require an array of regular expressions, you can pass in a syntax like this:

htmlproofer --url-ignore "/www.github.com/,/foo.com/" ./out

Since url_swap is a bit special, you'll pass in a pair of RegEx:String values. The escape sequences \: should be used to produce literal :s htmlproofer will figure out what you mean.

htmlproofer --url-swap "wow:cow,mow:doh" --extension .html.erb --url-ignore www.github.com ./out

Using with Jekyll

Want to use HTML Proofer with your Jekyll site? Awesome. Simply add gem 'html-proofer' to your Gemfile as described above, and add the following to your Rakefile, using rake test to execute:

require 'html-proofer'

task :test do
  sh "bundle exec jekyll build"
  options = { :assume_extension => true }
  HTMLProofer.check_directory("./_site", options).run
end

Don't have or want a Rakefile? You can also do something like the following:

htmlproofer --assume-extension ./_site

Using through Docker

If you have trouble with (or don't want to) install Ruby/Nokogumbo, the command-line tool can be run through Docker. See klakegg/html-proofer for more information.

Using as Rack middleware

You can run html-proofer as part of your Rack middleware to validate your HTML at runtime. For example, in Rails, add these lines to config/application.rb:

  config.middleware.use HTMLProofer::Middleware if Rails.env.test?
  config.middleware.use HTMLProofer::Middleware if Rails.env.development?

This will raise an error at runtime if your HTML is invalid. You can choose to skip validation of a page by adding ?proofer-ignore to the URL.

This is particularly helpful for projects which have extensive CI, since any invalid HTML will fail your build.

Ignoring content

Add the data-proofer-ignore attribute to any tag to ignore it from every check.

<a href="http://notareallink" data-proofer-ignore>Not checked.</a>

This can also apply to parent elements, all the way up to the <html> tag:

<div data-proofer-ignore>
  <a href="http://notareallink">Not checked because of parent.</a>
</div>

Ignoring new files

Say you've got some new files in a pull request, and your tests are failing because links to those files are not live yet. One thing you can do is run a diff against your base branch and explicitly ignore the new files, like this:

  directories = %w(content)
  merge_base = `git merge-base origin/production HEAD`.chomp
  diffable_files = `git diff -z --name-only --diff-filter=AC #{merge_base}`.split("\0")
  diffable_files = diffable_files.select do |filename|
    next true if directories.include?(File.dirname(filename))
    filename.end_with?('.md')
  end.map { |f| Regexp.new(File.basename(f, File.extname(f))) }

  HTMLProofer.check_directory('./output', { url_ignore: diffable_files }).run

Configuration

The HTMLProofer constructor takes an optional hash of additional options:

Option Description Default
allow_missing_href If true, does not flag a tags missing href (this is the default for HTML5). false
allow_hash_href If true, ignores the . false
alt_ignore An array of Strings or RegExps containing imgs whose missing alt tags are safe to ignore. []
assume_extension Automatically add extension (e.g. .html) to file paths, to allow extensionless URLs (as supported by Jekyll 3 and GitHub Pages) false
check_external_hash Checks whether external hashes exist (even if the webpage exists). This slows the checker down. false
check_favicon Enables the favicon checker. false
check_opengraph Enables the Open Graph checker. false
check_html Enables HTML validation errors from Nokogumbo false
check_img_http Fails an image if it's marked as http false
check_sri Check that <link> and <script> external resources use SRI false
checks_to_ignore An array of Strings indicating which checks you do not want to run []
directory_index_file Sets the file to look for when a link refers to a directory. index.html
disable_external If true, does not run the external link checker, which can take a lot of time. false
empty_alt_ignore If true, ignores images with empty alt tags. false
enforce_https Fails a link if it's not marked as https. false
error_sort Defines the sort order for error output. Can be :path, :desc, or :status. :path
extension The extension of your HTML files including the dot. .html
external_only Only checks problems with external references. false
file_ignore An array of Strings or RegExps containing file paths that are safe to ignore. []
http_status_ignore An array of numbers representing status codes to ignore. []
internal_domains An array of Strings containing domains that will be treated as internal urls. []
log_level Sets the logging level, as determined by Yell. One of :debug, :info, :warn, :error, or :fatal. :info
only_4xx Only reports errors for links that fall within the 4xx status code range. false
root_dir The absolute path to the directory serving your html-files. ""
typhoeus_config A JSON-formatted string. Parsed using JSON.parse and mapped on top of the default configuration values so that they can be overridden. {}
url_ignore An array of Strings or RegExps containing URLs that are safe to ignore. It affects all HTML attributes. Note that non-HTTP(S) URIs are always ignored. []
url_swap A hash containing key-value pairs of RegExp => String. It transforms URLs that match RegExp into String via gsub. {}
verbose If true, outputs extra information as the checking happens. Useful for debugging. Will be deprecated in a future release. false

In addition, there are a few "namespaced" options. These are:

  • :validation
  • :typhoeus / :hydra
  • :parallel
  • :cache

See below for more information.

Configuring HTML validation rules

If check_html is true, Nokogumbo performs additional validation on your HTML.

You can pass in additional options to configure this validation.

Option Description Default
report_eof_tags When check_html is enabled, HTML markup with mismatched tags are reported as errors false
report_invalid_tags When check_html is enabled, HTML markup that is unknown to Nokogumbo are reported as errors. false
report_mismatched_tags When check_html is enabled, HTML markup with tags that are malformed are reported as errors false
report_missing_doctype When check_html is enabled, HTML markup with missing or out-of-order DOCTYPE are reported as errors. false
report_missing_names When check_html is enabled, HTML markup that are missing entity names are reported as errors. false
report_script_embeds When check_html is enabled, script tags containing markup are reported as errors. false

For example:

opts = { :check_html => true, :validation => { :report_script_embeds => true } }

Configuring Typhoeus and Hydra

Typhoeus is used to make fast, parallel requests to external URLs. You can pass in any of Typhoeus' options for the external link checks with the options namespace of :typhoeus. For example:

HTMLProofer.new("out/", {:extension => ".htm", :typhoeus => { :verbose => true, :ssl_verifyhost => 2 } })

This sets HTMLProofer's extensions to use .htm, gives Typhoeus a configuration for it to be verbose, and use specific SSL settings. Check the Typhoeus documentation for more information on what options it can receive.

You can similarly pass in a :hydra option with a hash configuration for Hydra.

The default value is:

{
  :typhoeus =>
  {
    :followlocation => true,
    :connecttimeout => 10,
    :timeout => 30
  },
  :hydra => { :max_concurrency => 50 }
}

Setting before-request callback

You can provide a block to set some logic before an external link is checked. For example, say you want to provide an authentication token every time a GitHub URL is checked. You can do that like this:

proofer = HTMLProofer.check_directory(item, opts)
proofer.before_request do |request|
  request.options[:headers]['Authorization'] = "Bearer <TOKEN>" if request.base_url == "https://github.com"
end
proofer.run

The Authorization header is being set if and only if the base_url is https://github.com, and it is excluded for all other URLs.

Configuring Parallel

Parallel can be used to speed internal file checks. You can pass in any of its options with the options namespace :parallel. For example:

HTMLProofer.check_directories(["out/"], {:extension => ".htm", :parallel => { :in_processes => 3} })

In this example, :in_processes => 3 is passed into Parallel as a configuration option.

Configuring caching

Checking external URLs can slow your tests down. If you'd like to speed that up, you can enable caching for your external links. Caching simply means to skip links that are valid for a certain period of time.

You can enable caching for this log file by passing in the option :cache, with a hash containing a single key, :timeframe. :timeframe defines the length of time the cache will be used before the link is checked again. The format of :timeframe is a number followed by a letter indicating the length of time. For example:

  • M means months
  • w means weeks
  • d means days
  • h means hours

For example, passing the following options means "recheck links older than thirty days":

{ :cache => { :timeframe => '30d' } }

And the following options means "recheck links older than two weeks":

{ :cache => { :timeframe => '2w' } }

You can change the directory where the cachefile is kept by also providing the storage_dir key:

{ :cache => { :storage_dir => '/tmp/html-proofer-cache-money' } }

Links that were failures are kept in the cache and always rechecked. If they pass, the cache is updated to note the new timestamp.

The cache operates on external links only.

If caching is enabled, HTMLProofer writes to a log file called tmp/.htmlproofer/cache.log. You should probably ignore this folder in your version control system.

Caching with Travis

If you want to enable caching with Travis CI, be sure to add these lines into your .travis.yml file:

cache:
  directories:
  - $TRAVIS_BUILD_DIR/tmp/.htmlproofer

For more information on using HTML-Proofer with Travis CI, see this wiki page.

Logging

HTML-Proofer can be as noisy or as quiet as you'd like. If you set the :log_level option, you can better define the level of logging.

Custom tests

Want to write your own test? Sure, that's possible!

Just create a class that inherits from HTMLProofer::Check. This subclass must define one method called run. This is called on your content, and is responsible for performing the validation on whatever elements you like. When you catch a broken issue, call add_issue(message, line: line, content: content) to explain the error. line refers to the line numbers, and content is the node content of the broken element.

If you're working with the element's attributes (as most checks do), you'll also want to call create_element(node) as part of your suite. This constructs an object that contains all the attributes of the HTML element you're iterating on.

Here's an example custom test demonstrating these concepts. It reports mailto links that point to [email protected]:

class MailToOctocat < ::HTMLProofer::Check
  def mailto?
    return false if @link.data_proofer_ignore || @link.href.nil?
    @link.href.match /mailto/
  end

  def octocat?
    return false if @link.data_proofer_ignore || @link.href.nil?
    @link.href.match /[email protected]/
  end

  def run
    @html.css('a').each do |node|
      @link = create_element(node)
      line = node.line

      if mailto? && octocat?
        return add_issue("Don't email the Octocat directly!", line: line)
      end
    end
  end
end

See our list of third-party custom classes and add your own to this list.

Troubleshooting

Here are some brief snippets identifying some common problems that you can work around. For more information, check out our wiki.

Our wiki page on using HTML-Proofer with Travis CI might also be useful.

Ignoring SSL certificates

To ignore SSL certificates, turn off Typhoeus' SSL verification:

HTMLProofer.check_directory("out/", {
  :typhoeus => {
    :ssl_verifypeer => false,
    :ssl_verifyhost => 0}
}).run

User-Agent

To change the User-Agent used by Typhoeus:

HTMLProofer.check_directory("out/", {
  :typhoeus => {
    :headers => { "User-Agent" => "Mozilla/5.0 (compatible; My New User-Agent)" }
}}).run

Regular expressions

To exclude urls using regular expressions, include them between forward slashes and don't quote them:

HTMLProofer.check_directories(["out/"], {
  :url_ignore => [/example.com/],
}).run

Real-life examples

Project Repository Notes
Jekyll's website jekyll/jekyll A separate script calls htmlproofer and this used to be called from Circle CI
Raspberry Pi's documentation raspberrypi/documentation
Squeak's website squeak-smalltalk/squeak.org
Atom Flight Manual atom/flight-manual.atom.io
HTML Website Template fulldecent/html-website-template A starting point for websites, uses a Rakefile and Travis configuration to call preconfigured testing
Project Calico Documentation projectcalico/calico Simple integration with Jekyll and Docker using a Makefile
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].