alphagov / govuk_crawler_worker

Licence: MIT
A worker that will consume GOV.UK URLs from a message queue and crawl them, saving the output to disk

Programming Languages

Go: 31211 projects (#10 most used programming language)
Makefile: 30231 projects

Labels

govuk

Projects that are alternatives to or similar to govuk_crawler_worker

Smart Answers
Serves smart answers on GOV.UK
Stars: ✭ 148 (+770.59%)
Mutual labels:  govuk
gds-nodejs-boilerplate
A Node.js project boilerplate for production apps
Stars: ✭ 18 (+5.88%)
Mutual labels:  govuk
slimmer
Templating Rack middleware, injects standard header/footer and GOV.UK Components
Stars: ✭ 30 (+76.47%)
Mutual labels:  govuk
Styleguides
GOV.UK coding standards and guidelines for other tools we use
Stars: ✭ 179 (+952.94%)
Mutual labels:  govuk
Frontend
Serves the homepage, transactions and some index pages on GOV.UK
Stars: ✭ 234 (+1276.47%)
Mutual labels:  govuk
search-api
Search API for GOV.UK
Stars: ✭ 21 (+23.53%)
Mutual labels:  govuk
Static
GOV.UK static files and resources
Stars: ✭ 100 (+488.24%)
Mutual labels:  govuk
sketch wireframing kit
Quick Sketchapp wireframing tool for UK government digital services
Stars: ✭ 74 (+335.29%)
Mutual labels:  govuk
govuk-components
Lightweight components for developing with the GOV.UK Design System
Stars: ✭ 84 (+394.12%)
Mutual labels:  govuk
publishing-api
API to publish content on GOV.UK
Stars: ✭ 29 (+70.59%)
Mutual labels:  govuk
Router
Router in front of GOV.UK to proxy to backend servers on the single domain
Stars: ✭ 181 (+964.71%)
Mutual labels:  govuk
Govuk React
An implementation of the GOV.UK Design System in React using CSSinJS
Stars: ✭ 219 (+1188.24%)
Mutual labels:  govuk
govuk-terraform-provisioning
**DEPRECATED** Terraform configuration and utilities to provision parts of the GOV.UK AWS Infrastructure
Stars: ✭ 17 (+0%)
Mutual labels:  govuk
Magna Charta
Accessible, useful, beautiful bar charts from HTML tables.
Stars: ✭ 151 (+788.24%)
Mutual labels:  govuk
content-data-api
Data warehouse that stores content and content metrics to help content owners measure and improve content on GOV.UK
Stars: ✭ 13 (-23.53%)
Mutual labels:  govuk
Govuk Puppet
Puppet manifests used to provision the main GOV.UK web stack
Stars: ✭ 109 (+541.18%)
Mutual labels:  govuk
smokey
Smoke tests for GOV.UK
Stars: ✭ 42 (+147.06%)
Mutual labels:  govuk
finder-frontend
Serves finder and search pages for GOV.UK
Stars: ✭ 15 (-11.76%)
Mutual labels:  govuk
publisher
Publishes mainstream content on GOV.UK
Stars: ✭ 42 (+147.06%)
Mutual labels:  govuk
collections
Serves GOV.UK navigation pages, browse, topic, step-by-steps & services and information pages.
Stars: ✭ 32 (+88.24%)
Mutual labels:  govuk

GOV.UK Crawler Worker

This is a worker that will consume GOV.UK URLs from a message queue and crawl them, saving the output to disk.

Requirements

To run this worker you will need, at a minimum:

  - A working Go development setup (the worker is built using go build)
  - A running RabbitMQ instance, which provides the message queue and exchange that the worker consumes from and publishes to

Development

You can run the tests locally by running make.

This project uses Godep to manage its dependencies. If you have a working Go development setup, you should be able to install Godep by running:

go get github.com/tools/godep
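
With Godep installed, a typical next step is to restore the vendored dependencies and run the tests through Godep. This is a hedged sketch of standard Godep usage rather than project-specific instructions; running make, as described above, may already cover the same ground:

    godep restore
    godep go test ./...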

Running

To run the worker you'll first need to build it using go build to generate a binary. You can then run the built binary directly using ./govuk_crawler_worker. All configuration is injected using environment variables; for details, look at the main.go file.
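
As a rough sketch, building and running the worker might look like the following. The environment variable name and value here are illustrative assumptions, not documented configuration; main.go is the authoritative list of what the worker actually reads:

    go build
    AMQP_ADDRESS="amqp://guest:guest@localhost:5672/" ./govuk_crawler_worker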

How it works

This is a message queue worker that consumes URLs from a queue, crawls them and saves the output to disk. While this is the worker's main purpose, it carries out a few additional steps before each page is written to disk.

Workflow

The workflow for the worker can be defined as the following set of steps:

  1. Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
  2. Crawl the received URL
  3. Write the body of the crawled URL to disk
  4. Extract any matching URLs from the HTML body of the crawled URL
  5. Publish the extracted URLs to the worker's own exchange
  6. Acknowledge that the URL has been crawled
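
As step 1 suggests, the message body appears to be nothing more than the URL to crawl. The following is a hedged illustration of feeding a URL into the workflow with rabbitmqadmin; the exchange name is described in the next section, but the routing key shown is an assumption and depends on how the worker's queue is bound:

    rabbitmqadmin publish exchange=govuk_crawler_exchange routing_key="#" payload="https://www.gov.uk/bank-holidays"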

The Interface

The public interface for the worker is the exchange labelled govuk_crawler_exchange. When the worker starts, it creates this exchange and binds it to its own queue for consumption.

If you provide user credentials for RabbitMQ that aren't on the root vhost /, you may wish to bind a global exchange yourself for easier publishing by other applications.
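
One way to do that, sketched here with rabbitmqadmin purely for illustration; the vhost, exchange name, exchange type and binding key are all assumptions to be replaced with your own values:

    rabbitmqadmin --vhost=/crawler declare exchange name=global_crawler_exchange type=topic
    rabbitmqadmin --vhost=/crawler declare binding source=global_crawler_exchange destination=govuk_crawler_exchange destination_type=exchange routing_key="#"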
