web-monitoring-processing
A component of the EDGI Web Monitoring Project.
Overview of this component's tasks
This component is intended to hold various backend tools serving different tasks:
- Query external sources of captured web pages (e.g. Internet Archive, Page Freezer, Sentry), and formulate a request for importing their version and page metadata into web-monitoring-db.
- Query web-monitoring-db for new Changes, analyze them in an automated pipeline to assign priority and/or filter out uninteresting ones, and submit this information back to web-monitoring-db.
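The second task (score changes, filter out uninteresting ones, keep the rest in priority order) can be illustrated with a small sketch. Everything in it is hypothetical: the `Change` class, its fields, the word-difference score, and the `threshold` parameter are stand-ins for illustration and are not the web-monitoring-db schema or this package's actual analysis code.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical stand-in for a Change record (not the real schema)."""
    uuid: str
    old_text: str
    new_text: str


def priority(change):
    """Toy score in [0, 1]: share of distinct words that differ between versions."""
    old_words = set(change.old_text.split())
    new_words = set(change.new_text.split())
    union = old_words | new_words
    if not union:
        # Two empty versions: nothing changed.
        return 0.0
    return len(old_words ^ new_words) / len(union)


def analyze(changes, threshold=0.1):
    """Score each change, drop those below `threshold`, sort highest first."""
    scored = [(change.uuid, priority(change)) for change in changes]
    kept = [(uuid, score) for uuid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

A real pipeline would replace the toy score with its own analysis and would send the resulting pairs back to web-monitoring-db instead of returning them.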
Development status
Working and Under Active Development:
- A Python API to the web-monitoring-db Rails app in web_monitoring.db
- Python functions and a command-line tool for importing snapshots from the Internet Archive into web-monitoring-db.
Legacy projects that may be revisited:
- Example HTML providing useful test cases.
Installation Instructions
- Get Python 3.7. This package makes use of modern Python features and requires Python 3.7+. If you don't have Python 3.7, we recommend using conda to install it. (You don't need admin privileges to install or use it, and it won't interfere with any other installations of Python already on your system.)
- Install libxml2 and libxslt. (This package uses lxml, which requires your system to have the libxml2 and libxslt libraries.)
On MacOS, use Homebrew:
brew install libxml2
brew install libxslt
On Debian Linux:
apt-get install libxml2-dev libxslt-dev
On other systems, the packages might have slightly different names.
- Install the package.
pip install -r requirements.txt
python setup.py develop
- Copy the script .env.example to .env and supply any local configuration info you need. (Only some of the package's functionality requires this.) Apply the configuration:
source .env
- See module comments and docstrings for more usage information. Also see the command line tool wm, which is installed with the package. For help, use:
wm --help
- To run the tests or build the documentation, first install the development requirements:
pip install -r requirements-dev.txt
- To build the docs:
cd docs
make html
- To run the tests:
python run_tests.py
Any additional arguments are passed through to py.test.
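A note on the configuration step above: `source .env` sets the values only in the shell session where it is run. If you start Python some other way (for example from an editor or a notebook), one option is to read the file yourself. The sketch below is a minimal illustration, not part of this package. It assumes the file contains simple `KEY=VALUE` lines, optionally prefixed with `export`, and it does not handle shell expansion or multi-line values. The function names are hypothetical.

```python
import os


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines (optionally starting with `export`) into a dict."""
    values = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            # Not an assignment; ignore it.
            continue
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path=".env", environ=os.environ):
    """Read `path` and add its settings to `environ` without overwriting existing keys."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_lines(handle)
    for key, value in values.items():
        environ.setdefault(key, value)
    return values
```

Calling `load_env_file()` before using the package would make the settings visible through `os.environ`. Values already present in the environment take precedence, so a sourced shell still wins over the file.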
Releases
We try to make sure the code in this repo's main branch is always in a stable, usable state, but occasionally coordinated functionality may be written across multiple commits. If you are depending on this package from another Python program, you may wish to install from the release branch instead:
$ pip install git+https://github.com/edgi-govdata-archiving/web-monitoring-processing@release
You can also list the git+https: URL above in a pip requirements file.
We usually create merge commits on the release branch that note the PRs included in the release or any other relevant notes (e.g. Release #302 and #313).
Code of Conduct
This repository falls under EDGI's Code of Conduct.
Contributors
This project wouldn't exist without a lot of amazing people's help. Thanks to the following for all their contributions! See our contributing guidelines to find out how you can help.
(For a key to the contribution emoji or more info on this format, check out "All Contributors.")
License & Copyright
Copyright (C) 2017-2021 Environmental Data and Governance Initiative (EDGI)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the LICENSE file for details.