peterk / munin-indexer

License: GPL-3.0
A social media open post web archiving tool

Programming Languages

javascript
184084 projects - #8 most used programming language
HTML
75241 projects
CSS
56736 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to munin-indexer

warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
Stars: ✭ 48 (+200%)
Mutual labels:  archiving, preservation, webarchiving, high-fidelity-preservation
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+225%)
Mutual labels:  archiving, webarchiving
Jarchivelib
A simple archiving and compression library for Java
Stars: ✭ 162 (+912.5%)
Mutual labels:  archiving
checkit tiff
"checkit_tiff" is an incredibly fast conformance checker for baseline TIFFs (with various extensions)
Stars: ✭ 14 (-12.5%)
Mutual labels:  preservation
Archiveis
A simple Python wrapper for the archive.is capturing service
Stars: ✭ 140 (+775%)
Mutual labels:  archiving
Archivebot
ArchiveBot, an IRC bot for archiving websites
Stars: ✭ 218 (+1262.5%)
Mutual labels:  archiving
rscplus
RuneScape Classic client mod & preservation platform
Stars: ✭ 29 (+81.25%)
Mutual labels:  preservation
dbptk-ui
DBPTK base UI for both Desktop and Enterprise
Stars: ✭ 20 (+25%)
Mutual labels:  preservation
islandora vagrant
Islandora testing and development environment
Stars: ✭ 36 (+125%)
Mutual labels:  preservation
chronicle-etl
📜 A CLI toolkit for extracting and working with your digital history
Stars: ✭ 78 (+387.5%)
Mutual labels:  archiving
anchorage
Save your bookmark collection in the Internet Archive, or locally.
Stars: ✭ 19 (+18.75%)
Mutual labels:  archiving
Reprozip
ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
Stars: ✭ 231 (+1343.75%)
Mutual labels:  archiving
savepagenow
A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
Stars: ✭ 140 (+775%)
Mutual labels:  archiving
Pdf Archiver
A tool for tagging files and archiving tasks.
Stars: ✭ 182 (+1037.5%)
Mutual labels:  archiving
deptoolkit
The Toolkit API, app, and browser extension. Start preserving now.
Stars: ✭ 40 (+150%)
Mutual labels:  archiving
Wikipedia Mirror
🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump
Stars: ✭ 160 (+900%)
Mutual labels:  archiving
irc-docs
Collected IRC protocol documentation
Stars: ✭ 47 (+193.75%)
Mutual labels:  archiving
Unifiedarchive
UnifiedArchive - an archive manager with a unified way for different formats. Supports all basic (listing, reading, extracting and creation) and specific features (compression level, password-protection). Bundled with console program for working with archives.
Stars: ✭ 246 (+1437.5%)
Mutual labels:  archiving
wail
🐋 One-Click User Instigated Preservation
Stars: ✭ 107 (+568.75%)
Mutual labels:  high-fidelity-preservation
jupyter-archive
A Jupyter/Jupyterlab extension to make, download and extract archive files.
Stars: ✭ 57 (+256.25%)
Mutual labels:  archiving

Munin - a social media archiver

This tool monitors open (public) Facebook, Instagram and VKontakte accounts, added as seeds, for new posts and archives those posts. Posts are archived in the WARC file format using the excellent Squidwarc package. A playback tool and a simple dashboard are available for viewing and monitoring collections.

Munin dashboard screenshot

System overview

Munin builds on great software by other people. Indexing of post items is done with snscrape. Archiving of individual pages is done with Squidwarc. Playback of WARC files is provided by pywb.

System overview: a Django application manages seeds and post URLs in a PostgreSQL database. An indexing queue finds new post URLs for the seeds. An archiving queue makes sure those post URLs are archived.
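The flow in the overview can be sketched in a few lines of Python. This is a toy in-memory model, not Munin's actual code: `run_pipeline`, `find_post_urls` and `archive` are hypothetical names, with the two callables standing in for the snscrape indexing step and the Squidwarc archiving step. The real application keeps seeds and post URLs in PostgreSQL rather than in Python collections.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def run_pipeline(
    seeds: Iterable[str],
    find_post_urls: Callable[[str], Iterable[str]],  # hypothetical indexer (snscrape's role)
    archive: Callable[[str], None],                  # hypothetical archiver (Squidwarc's role)
) -> list[str]:
    """Index every seed, then archive each newly discovered post URL once."""
    index_queue = deque(seeds)   # seeds waiting to be indexed
    archive_queue = deque()      # post URLs waiting to be archived
    seen = set()                 # stands in for the post URL table
    archived = []

    # Indexing queue: find post URLs for each seed.
    while index_queue:
        seed = index_queue.popleft()
        for url in find_post_urls(seed):
            if url not in seen:  # only queue posts not seen before
                seen.add(url)
                archive_queue.append(url)

    # Archiving queue: make sure every queued post URL gets archived.
    while archive_queue:
        url = archive_queue.popleft()
        archive(url)
        archived.append(url)
    return archived
```

Keeping the two queues separate means a slow archiving step does not hold up the discovery of new posts, which is presumably why the overview describes them as distinct queues.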

Install

  1. To run Munin you need Docker and Docker Compose installed. So far it has only been tested on Linux and macOS, and the instructions below are for those platforms. Make sure you also have git installed.

  2. Clone this repository

$ git clone https://github.com/peterk/munin-indexer

  3. Enter the directory and create an empty data directory for PostgreSQL

$ cd munin-indexer
$ mkdir data

  4. Set up environment variables

Rename the example_env_file to env_file and update it with your settings. You should change the time zone (TZ) to match your location, using a name from the tz database list of time zone names.
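If you prefer to script this step, a minimal sketch follows. It assumes env_file uses the plain KEY=VALUE line format common for Docker env files; the contents of example_env_file are not shown here, so the sample text in the demo is hypothetical, as is the helper name `set_env_value`.

```python
def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return env_text with key set to value, appending the key if it is absent."""
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        # Skip comments and lines that are not KEY=VALUE pairs.
        if line.strip().startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            break
    else:
        # No existing entry was found, so add one at the end.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample content; the real example_env_file may differ.
sample = "# settings\nTZ=UTC\n"
print(set_env_value(sample, "TZ", "Europe/Stockholm"))

# With real files, from inside the munin-indexer directory:
#   from pathlib import Path
#   text = Path("example_env_file").read_text()
#   Path("env_file").write_text(set_env_value(text, "TZ", "Europe/Stockholm"))
```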

Start everything:

$ docker-compose up -d

The first time the application starts, it can take several minutes before it becomes available. You can monitor progress by watching the Docker logs, for example with docker-compose logs -f.
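Instead of watching the logs, you can poll the dashboard address until it answers. The sketch below is one way to do that with the Python standard library. `http_responds` and `wait_until_up` are hypothetical helper names, and the ten-minute default timeout is an assumption, not a documented startup time.

```python
import time
import urllib.error
import urllib.request


def http_responds(url: str) -> bool:
    """Return True if a server answers at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        return True   # the server is up; it just returned a 4xx/5xx status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


def wait_until_up(url, timeout=600.0, interval=5.0, probe=http_responds) -> bool:
    """Call probe(url) every interval seconds until it succeeds or timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() + interval > deadline:
            return False  # the next attempt would fall after the deadline
        time.sleep(interval)


# Example, using the admin address given in this README:
#   wait_until_up("http://0.0.0.0:4444/admin")
```

The `probe` parameter exists only so the waiting logic can be exercised without a running server.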

Set up a superuser when the application is up (it will ask you for details to create an administrator):

$ docker-compose exec web python manage.py createsuperuser

Log in to the admin dashboard with the newly created superuser at http://0.0.0.0:4444/admin

Start by adding your first Collection item in the admin interface. Then add one or more seed URLs to the collection (e.g. https://www.facebook.com/visitberlin/). You can bulk add multiple seeds (one per line) from the dashboard.
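When preparing a long seed list for the bulk form, it can help to tidy it first. The helper below is a hypothetical convenience, not part of Munin: it trims whitespace, drops blank lines and removes duplicates while keeping the original order, then returns one seed per line. It does not check whether Munin supports a given URL.

```python
def clean_seeds(raw: str) -> str:
    """Return the seed URLs in raw, one per line, trimmed and de-duplicated."""
    seen = set()
    seeds = []
    for line in raw.splitlines():
        url = line.strip()
        if url and url not in seen:  # skip blanks and repeats
            seen.add(url)
            seeds.append(url)
    return "\n".join(seeds)


# The second and third entries are a duplicate and a made-up example account.
raw = """
https://www.facebook.com/visitberlin/
https://www.facebook.com/visitberlin/
https://vk.com/example
"""
print(clean_seeds(raw))
```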

After a couple of minutes the crawler should have discovered public posts and archived them. You can monitor the dashboard for new items added to the collection. Clicking the play icon will open the archived page. All archived pages are available for playback from http://0.0.0.0:4445/munin/
