synapta / cassandra-GLAM-tools

Licence: AGPL-3.0 license
Support GLAMs in monitoring and evaluating their cooperation with Wikimedia projects

Programming languages: JavaScript, HTML, CSS, Python, Smarty, PLpgSQL

The purpose of this project is to support GLAMs in monitoring and evaluating their cooperation with Wikimedia projects. Starting from a Wikimedia Commons category this tool collects data about usage, views, contributors, and topology of the files inside.

The GLAM Statistical Tool "Cassandra" is a project of Wikimedia Switzerland (WMCH) and the result of a long-term collaboration with Swiss cultural institutions that expressed the need to measure the impact of Wikimedia projects. Together with our GLAM Partner Network, we worked through requirements engineering and solution development with our IT partner Synapta. Since the first release in 2017, we have continuously enhanced Cassandra into the tool it is today.

In keeping the spirit of the Wikimedia movement alive and supporting the mission to make cultural knowledge freely accessible to the world, we aim to share Cassandra for the benefit of other GLAM institutions across the globe. We have already started to implement the strategy of a global roll-out and will foster the implementation in late 2021 and from 2022 onwards.

If you are interested in adopting Cassandra in your country, please contact us at Wikimedia Switzerland.

Tool architecture

This tool is based on a web application developed with Node.js (app/), a dashboarding system (Metabase), a recommendation script written in Python (recommender/), and two ETL pipelines designed to extract file usage statistics and views (etl/). Statistical data is stored in a PostgreSQL server (every GLAM is associated with a different database), while GLAM metadata is stored in MongoDB. File usage statistics are obtained from the Wikimedia Commons database replica available on Toolforge. For this reason, an SSH tunnel needs to be created between the server running Cassandra and Toolforge. File views are obtained by downloading and processing the mediacounts dataset. The following diagram summarizes the whole architecture of the tool.

[Figure: Cassandra architecture]

Requirements

Hardware requirements

Cassandra should be installed on a server with at least 2 CPUs and 2 GB of RAM, although 4 GB of RAM is strongly recommended. Disk usage depends on several factors, but a reasonable average estimate is 1 GB per 10k files. For example, for a Wikimedia Commons category with 100k files (or 10 GLAMs with 10k files each), a minimum of 20 GB of disk space is suggested (10 GB for the database, plus 10 GB for Ubuntu, the tool, and temporary files).
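The sizing rule above can be turned into a quick calculation. The following is only a rough sketch of that rule of thumb: the function name and its parameters are illustrative and are not part of Cassandra.

```python
import math


def estimate_disk_gb(total_files: int, base_gb: int = 10, files_per_gb: int = 10_000) -> int:
    """Suggest a minimum disk size in GB from the rule of thumb above.

    base_gb covers Ubuntu, the tool, and temporary files; the database is
    estimated at 1 GB per files_per_gb files, rounded up.
    """
    if total_files < 0:
        raise ValueError("total_files must be non-negative")
    database_gb = math.ceil(total_files / files_per_gb)
    return base_gb + database_gb


if __name__ == "__main__":
    # 100k files: 10 GB for the database + 10 GB for everything else.
    print(estimate_disk_gb(100_000))
```

Treat the result as a lower bound; actual usage depends on the factors mentioned above.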

Software requirements

The installation procedure has been tested with Ubuntu 20.04, Node.js 12, Python 3.8, PostgreSQL 12, MongoDB 3.6, and Java 11. The only requirement is an empty (and disposable) Ubuntu 20.04 machine, as all the dependencies will be installed automatically. You should be able to log in as root over SSH to start the installation. You should also have an active Toolforge account configured to accept connections using a passwordless SSH key. Further details on how to obtain one are available in the Toolforge Quickstart guide.

Installation

The installation procedure has been scripted using the automation tool Ansible. For this reason, you need to first install Ansible on your local machine (not on the remote machine where Cassandra will be installed). Please refer to the Installing Ansible guide.

On your local machine clone this repository:

git clone https://github.com/synapta/cassandra-GLAM-tools.git

In the deploy/ directory create an inventory.ini file similar to the following, where host.example.com is the hostname (or the IP address) of the remote machine:

[cassandra]
host.example.com

In the deploy/ directory create an id_rsa_cassandra file containing the SSH private key configured to log in to Toolforge.

Edit the file deploy/ansible.yml by setting appropriate values for the following variables:

  • postgres_password: a password of your choice for PostgreSQL;
  • admin_password: a password of your choice for the admin user of Cassandra;
  • user_password: a password of your choice for the non-admin user of Cassandra;
  • wmf_login: the username associated with your Toolforge account;
  • wmf_user: the database user associated with your Toolforge account;
  • wmf_password: the database password associated with your Toolforge account.

Please note that the Toolforge database user and password are available in the file replica.my.cnf in your Toolforge home directory.
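If you prefer not to copy the credentials by hand, they can be read programmatically. The sketch below assumes that replica.my.cnf is a MySQL-style option file with a [client] section containing user and password entries; check your own file, as this layout is an assumption. The function name and the sample values are made up for illustration.

```python
import configparser


def read_replica_credentials(text: str) -> dict:
    """Extract wmf_user and wmf_password from the content of replica.my.cnf.

    Assumes a [client] section with 'user' and 'password' keys.
    Interpolation is disabled so that '%' in a password is kept verbatim.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    client = parser["client"]
    return {
        # Option files may quote values, so strip surrounding quotes.
        "wmf_user": client["user"].strip("'\""),
        "wmf_password": client["password"].strip("'\""),
    }


# Made-up sample content, not real credentials.
SAMPLE = """\
[client]
user = u12345
password = not-a-real-password
"""

if __name__ == "__main__":
    print(read_replica_credentials(SAMPLE)["wmf_user"])
```

The returned values correspond to the wmf_user and wmf_password variables in deploy/ansible.yml.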

From the deploy/ directory, run the Ansible installation script:

ansible-playbook ansible.yml -i inventory.ini

This script will:

  • install the software requirements (e.g. Node.js, PostgreSQL, MongoDB);
  • create a passwordless glam user;
  • download the Cassandra tool;
  • install the required Node.js and Python packages;
  • set up the PostgreSQL server;
  • create an SSH tunnel to Toolforge;
  • install and start Metabase;
  • enable the periodic runs of the ETL pipelines and the recommender.

The tool will be available on port 8081 and Metabase on port 3000.
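To confirm that both services are listening after the playbook finishes, a simple TCP check may help. This is a minimal sketch using only the Python standard library; the function names are illustrative, and a successful connection only shows that the port accepts connections, not that the application behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_services(host: str) -> dict:
    """Check the two ports mentioned above: 8081 (Cassandra) and 3000 (Metabase)."""
    ports = {"cassandra": 8081, "metabase": 3000}
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, check_services("host.example.com") returns a dictionary with one boolean per service, where host.example.com stands for the host you set in inventory.ini.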

Create a Metabase administrator by following the guided procedure available at http://host.example.com:3000 (the host you set in the inventory.ini file). You will need to provide your name, surname, email, password, and company. When asked to add your data, select "I'll add my data later". After the Metabase setup is complete, enable the sharing feature: go to Settings (the gear icon), Admin, "Embedding in other Applications" and then select Enable. Take note of the embedding secret key provided by Metabase.

On the remote machine, edit the file /home/glam/cassandra-GLAM-tools/config/config.json and set the Metabase username (the email of the Metabase administrator), the associated password, and the embedding secret key.

After editing the configuration file, it is necessary to restart the tool:

supervisorctl restart cassandra

To enable the update and restart of the tool from the administrative area, it is necessary to add the following line at the end of the file that you can open with sudo visudo:

glam ALL=(ALL) NOPASSWD: /usr/bin/supervisorctl

You may need to revise the maximum number of categories and files allowed per GLAM, which are set in the configuration file. The default values are 1k categories and 500k files. If you add a GLAM that exceeds these limits, it will be neither processed nor displayed to users.

The Cassandra tool is now available at http://host.example.com:8081. For production use, it is strongly suggested to enable a firewall and to serve the external traffic with an encrypted connection, for example by installing and properly configuring NGINX. The Cassandra tool and Metabase must be associated with different domains. The Metabase URL should be set in the Cassandra configuration file.

Pontoon installation

Note: you do not need to install Pontoon; instead, use https://pontoon.wikimedia.swiss

This is the guide we followed to install that instance.

An Ansible script to install Pontoon is available in the deploy directory.

In the deploy/ directory create an inventory.ini file similar to the following, where host.example.com is the hostname (or the IP address) of the remote machine:

[pontoon]
host.example.com

This host can be the same machine where Cassandra is installed or another machine of your choice. Edit the file pontoon.yml and set the postgres_password. This will be the password of the PostgreSQL pontoon user. Then run the Ansible install script:

ansible-playbook pontoon.yml -i inventory.ini

Edit the file /home/glam/pontoon/.env and update the SITE_URL to a real domain. Install and configure nginx to proxy that website to http://localhost:8000. You will need to obtain an HTTPS certificate, for example with Certbot.

Create the first Pontoon administrator in /home/glam/pontoon:

pipenv run python manage.py createsuperuser

Create a new GitHub user and associate it with an SSH key. Give that user write permission on this repository. Save the SSH key in the directory /home/glam/.ssh. Finally, restart Pontoon:

supervisorctl restart pontoon

Update

To update Cassandra to the latest version available in its repository, for example after the release of a new version or the creation of a new translation, log in to the machine that hosts the tool. Then, it is recommended to become the glam user:

sudo su glam

Change the current directory to the directory where the tool is located and update the codebase from GitHub:

cd /home/glam/cassandra-GLAM-tools && git pull

Finally, it is necessary to restart the tool:

supervisorctl restart cassandra

An easier way to update the tool is to click the "update tool" button available in the Cassandra "Control Panel" (see below).
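The manual update steps described in this section can also be wrapped in a small script. The following is a sketch only: it assumes it is run as the glam user on the machine hosting the tool, and the function names and the dry_run flag are illustrative. By default it only returns the commands it would run.

```python
import subprocess
from typing import List

# Path used in the update instructions above.
TOOL_DIR = "/home/glam/cassandra-GLAM-tools"


def update_commands(tool_dir: str = TOOL_DIR) -> List[List[str]]:
    """Return the update steps as argument lists: pull the code, then restart."""
    return [
        ["git", "-C", tool_dir, "pull"],  # same effect as cd <dir> && git pull
        ["supervisorctl", "restart", "cassandra"],
    ]


def run_update(tool_dir: str = TOOL_DIR, dry_run: bool = True) -> List[str]:
    """Run (or, with dry_run=True, only list) the update commands in order."""
    executed = []
    for command in update_commands(tool_dir):
        executed.append(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)
    return executed


if __name__ == "__main__":
    for line in run_update(dry_run=True):
        print(line)
```

Call run_update(dry_run=False) to actually execute the steps; the restart is skipped if git pull fails.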

Usage

To enter the "Control Panel" use the path /admin/panel. Log in as the user "admin" with the admin_password of your choice. To create a new GLAM, select "Add new GLAM". You will need to provide a GLAM ID (this will be the name of the PostgreSQL database), a GLAM full name, the corresponding Wikimedia Commons category, and the URL of an image representing the GLAM.

The pipeline extracting file usage statistics is set to run every 5 minutes. You can check its logs by reading the file /var/log/cassandra/etl.log. When this process is completed, the GLAM is ready to be shown to users. However, the list of available GLAMs is updated by the web application every hour. If you don't want to wait, you can manually restart the tool.
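To look at the most recent entries of the ETL log without opening the whole file, something like the following could be used. It is a minimal sketch that makes no assumption about the log format; the function name is illustrative.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail(path: str, lines: int = 20) -> List[str]:
    """Return the last `lines` lines of a text file, without trailing newlines."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

For example, tail("/var/log/cassandra/etl.log", 50) returns the last 50 lines of the log mentioned above.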

The views and the suggestions are still empty at this point because they are populated by two processes that are set to run every night. You can customize the timings by editing the crontab available at /etc/cron.d/cassandra. They both write their logs in the directory /var/log/cassandra. If you want, you can also run them manually, but be advised that the view pipeline may be slow the first time. The default setting is to load the views of the last 10 days only. If needed, you can edit this value by modifying the GLAM metadata in MongoDB. You will need to create a field min_date associated with a string like "2015-01-01".
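As an illustration of the min_date field, the sketch below validates a date string and builds a MongoDB update document for it. The function name is hypothetical. The names of the database and collection holding the GLAM metadata, and the filter that selects a GLAM, are not documented here, so they are deliberately left out.

```python
from datetime import datetime


def min_date_update(value: str) -> dict:
    """Build a MongoDB update document that sets min_date.

    The value must be a date such as "2015-01-01"; it is stored as a string,
    as described above. Raises ValueError for anything that does not parse.
    """
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    return {"$set": {"min_date": parsed.isoformat()}}


if __name__ == "__main__":
    # With pymongo, this dict could be passed as the update argument of
    # Collection.update_one, together with a filter selecting the GLAM.
    print(min_date_update("2015-01-01"))
```

Validating the string first avoids storing a value that the view pipeline might not be able to parse.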

From the "Control Panel" in the "settings" page, it is possible to upload a new owner logo, set a custom home slogan and change the default language. To apply these settings it is necessary to restart the tool with the "update tool" button.

Localization

The Cassandra tool can be localized in any language. Currently supported languages are available in the directory app/locales. A direct way to add a new language is to commit the corresponding file to this repository, using the same format as the en file.

For managing translations in a more user-friendly way, it is possible to rely on the Pontoon tool created by Mozilla Foundation. A public instance of Pontoon for translating Cassandra is available at https://pontoon.wikimedia.swiss.

Create a new translation user

New users can only be created by an administrator from the interface available at /a/auth/user/.

Add a brand new language

A new language can be added by editing the project settings available at /admin/projects/cassandra-glam-tools/. Move the desired language from “Available” to “Localizable”, then save the project. Finally, it is necessary to synchronize the project as described in the following section.

Translation into production

As an admin user you can send the current translation corpus into production by clicking the grey "sync" button at the end of the project settings page. If everything went well, you will see a new commit on the GitHub repository. The translation is now published; to make it visible, update your Cassandra installation, either with the "update tool" button in the Cassandra "Control Panel" or by following the steps described in the "Update" section.
