OONI backend

Welcome. This document describes the architecture of the main components of the OONI infrastructure.

The documentation is meant for core contributors, external contributors, and researchers who want to extract data or reuse software components in their own projects.

This file is rendered here

You can also explore the documentation tree

Table of contents

[TOC]

Architecture

The backend infrastructure provides multiple functions:

  • Provide APIs for data consumers
  • Instruct probes on what measurements to perform
  • Receive measurements from probes, process them and store them in the database and on S3

Data flow

This diagram represents the main flow of measurement data:

blockdiag {
  Probes [color = "#ffeeee"];
  Explorer [color = "#eeeeff"];
  Probes -> "API: Probe services" -> "Fastpath" -> "DB: fastpath table" -> "API: Measurements" -> "Explorer";
}

Each measurement is processed individually in real time.
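
To make this concrete, the sketch below shows, in miniature, what per-measurement processing looks like: parse the uploaded JSON, compute scores, and build the row destined for the fastpath table. This is a minimal illustration; all names and the scoring rule are assumptions, not the actual fastpath code.

# A minimal, self-contained sketch of the per-measurement flow described
# above. All names and the scoring rule are illustrative assumptions,
# not the actual fastpath implementation.
import json

def score_measurement(msmt: dict) -> dict:
    # Toy scoring rule: flag the measurement when the probe reported
    # a blocking condition. The real fastpath applies per-test logic.
    blocked = bool(msmt.get("test_keys", {}).get("blocking"))
    return {"blocking_general": 1.0 if blocked else 0.0}

def process(raw: bytes) -> dict:
    # Parse one uploaded measurement, score it, and build the row
    # that would be written to the fastpath table.
    msmt = json.loads(raw)
    return {
        "report_id": msmt.get("report_id"),
        "test_name": msmt.get("test_name"),
        "scores": score_measurement(msmt),
    }

if __name__ == "__main__":
    raw = b'{"report_id": "r1", "test_name": "web_connectivity", "test_keys": {"blocking": "dns"}}'
    print(process(raw))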

Components: API

The API entry points are documented at apidocs

Measurements

Provide access to measurements to end users directly and through Explorer.

Mounted under /api/v1/measurement/

The API is versioned. Access is rate limited based on source IP address and access tokens due to the computational cost of running heavy queries on the database.
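
For illustration, a query against the measurements API might look like the sketch below. The api.ooni.io host, the /api/v1/measurements list endpoint, and its query parameters are assumptions based on the public deployment; the apidocs linked above are authoritative.

# Illustrative query for recent measurements. Endpoint and parameters
# are assumptions; consult the apidocs for the real reference.
import requests

resp = requests.get(
    "https://api.ooni.io/api/v1/measurements",
    params={"probe_cc": "IT", "test_name": "web_connectivity", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("measurement_start_time"), result.get("input"))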

Sources

Probe services

Serves lists of collectors and test helpers to the probes and receives measurements from them.

Mounted under /api/v1/
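
As a sketch, a probe-side request for the test helper list might look like this; the endpoint name and host are assumptions, see the Sources link below for the real definitions.

# Illustrative probe-side request for test helpers. The endpoint name
# and host are assumptions about the public deployment.
import requests

resp = requests.get("https://api.ooni.io/api/v1/test-helpers", timeout=30)
resp.raise_for_status()
print(resp.json())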

Sources

Private entry points

Not for public consumption. Mounted under /api/_ and used exclusively by Explorer

Sources

Fastpath

Documentation

Database

Operations

Build, deploy, rollback

Host deployments are done with the sysadmin repo

For component updates a deployment pipeline is used:

Look at the Status dashboard - be aware of badge image caching

Use the deploy tool:

# Update all badges:
dep refresh_badges

# Show status
dep

# Deploy/rollback a given version on the "test" stage
deploy ooni-api test 0.6~pr194-147

# Deploy latest build on the first stage
deploy ooni-api

# Deploy latest build on a given stage
deploy ooni-api prod

Adding new tests

Update database_upgrade_schema

ALTER TYPE ootest ADD VALUE '<test_name>';

Update fastpath by adding a new test to the score_measurement function and adding relevant integration tests.
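
The shape of such an addition might resemble the sketch below. This is a hypothetical illustration: the function and field names are assumptions, and the real score_measurement dispatch lives in the fastpath sources.

def score_my_new_test(msmt: dict) -> dict:
    # Hypothetical per-test scorer: inspect test_keys and emit score flags.
    failure = msmt.get("test_keys", {}).get("failure")
    return {"blocking_general": 1.0 if failure is not None else 0.0}

def score_measurement(msmt: dict) -> dict:
    # Dispatch on test_name; a new test needs a branch here plus
    # integration tests exercising it.
    if msmt.get("test_name") == "my_new_test":
        return score_my_new_test(msmt)
    raise NotImplementedError(msmt.get("test_name"))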

Create a Pull Request

Run fastpath manually from S3 on the testing stage; see Rerun fastpath manually below.

Update the API.

Adding new fingerprints

TODO

API runbook

Monitor the API and fastpath dashboards.

Follow Nginx or API logs with:

sudo journalctl -f -u nginx --no-hostname
# The API logs contain SQL queries, exceptions etc
sudo journalctl -f --identifier gunicorn3 --no-hostname

Fastpath runbook

Manual deployment

ssh <host>
sudo apt-get update
apt-cache show fastpath | grep Ver | head -n5
sudo apt-get install fastpath

Restart

sudo systemctl restart fastpath

Rerun fastpath manually

Run as the fastpath user:

ssh <host>
sudo sudo -u fastpath /bin/bash
cd
fastpath --help
# rerun without overwriting files on disk nor writing to database:
fastpath --start-day 2016-05-13 --end-day 2016-05-14 --stdout --no-write-msmt --no-write-to-db
# rerun without overwriting files on disk:
fastpath --start-day 2016-05-13 --end-day 2016-05-14 --stdout --no-write-msmt
# rerun and overwrite:
fastpath --start-day 2016-05-13 --end-day 2016-05-14 --stdout --update

The fastpath will pull cans from S3. The daemon (doing real-time processing) can keep running in the meantime.
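
For reference, listing the cans for one day might look like the sketch below, assuming anonymous access to a public ooni-data bucket with a canned/<date>/ key layout (both are assumptions; adjust to the actual deployment).

# Illustrative listing of raw measurement "cans" on S3 for one day.
# Bucket name and key prefix are assumptions about the public layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="ooni-data", Prefix="canned/2016-05-13/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])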

Progress chart

Log monitoring

sudo journalctl -f -u fastpath

Monitoring dashboard

https://mon.ooni.nu/grafana/d/75nnWVpMz/fastpath-ams-pg?orgId=1&refresh=5m&from=now-7d&to=now

Analysis runbook

The Analysis tool runs a number of systemd timers to monitor the slow query summary and more. See https://github.com/ooni/pipeline/blob/master/af/analysis/analysis/analysis.py

Manual deployment

ssh <host>
sudo apt-get update
apt-cache show analysis | grep Ver | head -n5
sudo apt-get install analysis=<version>

Run manually

sudo systemctl restart ooni-update-counters.service

Log monitoring

sudo journalctl -f --identifier analysis

Monitoring dashboard

https://mon.ooni.nu/grafana/d/75nnWVpMz/fastpath-ams-pg?orgId=1&refresh=5m&from=now-7d&to=now

Deploy new host

Deploy host from https://cloud.digitalocean.com/projects/

Create DNS "A" record <name>.ooni.org at https://ap.www.namecheap.com/

In the ansible directory of the sysadmin repo, add the host to the inventory.

Run the deploy with the root SSH user:

./play deploy-<foo>.yml -l <name>.ooni.org --diff -u root

Update Prometheus:

./play deploy-prometheus.yml -t prometheus-conf --diff
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].