wri / Aqueduct30Docker


Alpha version, not released

Binder

Tools

not released to public yet

Data

Input data, processed data and final data can all be found on AWS S3:
s3://wri-projects/Aqueduct30

The Water Risk Atlas final data (Annual, Monthly):
s3://wri-projects/Aqueduct30/finalData/Y2019M01D14_RH_Aqueduct_Results_V01

Country rankings final data:
s3://wri-projects/Aqueduct30/finalData/Y2019M04D15_RH_GA_Aqueduct_Results_V01
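
For example, you can list the contents of the annual results folder with the AWS CLI (assuming your AWS credentials are configured; see the setup section below):

aws s3 ls s3://wri-projects/Aqueduct30/finalData/Y2019M01D14_RH_Aqueduct_Results_V01/ --recursive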

FAQ:

Q:
In the file, some geo units (by string_id) have an indicator labeled "awr". Could you explain what that is?
A:
awr in Aqueduct 3.0 stands for aggregated water risk. There are four options: tot (total), qan (quantity), qal (quality) and rrr (regulatory and reputational). In combination with an industry weighting scheme (see the technical note), these represent the aggregated water risk. awr tot is also referred to as "overall water risk".
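
As an illustration, a minimal pandas sketch for pulling the overall water risk rows out of a results table; the column names indicator, group, weighting_scheme and score below are hypothetical placeholders based on the description above, not confirmed field names in the csv:

import pandas as pd

df = pd.read_csv("annual_normalized.csv")  # hypothetical local copy of the annual results
# Overall water risk = aggregated water risk (awr) for the 'tot' group, per weighting scheme.
overall = df[(df["indicator"] == "awr") & (df["group"] == "tot")]
print(overall[["string_id", "weighting_scheme", "score"]].head())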

Q:
If we look at unique string_ids, why does master_geom.shp have 68,511 units while annual_normalized.csv only has 68,365? For example, the unit (string_id: None-ALA.13_1-None) is not in the csv file.
A:
"None-ALA.13_1-None" means that it is not part of a hydroBasin nor a groundwater aquifer. It is part of the GADM level 0 (usually country) of Åland. For Åland, we don't have any country information either. We used an inner join, leading to the different shapes of the data.

Q:
The number of indicators each geo unit (by string_id) has is not always the same. Some have 14 (e.g., 434823-CHN.16_1-1626), some 13 (e.g., 296905-SAU.13_1-None), 12 (e.g., 524050-None-2096), 4 (e.g., None-AGO.2_1-2691)… Could you explain why that is the case?
A:
This depends on data availability. The string_id uses the format hydrobasinID-GADM0ID-WHYMAPID. "None" is used when a geometry is not part of the associated geometry. The numbers that you specify are, however, different from what I found. For each string_id and weighting_scheme (industry), there is a maximum of 13 indicators + 3 grouped aggregated water risk scores + 1 total aggregated water risk score. Hence the maximum is 17.

434823-CHN.16_1-1626: (17, 10)
296905-SAU.13_1-None: (16, 10)
None-AGO.2_1-2691: (7, 10)
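
A small sketch of decomposing a string_id into its three components, based on the format described above (this assumes the individual ids never contain a hyphen themselves, which holds for the examples shown):

def split_string_id(string_id):
    # Format: <hydrobasin id>-<GADM id>-<WHYMAP aquifer id>; "None" marks a missing geometry.
    hybas_id, gadm_id, whymap_id = string_id.split("-")
    return hybas_id, gadm_id, whymap_id

print(split_string_id("434823-CHN.16_1-1626"))  # ('434823', 'CHN.16_1', '1626')
print(split_string_id("None-ALA.13_1-None"))    # ('None', 'ALA.13_1', 'None')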

Auxiliary data:

Raster

PCR-GLOBWB 2 on S3 (GeoTIFF): s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_Convert_NetCDF_Geotiff_V02/output_V02
PCR-GLOBWB 2 on Earth Engine (ImageCollection): projects/WRI-Aquaduct/PCRGlobWB20V0/global_historical_PDomWN_month_m_5min_1960_2014 link
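
A hedged Earth Engine Python API sketch for loading the image collection listed above (assumes the earthengine-api package is installed and you have authenticated; see the setup steps below):

import ee

ee.Initialize()
ic = ee.ImageCollection("projects/WRI-Aquaduct/PCRGlobWB20V0/global_historical_PDomWN_month_m_5min_1960_2014")
print(ic.size().getInfo())               # number of monthly images in the collection
print(ic.first().bandNames().getInfo())  # band names of the first image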

Tabular

on S3: s3://wri-projects/Aqueduct30/processData/Y2018M12D11_RH_Master_Weights_GPD_V02
on BigQuery: aqueduct30:aqueduct30v01.y2018m12d11_rh_master_weights_gpd_v02_v10
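
A hedged sketch of reading this table from Python with the google-cloud-bigquery client (assumes you have query access to the aqueduct30 project and the client library is installed):

from google.cloud import bigquery

client = bigquery.Client(project="aqueduct30")
sql = """
SELECT *
FROM `aqueduct30.aqueduct30v01.y2018m12d11_rh_master_weights_gpd_v02_v10`
LIMIT 10
"""
df = client.query(sql).to_dataframe()  # requires pandas
print(df.head())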

Workflow

Throughout the readme, variables that you need to replace with your own values are indicated with angle brackets: <variableYouNeedToReplace>

If you are not viewing this document on Github, please find a stylized version here
The coding environment uses Docker images that can be found here

This document explains every step of the data processing for Aqueduct 3.0. Everything is here, from raw data to code to explanation. We also explain how you can replicate the calculations on your local machine or in a cloud environment.

The overall structure is as follows:

  • Data is stored on WRI's Amazon S3 Storage
  • Code and versioning are stored on Github
  • The Python environment description is stored in a Docker Image
  • Coding and data operations are done in Jupyter Notebooks

A link to the flowchart: https://docs.google.com/drawings/d/1IjTVlQUHNYj2w0zrS8SKQV1Bpworvt0XDp7UE2tPms0/edit?usp=sharing

Flowchart

Each data source (pristine data), indicated with the open cylinder on the right side, is stored on our S3 drive in the rawData folder: wri-projects/Aqueduct30/rawData

The pristine data is also copied to step 0 in the data processing folder: wri-projects/Aqueduct30/processData

Setup

A link to edit the technical setup drawing: https://docs.google.com/drawings/d/1UR62IEQwQChj2SsksMsYGBb5YnVu_VaZlG10ZGowpA4/edit?usp=sharing

Setup

Getting started

There are two options to set up your working environment:

  • Locally
  • In the cloud (recommended)

Both options are based on Docker and Jupyter. Although you might be able to do the lion's share of the data processing on your local machine, there are good reasons to work with a cloud-based solution:

  • Mount a large hard drive to store the data; you will need approximately 300GB
  • Easy to pick an appropriate instance size (number of CPUs and amount of RAM)

There are also downsides:

  • Additional security steps required
  • Account(s) needed
  • Costs

Locally

Requirements:

  • The Docker image requires approximately 12GB of storage and is not a lightweight solution.
  • If you want to replicate the Aqueduct data processing steps, you will need approximately 300GB of disk space.

If you are on a Windows machine, the standard command prompt is limited. I found it useful to install a custom application to replace the command line, such as ConEmu.

  1. Install Docker Community Edition
    instructions
    For Windows, this requires some additional steps and might require enabling Hyper-V virtualization. In some cases you have to enable this in your BIOS. For WRI Windows 10 Dell Latitude E7250 laptops, the following links are helpful:
    Manually enable Hyper-V
    Troubleshoot
    Adding your user to Docker

  2. Start docker
    You can check whether Docker is installed by typing docker -v in your terminal or command prompt. If you get stuck in one of the next steps or close your terminal window, it is important to understand some basic Docker commands. First, you need to understand the concept of an image and a container. You can list your images using docker images, list your active containers using docker ps, and list all your containers using docker ps -a. If your container is still running, you can bash (open a terminal) into it using docker exec -it <container name> bash. To leave a container's shell, use exit. Furthermore, you can delete containers using docker rm -f <ContainerName> and images using docker rmi <imageName>. I also created a couple of cheatsheets for various tools.

  3. Run a Docker Container:
    docker run --name aqueduct -it -p 8888:8888 rutgerhofste/docker_gis:stable bash
    This will download the Docker image and run a container named aqueduct in -it mode (interactive, tty), forward port 8888 on the container to localhost port 8888, and execute a bash shell. It helps to understand the basics of Docker to follow what you are doing here. Docker will automatically put your terminal or command prompt inside your container. You can tell you are in a container by the prompt in your terminal: it will state something like "root@240c3eb5620e:/#", indicating you are the root user on the virtual machine named "240c3eb5620e". The code will be different in your case.

  4. Setup Security certificates:
    in your container create a certificate by running:
    openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/mykey.key -out /.keys/mycert.pem
    You will be asked some questions (country name, etc.) which you can leave blank; just press return a couple of times.

  5. Clone the Git repository. You have two options here: 1) clone the Aqueduct repository, or 2) create a so-called fork of the Aqueduct project and work in the fork. The first option requires you to be added as a collaborator in order to be able to push your edits to the repo. The latter option allows you to work independently of the official Aqueduct repo; you will need to make a pull request to have your edits incorporated in the main Aqueduct 3.0 repo.

    1. Option 1) Clone original Aqueduct3.0 repository:
      While in your Docker container (root@... $)
      mkdir /volumes/repos (might already exist)
      git clone https://github.com/rutgerhofste/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
    2. Option 2) Fork repository first
      Fork repository on Github
      Learn more about how forking works here
      mkdir /volumes/repos (might already exist)
      git clone https://github.com/<Replace with your Github username>/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
  6. Create a TMUX session before spinning up your Jupyter Notebook server
    Although this is an extra step, it will allow you to have multiple windows open and to detach and attach in case you lose a connection. tmux new -s aqueducttmux

  7. Split your session window into two panes using ctrl-b ". The way TMUX works is that you press ctrl+b, release it and then press ". More info on TMUX. You can change panes by using ctrl-b o (opposite).

  8. In one of your panes, launch a Jupyter Notebook server
    jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --certfile=/.keys/mycert.pem --keyfile=/.keys/mykey.key --notebook-dir=/volumes/repos/Aqueduct30Docker/ --config=/volumes/repos/Aqueduct30Docker/jupyter_notebook_config.py

  9. Open your browser and go to https://localhost:8888
    The standard password for your notebooks is Aqueduct2017!; you can change this later

  10. Congratulations, you can start running code in your browser. This tutorial continues in the section Additional Steps After Starting your Jupyter Notebook server

Cloud based solution (recommended)

  1. Get familiar with how to use Amazon (EC2) or Google Cloud (CE) virtual instances:
    For this I recommend using the tutorials that are available on Amazon's and Google's websites.
    Amazon tutorial

  2. Use the specifics below when setting up your EC2 instance. If you miss one step, your instance will likely not work.

    1. In step 1) select Ubuntu Server 16.04 LTS (HVM), SSD Volume Type
    2. In step 2), if your budget allows, choose T2.Medium
      calculate costs
    3. In step 3) make sure
      1. If you are within a VPC, allow IP addresses to be set
        Auto-assign Public IP = enable
      2. Under advanced details, set user data as a file and upload the startup.sh script from the /other folder on Github.
    4. In step 4) add storage. Depending on the steps in the data process, we recommend setting the size to 200GB.
      calculate costs
    5. In step 5) add tags
      Optionally, you can set a name for your instance
    6. In step 6) set the appropriate security rules.
      This is a crucial step. Eventually we will communicate over SSH (port 22) and HTTPS (port 443). You can whitelist your IP address or allow traffic from everywhere. As a minimum you need to allow SSH and HTTPS from your IP address. If you want to do testing with HTTP you can temporarily allow HTTP (port 80) traffic.
    7. Launch your instance
  3. Connect to your instance using SSH
    For Windows, PuTTY is recommended; for Mac and Linux you can use your terminal.

  4. Once logged in to your system
    check whether Docker is installed: docker version

  5. Download the latest Docker image for Aqueduct. Check https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=rutgerhofste&starCount=0 and run your container:
    docker run --name aqueduct -it -p 8888:8888 rutgerhofste/docker_gis:stable bash

  6. (recommended) Set up HTTPS access. In your container, create a certificate by running
    openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/mykey.key -out /.keys/mycert.pem and answer the questions needed for the certificate

  7. Optional: Setup SSH access keys:
    https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
    ssh-keygen -t rsa -b 4096 -C "[email protected]"
    cat /root/.ssh/id_rsa.pub

  8. Clone your repo in a new folder
    mkdir /volumes/repos
    cd /volumes/repos
    If you set up GitHub SSH (see above):
    git clone [email protected]:rutgerhofste/Aqueduct30Docker.git
    otherwise:
    git clone https://github.com/<Replace with your Github username>/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
    You might have to specify credentials.

  9. Create a TMUX session before spinning up your Jupyter Notebook server.
    Although this is an extra step, it will allow you to have multiple windows open and allows you to detach and attach in case you lose a connection. tmux new -s aqueducttmux

  10. Split your session window into two panes using ctrl-b ". The way TMUX works is that you press ctrl+b, release it and then press ". More info on TMUX. You can change panes by using ctrl-b o (opposite).

  11. Start your notebook with the certificates
    jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --certfile=/.keys/mycert.pem --keyfile=/.keys/mykey.key --notebook-dir=/volumes/repos/Aqueduct30Docker/ --config=/volumes/repos/Aqueduct30Docker/jupyter_notebook_config.py

  12. In your browser, go to https://<your public IP address>:8888. You can find your public IP address on the overview page of Amazon EC2. Your browser will give you a warning because you are using a self-created certificate.
    If you trust your own certificate, click Advanced (Chrome) and proceed to the site. The current config file is password protected; this will be changed to something more generic in the future. If you want to change the password, please see this link

  13. The standard password for your notebooks is Aqueduct2017!; you can change this later

  14. Congratulations, you are up and running. To make the most of these notebooks, you will need to authenticate for a couple of services, including AWS and Google Earth Engine.

Additional Steps After Starting your Jupyter Notebook server

Let's check what we've done so far. You are now able to connect to a Jupyter notebook server that runs either locally or in the cloud. In addition to your browser, you have a terminal (or command prompt) window open with two TMUX panes. One is logging what is happening on your Jupyter notebook server; the other is idle but connected to your container. You can tell whether you are in a container by the username and machine name in your window: it should say something like root@240c3eb5620e:. Remember that you can switch panes with ctrl-b o

  1. Authenticate for AWS
    In your tmux pane type aws configure

You should now be able to provide your AWS credentials. Please ask Susan Minnemeyer if you haven't received them already.

  2. Authenticate for Google Cloud SDK
    Similar to AWS, you might need Google Cloud access.
    gcloud auth login

  3. Authenticate for Earth Engine (if needed, you can also do this from within Jupyter)
    earthengine authenticate
    A quick sanity check for AWS and Earth Engine access is sketched right after this list.
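
After these steps, a hedged sanity-check sketch you can run from a notebook cell to confirm both services are reachable (assumes boto3 and the earthengine-api package are available in the image):

import boto3
import ee

# AWS: list a few objects in the project bucket to confirm your credentials work.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="wri-projects", Prefix="Aqueduct30/finalData/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])

# Earth Engine: initialize and run a trivial request.
ee.Initialize()
print(ee.Image(1).getInfo()["bands"])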

Commit to development

The Docker image comes with git installed and is linked to the following GitHub remote repository: https://github.com/rutgerhofste/Aqueduct30Docker

In order to commit, please run a terminal from the Jupyter main page (top right corner).

Cheatsheet

You can bash into the container using docker exec -it <container ID> bash

Share a repo on hub.docker:
docker login
docker tag image username/repository:tag
e.g.: docker tag friendlyhello rutgerhofste/get-started:part1
docker push username/repository:tag

Identify yourself to git on the server:
git config --global user.email "[email protected]"
git config --global user.name "Rutger Hofste"

Cleanup

Check containers: docker ps -a

docker stop <containerID>

docker rm <containerID>

Check images: docker images

docker rmi <imageID>

Windows, remove dangling images: FOR /f "tokens=*" %i IN ('docker images -q -f "dangling=true"') DO docker rmi %i

Safe way: run bash in the Docker container and use aws configure

aws configure

Use us-east-1 as the default region.

Copy files to volume

aws s3 cp s3://wri-projects/Aqueduct30/rawData/Utrecht/yoshi20161219/waterdemand /volumes/data/ --recursive

If you are using PuTTY and want to edit a file in nano/vim, run export TERM=xterm (workaround for a terminal type bug).

Earth Engine JavaScript API files

The JavaScript files for Earth Engine can be added to your Earth Engine code editor (code.earthengine.google.com) by using the following URL:

https://code.earthengine.google.com/?accept_repo=aqueduct30

Note to self: if you make changes in the online code editor and want to push them to this GitHub repo, use

git clone https://earthengine.googlesource.com/aqueduct30

git pull origin

Recommended: add HTTPS security

Run on your instance (not in docker container)

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem

Put the private and public keys in a folder that matches the path in your Jupyter config file

If needed, change the path in your Jupyter config file
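
For reference, the relevant lines in jupyter_notebook_config.py would look roughly like this (paths as used in the notebook commands above; adjust if yours differ):

c.NotebookApp.certfile = '/.keys/mycert.pem'
c.NotebookApp.keyfile = '/.keys/mykey.key'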

run your container

copy files to container

docker run -it -p 8888:8888 testjupyter:v01 bash

cd /usr/local/bin/

docker images -q --filter "dangling=true" | xargs -r docker rmi

Jupyter Hub

docker run -it -p 8000:8000 rutgerhofste/jupyterhub:v02 bash

Clone the latest git repo

Create certificates:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/jupyterhub.key -out /.keys/jupyterhub.crt

Set environment variables

Create these values: https://github.com/settings/applications/new

export GITHUB_CLIENT_ID=from_github
export GITHUB_CLIENT_SECRET=also_from_github
export OAUTH_CALLBACK_URL=https://[YOURDOMAIN]/hub/oauth_callback
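
A minimal jupyterhub_config.py sketch that picks up these variables (assumes the oauthenticator package is installed; adapt paths and settings to your deployment):

import os

c.JupyterHub.authenticator_class = 'oauthenticator.GitHubOAuthenticator'
c.GitHubOAuthenticator.oauth_callback_url = os.environ['OAUTH_CALLBACK_URL']
c.GitHubOAuthenticator.client_id = os.environ['GITHUB_CLIENT_ID']
c.GitHubOAuthenticator.client_secret = os.environ['GITHUB_CLIENT_SECRET']
c.JupyterHub.ssl_cert = '/.keys/jupyterhub.crt'
c.JupyterHub.ssl_key = '/.keys/jupyterhub.key'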

Run jupyterhub in the folder containing jupyterhub_config.py

jupyterhub

Set up GitHub using SSH

https://jdblischak.github.io/2014-09-18-chicago/novice/git/05-sshkeys.html

cd ~/.ssh
ssh-keygen -t rsa -C "[email protected]"

Use no passphrase and the default folder.

cat ~/.ssh/id_rsa.pub

Add the public key on GitHub

git clone [email protected]:rutgerhofste/Aqueduct30Docker.git

Aqueduct Database Schema

This schema was created using draw.io. File Location: other/ERD.xml

ERD
