Confidential Survey (v 0.2.1)

This is an application for gathering responses from confidential surveys in a way that doesn't result in a large table of sensitive records.

The basic idea is to not store individual form responses as records, but instead to use each survey response only to increment the appropriate counters. This allows us to derive the statistics we ultimately want to measure without assembling a large database of private responses. This principle of collecting only the minimum amount of information is also known as Datensparsamkeit, which is just a cool word to say.

So, if we had a survey on ice cream and we wanted to ask employees:

Do you like ice cream? (Yes/No/Prefer Not To Answer)
What flavors do you like? (Chocolate/Vanilla/...)
What toppings do you want on your sundae? (Sprinkles/Hot Fudge/...)
What is your favorite brand? (Fill in the Blank)

And so on. We could classify these questions into several distinct types to start with:

exclusive: allow only one choice from the available options
exclusive-combo: allow people to select multiple choices, but record the exact value only if they select a single one, and record "combination" if they pick more than one
multiple: record each choice picked by a user
freeform: accept freeform text
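
As a rough illustration of how these types could translate a response into counters, here is a minimal Ruby sketch. The `counter_keys` helper and the shape of the `response` hash are hypothetical and are not taken from the application's code. The key format simply mirrors the `field:value` tallies used in this README.

```ruby
# Hypothetical helper: turn one answer into the counter keys it should increment.
# `type` is one of the question types listed above; `choices` is what the
# respondent selected (an array of strings, or a single string).
def counter_keys(field, type, choices)
  picked = Array(choices).map(&:to_s).reject(&:empty?)
  return [] if picked.empty? # silent decline: nothing is counted

  case type
  when "exclusive"
    ["#{field}:#{picked.first}"]
  when "exclusive-combo"
    # One pick is recorded exactly; several picks collapse into "combination".
    [picked.size == 1 ? "#{field}:#{picked.first}" : "#{field}:combination"]
  when "multiple"
    picked.map { |choice| "#{field}:#{choice}" }
  when "freeform"
    ["#{field}:#{picked.first.strip}"]
  else
    raise ArgumentError, "unknown question type: #{type}"
  end
end

# Only the tallies are kept; the response itself is thrown away afterwards.
def record_response(tallies, response)
  response.each do |field, (type, choices)|
    counter_keys(field, type, choices).each { |key| tallies[key] += 1 }
  end
  tallies
end

tallies = record_response(
  Hash.new(0),
  "like_ice_cream" => ["exclusive", "yes"],
  "flavor"         => ["exclusive-combo", %w[chocolate vanilla]],
  "toppings"       => ["multiple", %w[sprinkles coconut]],
  "brand"          => ["freeform", "Blue Bell"]
)
```

In this sketch the respondent who picked two flavors is counted only under `flavor:combination`, so the exact pair is never stored.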

A survey about ice cream is admittedly a dumb example. It's something you could create with an existing public service like SurveyMonkey or Google Forms. Imagine however that we wanted to ask questions about something more confidential like employee diversity or sexual orientation. These systems all collect individual responses as database records or rows in a spreadsheet. While they are probably secure, why do I need this detailed information if I am only going to generate summary statistics anyway? Individual responses might be anonymous, but may endanger a respondent's privacy when combined together in a query. Why should I be asking people to trust me that nobody will use these records to drill down and do something awful like count how many LGBT people are in the accounting department of the NYC office? What if the data collection only allowed for pre-approved interpretations?

This program is written to automatically preserve privacy by discarding survey submissions and using them just to increment counters like this:

Survey: ice-cream

like_ice_cream:yes 85
like_ice_cream:no 23
like_ice_cream:decline 5
flavor:chocolate 83
flavor:vanilla 45
flavor:strawberry 12
flavor:combination 34
toppings:sprinkles 83
toppings:coconut 7
toppings:none 83
brand:Blue Bell 43
brand:Gifford's 8

If we also wanted to drill down on the intersections between two fields, we could specify that in a configuration in advance (this system is designed to prevent such analysis after the fact):

flavor:chocolate|toppings:sprinkles 47
flavor:chocolate|toppings:coconut 2

and so on...

Be careful: This functionality is meant for very broad intersections like engineering/non-engineering AND gender for instance. Finer-grained intersections that span many fields and result in only a few responses could harm the privacy of individuals.
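
To make the idea concrete, here is a small Ruby sketch of how pre-declared intersections might be turned into combined counter keys, plus a check for cells that are too small to publish safely. The `INTERSECTIONS` constant, both helper names, and the threshold of 5 are assumptions for illustration only, not the application's actual configuration or a recommended cutoff.

```ruby
# Hypothetical: intersections are declared up front, before any data exists.
INTERSECTIONS = [%w[flavor toppings]].freeze

# `answers` maps each field to the value (or values) recorded for it.
# Returns one "a:x|b:y" key per combination of values across the listed fields.
def intersection_keys(answers, intersections = INTERSECTIONS)
  intersections.flat_map do |fields|
    values = fields.map { |f| Array(answers[f]).map { |v| "#{f}:#{v}" } }
    next [] if values.any?(&:empty?) # a field was left unanswered

    values.first.product(*values.drop(1)).map { |combo| combo.join("|") }
  end
end

# Flag intersection cells so small that they could single someone out.
def small_cells(tallies, minimum = 5)
  tallies.select { |key, count| key.include?("|") && count < minimum }
end

keys = intersection_keys("flavor" => "chocolate", "toppings" => %w[sprinkles coconut])
risky = small_cells(
  "flavor:chocolate|toppings:sprinkles" => 47,
  "flavor:chocolate|toppings:coconut"   => 2
)
```

Here `risky` contains only the coconut cell, which is the kind of narrow intersection the warning above is about.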

This program has the following components:

A simple single-table DB schema for storing the counters
A way to represent survey forms with YAML for easy rendering into forms
The ability to specify intersections between variables you want more detailed breakdowns of
A simple JSON API endpoint for returning the data collected to authorized administrators.

Local Development

The survey application is written as a Ruby on Rails application running on Ruby 2.3.0. Most of its libraries are available as gems that can be installed by bundler. It uses PostgreSQL as its database, so you will need to have that installed.

To get a local copy running

git clone git@github.com:18F/confidential-survey.git
cd confidential-survey
bundle install
bundle exec rake db:setup
export SURVEY_ADMIN_NAME=debug
export SURVEY_ADMIN_PASSWORD=debug
bundle exec rails server

Then you can go to http://localhost:3000/surveys/sample-survey and you should see a survey you can fill out. If you visit an administrator-protected route, it should prompt you for the username and password set above.

Testing

bundle exec rake

should execute the tests. All tests are written in RSpec.

Deploying the Application

This application is deployed on the cloud.gov PaaS which runs on Cloud Foundry. The following instructions are 18F-specific, but could easily be adapted for other Cloud Foundry instances or other web hosts.

Create the app (it's ok if the deploy fails):

cf push survey

Create the database service:

cf create-service rds shared-psql survey-psql

Set environment variables with cf set-env:

cf set-env survey SURVEY_ADMIN_NAME [username]
cf set-env survey SURVEY_ADMIN_PASSWORD [password]

The application is currently secured in production with blanket HTTP Authentication, so you will need to set its username and password. The same variables also need to be set to run the app in cf ssh, so they have to be set twice.

Set up the database:

cf-ssh
bundle exec rake db:migrate
bundle exec rake db:seed

Restage the app:

cf restage survey

To deploy future releases:

cf push survey

Deploying a New Survey

Surveys are implemented as YAML configuration files within the config/surveys directory of the application (a sample survey is included in the repo). Surveys do not need to be, and probably should not be, checked into the repo.

To make a new survey live, the app (with the survey file in its config/surveys directory) must be deployed to production. This limits the ability to create or edit surveys on the system to the lead developer or anybody else with deploy access to the specific space. If the survey is named SURVEY_NAME.yml, the new survey form is accessible at /surveys/SURVEY_NAME.
To mark a live survey as inactive – meaning that it no longer accepts responses – the developer has to edit a field in the survey's YAML configuration to be active: false and redeploy the survey.
To delete the survey form entirely, the developer can delete the survey's YAML file and redeploy. This will not remove the counts recorded for the survey from the database.

The survey name is used to key all tallies for its responses in the system. This means that changing the survey name/URL will reset all its tallies to 0 unless you rename all the old rows to use the new ID.

Access Control

The survey application supports two different modes of securing access:

One-time use tokens that can be distributed to a population (default)
A single HTTP authentication username/password shared across all users

Neither of these schemes is meant to identify specific users for a survey. The goal of these tools is merely to limit access to surveys so that they can be taken only by people who are supposed to take the survey.

Token Access

The token scheme requires the survey administrators to generate a pool of tokens for the survey. These can then be distributed out to survey participants. It is best that whoever is doing this distribution does not retain a list of which tokens are sent to which users, since that information could potentially be used by someone with database access to identify people who have not taken the survey.

To generate tokens, an administrator can send a GET or POST request to /surveys/SURVEY-NAME/token. This will generate a token linked to the survey and return a URL that can be given to a single user for taking the survey. This endpoint can be called to return a batch of tokens by appending an n= argument to the request. Here is an example of calling it on a development instance running on localhost:

curl --user ${SURVEY_ADMIN_NAME}:${SURVEY_ADMIN_PASSWORD} http://localhost:3000/surveys/sample-survey/token\?n\=10

http://localhost:3000/surveys/sample-survey?token=z9OJSmzFZcKWDpXlnt1LPA
http://localhost:3000/surveys/sample-survey?token=wE-gRGcI0ayHH3Q8qW5MtA
http://localhost:3000/surveys/sample-survey?token=Hi59JzRPbXOAN9Mu2876sg
http://localhost:3000/surveys/sample-survey?token=FU7bwF29kKqcV-27lAIfCQ
http://localhost:3000/surveys/sample-survey?token=Wm-pvsfkr20y-pGALiYjuw
http://localhost:3000/surveys/sample-survey?token=FmOml8wTKJo7mHAjf_8y8A
http://localhost:3000/surveys/sample-survey?token=xKquRdHvi0YpJ2iADxpZpw
http://localhost:3000/surveys/sample-survey?token=PHPd_SW5i-AzZaIUscl13w
http://localhost:3000/surveys/sample-survey?token=iqQPTzQ21pdEaKjROb6Ozw
http://localhost:3000/surveys/sample-survey?token=C7Zg2J_1nyFpW-dWms-gNQ
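
For scripting, the same request could be made from Ruby with the standard library's Net::HTTP. This is only a sketch: the helper names `token_request` and `fetch_token_urls` are hypothetical, and it assumes the endpoint returns one survey URL per line, as in the sample output above.

```ruby
require "net/http"
require "uri"

# Build (but do not send) the authenticated request for a batch of tokens.
def token_request(base_url, survey, count, user, password)
  uri = URI.join(base_url, "/surveys/#{survey}/token")
  uri.query = URI.encode_www_form(n: count)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password) # same credentials curl passes via --user
  [uri, request]
end

# Send the request and return the survey URLs from the response body.
# Assumes a plain-text body with one URL per line.
def fetch_token_urls(base_url, survey, count, user, password)
  uri, request = token_request(base_url, survey, count, user, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.body.lines.map(&:strip).reject(&:empty?)
end

# Example against a local development server (must be running):
# urls = fetch_token_urls("http://localhost:3000", "sample-survey", 10,
#                         ENV["SURVEY_ADMIN_NAME"], ENV["SURVEY_ADMIN_PASSWORD"])
```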

Once a user uses this URL to fill out the survey, the token will be revoked and the URL will not work again. This means that the same URL should not be given to several users. The token is only used for access and does not identify a respondent in any way. There is no issue with generating many extra tokens that aren't used, and tokens can be generated at any time when a survey is active. To close access to a survey, all tokens can be revoked by an administrator.

curl --user ${SURVEY_ADMIN_NAME}:${SURVEY_ADMIN_PASSWORD} http://localhost:3000/surveys/sample-survey/revoke
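
The one-time-use behavior can be pictured with a small in-memory sketch. The `TokenPool` class below is hypothetical and stands in for the database-backed SurveyToken model. It only illustrates the semantics described above: a token works once, and revoking discards everything that is still outstanding.

```ruby
require "securerandom"
require "set"

# In-memory stand-in for the token table (illustration only).
class TokenPool
  def initialize
    @tokens = Set.new
  end

  # Mint `n` fresh tokens and return them for distribution.
  def generate(n = 1)
    Array.new(n) { SecureRandom.urlsafe_base64(16).tap { |t| @tokens << t } }
  end

  # Returns true exactly once per token. The token is deleted on use,
  # so nothing links it to the response that follows.
  def redeem(token)
    !@tokens.delete?(token).nil?
  end

  # Close access to the survey by discarding every outstanding token.
  def revoke_all
    @tokens.clear
  end
end

pool = TokenPool.new
first, second = pool.generate(2)
pool.redeem(first)  # accepted
pool.redeem(first)  # rejected: already used
pool.revoke_all
pool.redeem(second) # rejected: revoked before use
```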

Tokens are generated by the SurveyToken model using Ruby's SecureRandom module, which relies on system libraries for randomness and entropy. Currently, each token is a 16-byte (128-bit) random number, meaning any single guess has a 1 in 2^128 (about 3.40282367x10^38) chance of matching a given token. All of this does assume the SecureRandom library has no issues that weaken random number generation.
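
The arithmetic can be checked in a few lines of Ruby. The use of `SecureRandom.urlsafe_base64` here is an assumption based on the shape of the sample tokens above (22 URL-safe characters, which is what 16 bytes encode to without padding). The model's actual call may differ.

```ruby
require "securerandom"

TOKEN_BYTES = 16

# Assumed generation call; produces a 22-character URL-safe string.
def new_token
  SecureRandom.urlsafe_base64(TOKEN_BYTES)
end

# Number of distinct 16-byte values: 2^128.
POSSIBLE_TOKENS = 2**(TOKEN_BYTES * 8)

puts new_token.length                # 22
puts format("%.8e", POSSIBLE_TOKENS) # 3.40282367e+38
```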

HTTP Authentication

Alternatively, you can specify that the tool should use blanket HTTP authentication to protect the survey form. This requires you to add 2-3 fields to the survey YAML to indicate that you want to use HTTP authentication:

access:
    type: http_auth
    user: <username>
    password: <password>

This will then require HTTP authentication for users to access / submit the surveys. There are a few caveats to this approach:

It is up to you to use a sufficiently secure password.
Since the same credentials are shared across all users, there is nothing to prevent ballot-box stuffing.
Surveys protected by HTTP auth must be set to active: false and redeployed to close them, since they do not rely on access tokens that could be revoked.

Notes on Survey Construction

I am not a lawyer. Neither is this application. Just because you can use this program to create a survey for people like employees or students, this application doesn't grant you the legal or moral right to do so. Please consult with the appropriate people first.
Whenever possible, users should be presented with an option to explicitly decline to answer. Users always have the option of silently declining by not selecting any choice, but unlike an active decline, those silent refusals are simply not counted.
Intersections should be used sparingly and in such a way that a specific subpopulation cannot be used to deanonymize survey respondents.
Administrators could conceivably forge responses/stuff ballot boxes/over-represent certain individuals by minting as many tokens as they wanted. This is not a tool for elections.

Caveats About Anonymity

This program is written to minimize the amount of information collected to help preserve the anonymity of respondents, but I cannot explicitly guarantee that respondents will always be anonymous. There are a few ways in which anonymity could potentially be compromised:

If an attacker has a list of which tokens were distributed to which users, they could use this information to figure out who has NOT taken the survey unless all tokens are scrubbed with the revoke request. For this reason, whoever is distributing the tokens should ideally not keep a list of who has what tokens at all, and should not share any information with an administrator who has access to the database.
If the attacker had the ability to monitor incoming requests to the application as well, they could see a specific user's responses when they were submitted with the user's token.
If the attacker has the ability to view the database, they could reverse engineer survey responses by capturing the tallies on a quick interval and looking for differences in counts. Keep your database secure.
Server logs could conceivably leak information about surveys. This application does not keep logs about submissions, but proper care should be taken to scrub logs at load balancers as well. In addition, IP addresses could be used to identify whether a user has participated in the survey even if the communication and responses are secure. For maximum safety, use Tor.
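
To show why database access matters, here is a minimal Ruby sketch of the snapshot-diffing attack mentioned above. The `tally_diff` helper is hypothetical, and the counts are made-up values in the style of the ice cream example.

```ruby
# Difference between two snapshots of the tallies (key => count).
# Any key whose count changed between snapshots appears in the result.
def tally_diff(before, after)
  after.each_with_object({}) do |(key, count), diff|
    delta = count - before.fetch(key, 0)
    diff[key] = delta unless delta.zero?
  end
end

before = { "like_ice_cream:yes" => 85, "flavor:chocolate" => 83, "toppings:none" => 83 }
after  = { "like_ice_cream:yes" => 86, "flavor:chocolate" => 84, "toppings:none" => 83 }

leaked = tally_diff(before, after)
```

If only one person submitted between the two snapshots, `leaked` is effectively that person's full response, which is exactly what the counters were meant to avoid storing.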

Why Is There a Session Cookie?

The application will set a session cookie, which seems like something that will undermine the promises of anonymity. Unfortunately, I need to use that cookie for Rails' protection against Cross-Site Request Forgery (CSRF) with the form. Rails' form classes provide that protection automatically. The survey application emphatically does not use the session cookie to store or retrieve any other information, and it does not set any other cookies.

Security Scans

This repository uses two tools to provide a total of three types of automated security checks:

Brakeman provides static code analysis.
Hakiri is used to ensure the Rails/Ruby versions contain no known CVEs.
Hakiri is used to ensure the gems declared in the Gemfile contain no known CVEs.

All security scans are built into the test suite. bundle exec rake spec will run them. To run the security scans ad hoc:

Brakeman:

bundle exec brakeman

Hakiri for Ruby/Rails versions:

bundle exec hakiri system:scan -m hakiri_manifest.json

Hakiri for Gemfile dependency versions:

bundle exec hakiri gemfile:scan

Ignored Brakeman warnings

Sometimes Brakeman will report a false positive. In cases like these, the warnings will be ignored. Ignored warnings are declared in config/brakeman.ignore. This file contains a machine-readable list of all ignored warnings. Any ignored warning will contain a note explaining (or linking to an explanation of) why the warning is ignored.

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.
