
COVID19Tracking / covid19-datafetcher

License: Apache-2.0
Fetch COVID19 data published by US states.

Programming Languages

Python, Jupyter Notebook, JavaScript, Shell, HTML

Projects that are alternatives of or similar to covid19-datafetcher

Go Tooling Workshop
A workshop covering all the tools gophers use in their day to day life
Stars: ✭ 2,683 (+8284.38%)
Mutual labels:  tooling
storybook-addon-props
React Storybook Addon to show component properties and stories into panels
Stars: ✭ 22 (-31.25%)
Mutual labels:  tooling
cljs-tooling
[DEPRECATED] Tooling support for ClojureScript
Stars: ✭ 58 (+81.25%)
Mutual labels:  tooling
spicedb
Open Source, Google Zanzibar-inspired fine-grained permissions database
Stars: ✭ 3,358 (+10393.75%)
Mutual labels:  production
CFE-Blank-Project
A blank Django Starter Project that includes Docker support.
Stars: ✭ 17 (-46.87%)
Mutual labels:  production
docker-compose-laravel
A Docker Compose setup for Laravel projects.
Stars: ✭ 23 (-28.12%)
Mutual labels:  tooling
Orchard
A fertile ground for Clojure tooling
Stars: ✭ 219 (+584.38%)
Mutual labels:  tooling
s3-concat
Concatenate Amazon S3 files remotely using flexible patterns
Stars: ✭ 32 (+0%)
Mutual labels:  tooling
PredictionAPI
Tutorial on deploying machine learning models to production
Stars: ✭ 56 (+75%)
Mutual labels:  production
React.ai
It recognizes your speech and a trained AI bot responds (e.g., customer service, personal assistant) using Machine Learning APIs (DialogFlow, apiai), Speech Recognition, GraphQL, Next.js, React, Redux
Stars: ✭ 38 (+18.75%)
Mutual labels:  production
swift-watch
Watches over your Swift project's source
Stars: ✭ 43 (+34.38%)
Mutual labels:  tooling
errors
errors with paired message and caller stack frame
Stars: ✭ 19 (-40.62%)
Mutual labels:  production
ai4prod
Ai4Prod is an ecosystem that makes it easy for machine learning engineers to use AI in production with C++.
Stars: ✭ 17 (-46.87%)
Mutual labels:  production
source
Source: a component library for the Guardian's Design System
Stars: ✭ 97 (+203.13%)
Mutual labels:  production
mobile-apps-article-templates
Templates for articles on The Guardian iOS and Android apps
Stars: ✭ 35 (+9.38%)
Mutual labels:  production
Bootboot
Dualboot your Ruby app made easy
Stars: ✭ 239 (+646.88%)
Mutual labels:  tooling
Daxif
A framework for automating many xRM development processes. Simple F# script commands/files, built on Delegate's DAXIF# library, can save a lot of time and effort.
Stars: ✭ 37 (+15.63%)
Mutual labels:  tooling
covid-data-pipeline
Scan/Trim/Extract Pipeline for State Coronavirus Sites
Stars: ✭ 15 (-53.12%)
Mutual labels:  tooling
fliphub
the easiest app builder
Stars: ✭ 30 (-6.25%)
Mutual labels:  tooling
analysis-flow
Data Analysis Workflows & Reproducibility Learning Resources
Stars: ✭ 108 (+237.5%)
Mutual labels:  tooling

As of March 7, 2021, we are no longer collecting new data. Learn about available federal data.


COVID-19 Data Fetchers

Fetch COVID19 data published by US states and territories.

The goal of this project is to fetch the most recent COVID-19 data from US states and territories and publish it for easy consumption and use by the Covid Tracking Project.

For context, the data collection project started under the following assumptions:

  1. The data to collect is always structured, and comes from APIs dedicated to this task
  2. It's a short-term fix, intended to run for a month, at most two, but not more

Both assumptions were broken pretty quickly.

TL;DR

A timed trigger (set to 8 minutes) runs all the source queries and collects the results. The results are aggregated into a CSV file that I publish and push to a Google spreadsheet. Fin.

The biggest value of this repository comes from (1) the list of state data sources and (2) the mappings that translate state-specific property names to a common terminology (e.g., T_Pos_Count to POSITIVE).
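
For illustration, here is a minimal sketch of how such a mapping might be applied. Apart from T_Pos_Count, the mapping entries and the remap helper are hypothetical, not taken from the repository:

import pprint

# Hypothetical mapping from state-specific field names to the common terminology.
MAPPING = {
    "T_Pos_Count": "POSITIVE",   # example from this README
    "T_Neg_Count": "NEGATIVE",   # assumed analogous field
    "Death_Count": "DEATH",      # assumed analogous field
}

def remap(record):
    """Rename the keys of a raw state record to the common terminology."""
    return {MAPPING[k]: v for k, v in record.items() if k in MAPPING}

raw = {"T_Pos_Count": 1234, "T_Neg_Count": 56789, "Death_Count": 21}
pprint.pprint(remap(raw))  # {'DEATH': 21, 'NEGATIVE': 56789, 'POSITIVE': 1234}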

Long Version

This project started as a way to automate daily COVID-19 data-entry shifts, using the APIs that back the dashboards states publish.

Ideal World


I initially listed all ArcGIS dashboards and extracted useful layers and queries. I also created a mapping from state terminology to a common terminology, roughly matching CTP's tracked fields (e.g., T_Pos_Count to POSITIVE).

This was great: I had a list of parameterized queries to run and a single 20-line program to query and tag everything. Everything worked reliably and quickly, but it wasn't enough.
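
A rough illustration of what that kind of program looks like; the service URL and layer below are hypothetical, but the query parameters are the standard ArcGIS REST ones:

import requests

# Hypothetical parameterized queries, one per state.
QUERIES = [
    ("CA", "https://services.arcgis.com/example/arcgis/rest/services/covid/FeatureServer/0/query"),
]
PARAMS = {"where": "1=1", "outFields": "*", "f": "json"}

def fetch_arcgis(url):
    """Run a standard ArcGIS feature query and return the raw attribute dicts."""
    resp = requests.get(url, params=PARAMS, timeout=30)
    resp.raise_for_status()
    return [f["attributes"] for f in resp.json()["features"]]

for state, url in QUERIES:
    for record in fetch_arcgis(url):
        print(state, record)  # tag and remap each record here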

Reality

Not all states use ArcGIS (too bad). The ones that do might not have all their data in ArcGIS and may use additional systems.

To increase coverage, I added more sources:

  • ArcGIS (the best! CKAN has a better API, but ArcGIS is easier to explore)
  • CKAN (used by a couple of states; a good SQL-like API that also powers data.gov)
  • Socrata (SODA; an OK JSON API)
  • JSON
  • CSV (and zipped CSV)
  • Excel (xlsx) files
  • HTML -- scraping the page


Pretty much everything requires custom code now (except for the states that use ArcGIS).
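
As a hedged sketch (the source records and parser names are made up, not the repository's actual structure), the per-source dispatch might look like this:

import csv
import io
import json
import requests

def parse_json(text):
    return json.loads(text)

def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical dispatch table; HTML, Excel, etc. need custom code per state.
PARSERS = {"json": parse_json, "csv": parse_csv}

def fetch(source):
    """source is a hypothetical record like {"url": ..., "type": "json"}."""
    resp = requests.get(source["url"], timeout=30)
    resp.raise_for_status()
    return PARSERS[source["type"]](resp.text)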

Data

The goal of this project is to automatically collect the different datasets we display and aggregate for the Covid Tracking Project.

There are three datasets we're tackling now: the main numbers for states, racial data tracking, and historic time series.

In the context of this project, a dataset is a collection of sources for each state that include query URLs, mappings, and auxiliary code (when needed). This is not the actual data being fetched, but the instructions to fetch it.

Available Datasets

  • States (states): COVID-19 current state data tracking (cases, testing, deaths, etc.)
  • CRDT (races): COVID-19 racial data tracking
  • Historic Backfill (backfill): fetcher for time-series data, to handle states that update past days (continuously or as one-offs)

Dataset Structure

Under the root of the project there's a folder called dataset with all the supported datasets.
The general structure is a yaml file with the specific dataset config, and a folder by the same name (for easy association) with the actual files defining the dataset.

dataset
├── {dataset_name}.yaml
└── {dataset_name}
    ├── mappings.yaml
    └── urls.yaml

We currently have 3 datasets, and this is how it looks:

dataset
├── backfill.yaml
├── backfill
│   ├── mappings.yaml
│   └── urls.yaml
├── races.yaml
├── races
│   ├── mappings.yaml
│   └── urls.yaml
├── states.yaml
└── states
    ├── mappings.yaml
    └── urls.yaml
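
To give a feel for what these files hold, here is a hypothetical pair of entries (not copied from the repo) and how they might be loaded with PyYAML:

import yaml

# Hypothetical contents of urls.yaml and mappings.yaml.
URLS_YAML = """
CA: https://example.ca.gov/api/covid/query?f=json
MT: https://example.mt.gov/covid/data.csv
"""
MAPPINGS_YAML = """
CA:
  T_Pos_Count: POSITIVE
  T_Neg_Count: NEGATIVE
"""

urls = yaml.safe_load(URLS_YAML)
mappings = yaml.safe_load(MAPPINGS_YAML)
print(urls["CA"], mappings["CA"]["T_Pos_Count"])  # -> <url> POSITIVE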

Sometimes there's a need for special casing in the code (e.g., when scraping a page) and yaml files are not enough. Each dataset can define an extras module that handles the parsing of responses when the default parsing is not sufficient.
By default, the extras module is defined as fetcher.extras.${dataset.name}, which resolves to a file by the same name as the dataset in the fetcher/extras folder.

# In the dataset config yaml file:
extras_module: fetcher.extras.${dataset.name}

# To remove the extras module when it's not needed:
extras_module: null
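
A minimal sketch of what an extras module might contain; the function name, selector, and page structure are hypothetical, and the real module interface may differ:

# fetcher/extras/states.py (hypothetical contents)
from bs4 import BeautifulSoup

def handle_html(html):
    """Special-case parser: scrape a state's dashboard page when the
    default JSON/CSV parsing is not sufficient."""
    soup = BeautifulSoup(html, "html.parser")
    positive = soup.select_one("#positive-cases")  # hypothetical selector
    return {"POSITIVE": int(positive.text.replace(",", ""))}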

Code

Setting up Environment and Running the Scripts

I use conda locally and on the server that runs the periodic task. Between BeautifulSoup, Pandas and libraries to parse Excel files, it's a huge environment.

Get the code

git clone https://github.com/space-buzzer/covid19-datafetcher.git
cd covid19-datafetcher

Create Conda environment

conda env create -f environment.yml
conda activate c19-data

Run scripts

# fetch the default dataset (states) for all states
python get_my_data.py

# fetch the default dataset (states) for the specified state/states
python get_my_data.py state=CA
# or
python get_my_data.py state=[CA,MT]

The output will be in states.csv.

To fetch a different dataset, use the dataset=DATASET argument:

python get_my_data.py dataset=races
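
The output is a plain CSV, so it can be inspected with pandas; the state column name below is an assumption about the output schema, not a guarantee:

import pandas as pd

df = pd.read_csv("states.csv")
print(df.head())
print(df[df["state"] == "CA"])  # assuming a 'state' column exists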

Project Structure

Publishing Flow

There are a few cron-triggered workflows:

  • Fetch the repository from GitHub and fast-forward it (set to 1h now)
  • Generate the index page with source links
  • Run the script (set to 8 min now)
  • (After running) push the CSV to Google Spreadsheets (code not in this repo; publishing it is TBD)
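
Since that push code is not published, here is only a hedged illustration of what a push step built on the gspread library could look like; the credentials file and spreadsheet title are hypothetical:

import csv
import gspread

# Hypothetical service-account credentials and spreadsheet title.
gc = gspread.service_account(filename="service_account.json")
sheet = gc.open("covid19-data").sheet1

with open("states.csv", newline="") as f:
    rows = list(csv.reader(f))

sheet.update("A1", rows)  # overwrite the sheet starting at A1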