MongoDB Open Data COVID-19
This project retrieves and inserts into MongoDB the Johns Hopkins University COVID-19 dataset provided on Github.
Please read the blog post associated to this repository.
Databases and Collections
Database covid19
This database contains a few collections that have been carefully engineered to be as useful and as convenient as possible to work with.
In each of these collections:
- the keys have been renamed correctly to provide more consistency across the different collections,
- the fields have been casted into their correct types,
- the loc fields contains GeoJSON points,
- the date are ISODates,
- etc.
5 collections are available:
- metadata
- global (the data from the time series global files)
- us_only (the data from the time series US files)
- global_and_us (the most complete one)
- countries_summary (same as global but countries are grouped in a single doc for each date)
Collection metadata
This collection contains only one single document. It contains the list of all the values (obtained with mongodb distinct function) for the major fields along with the first and last dates.
{
_id : "metadata",
countries : [ "Afghanistan", "Albania", "Algeria", "..." ],
states : [ "Alabama", "Alaska", "Alberta", "..." ],
states_us : [ "Alabama", "Alaska", "American Samoa", "..." ],
counties : [ "Abbeville", "Acadia", "Accomack", "..." ],
iso3s : [ "ABW", "AFG", "AGO", "..." ],
uids : [ 4, 8, 12, ... ],
first_date : 2020-01-22T00:00:00.000+00:00,
last_date : 2020-04-24T00:00:00.000+00:00
}
Collection global
This collection contains the equivalent of what is in the 3 documents:
time_series_covid19_confirmed_global.csv
time_series_covid19_deaths_global.csv
time_series_covid19_recovered_global.csv
Each document is also joined with its associated line from the UID_ISO_FIPS_LookUp_Table.csv
file.
In this collection we have nb_entries(time_series_covid19_confirmed_global.csv) * number_days
documents.
Here is an example document:
{
"_id" : ObjectId("5ea45f2a8049cddb8cfa3822"),
"uid" : 4,
"country_iso2" : "AF",
"country_iso3" : "AFG",
"country_code" : 4,
"country" : "Afghanistan",
"combined_name" : "Afghanistan",
"population" : 38928341,
"loc" : {
"type" : "Point",
"coordinates" : [
67.71,
33.9391
]
},
"date" : ISODate("2020-01-22T00:00:00Z"),
"confirmed" : 0,
"deaths" : 0,
"recovered" : 0
}
Collection us_only
This collection contains the equivalent of what is in the 2 documents:
time_series_covid19_confirmed_US.csv
time_series_covid19_deaths_US.csv
Each document is also joined with its associated line from the UID_ISO_FIPS_LookUp_Table.csv
file.
In this collection we have nb_entries(time_series_covid19_confirmed_US.csv) * number_days
documents.
Here is an example document:
{
"_id" : ObjectId("5ea45f2b8049cddb8cfa9912"),
"uid" : 16,
"country_iso2" : "AS",
"country_iso3" : "ASM",
"country_code" : 16,
"fips" : 60,
"state" : "American Samoa",
"country" : "US",
"combined_name" : "American Samoa, US",
"population" : 55641,
"loc" : {
"type" : "Point",
"coordinates" : [
-170.132,
-14.271
]
},
"date" : ISODate("2020-01-22T00:00:00Z"),
"confirmed" : 0,
"deaths" : 0
}
Note: JHU does not provide recovered data for the US files. It's currently only available in the global files.
Collection countries_summary
This collection is calculated using the data from the global
collection. It's the same collection but the countries are grouped together into a single document for each date.
So in this collection, you will find nb_countries * nb_days
documents.
Also, because all the states are grouped into a single doc for each countries, some fields are arrays in this collection.
You can see the detailed aggregation pipeline in the 2-smart-insert.py
file in the data-import
folder.
Here is an example for France:
{
"_id" : ObjectId("5eb1e2bdeb5fc5a3a38a33f8"),
"uids" : [ 312, 175, 638, 666, 258, 250, 663, 254, 652, 540, 474 ],
"confirmed" : 169583,
"deaths" : 25204,
"country" : "France",
"date" : ISODate("2020-05-04T00:00:00Z"),
"country_iso2s" : [ "GF", "MF", "PF", "YT", "GP", "RE", "PM", "MQ", "FR", "BL", "NC" ],
"country_iso3s" : [ "SPM", "NCL", "REU", "BLM", "MAF", "MYT", "MTQ", "PYF", "GLP", "FRA", "GUF" ],
"country_codes" : [ 652, 474, 666, 175, 250, 258, 254, 663, 312, 638, 540 ],
"combined_names" : [
"Saint Pierre and Miquelon, France",
"Guadeloupe, France",
"Reunion, France",
"New Caledonia, France",
"Saint Barthelemy, France",
"France",
"Martinique, France",
"Mayotte, France",
"French Polynesia, France",
"French Guiana, France",
"St Martin, France"
],
"population" : 298682,
"recovered" : 51476,
"states" : [
"French Guiana",
"French Polynesia",
"Guadeloupe",
"Mayotte",
"New Caledonia",
"Reunion",
"Saint Barthelemy",
"St Martin",
"Martinique",
"Saint Pierre and Miquelon"
]
}
Collection global_and_us
This collection is the most complete collection in this database. This collection basically contains all the documents from the collections:
global
us_only
But with a little trick on top: the US cases are counted in both collections respectively:
- at a country level in the first one,
- and in a more detailed level (county and state) in the second one.
So to take advantages of both collections, I just removed the confirmed and deaths counts from the US documents which comes from the global
collection.
This allow me to keep track of the recovered cases in the US while also keeping track of the confirmed and deaths cases at a more detailed level. This is really the best we can do here because JHU don't reported recovered cases at a detailed level for the US.
With this trick, the count for confirmed, deaths and recovered persons for a given date is correct.
This is the collection I'm using to build my charts in my charts blog posts:
The documents in this collection are exactly the same than in the collections mentioned above.
- For a document that comes from the
global
collection:
{
"_id" : ObjectId("5ea49768865a48ecca6d5ccb"),
"uid" : 250,
"country_iso2" : "FR",
"country_iso3" : "FRA",
"country_code" : 250,
"country" : "France",
"combined_name" : "France",
"population" : 65273512,
"loc" : {
"type" : "Point",
"coordinates" : [
2.2137,
46.2276
]
},
"date" : ISODate("2020-04-24T00:00:00Z"),
"confirmed" : 158636,
"deaths" : 22245,
"recovered" : 43493
}
- For a document that comes from the
us_only
collection:
{
"_id" : ObjectId("5ea4976b865a48ecca70df4d"),
"uid" : 84042101,
"country_iso2" : "US",
"country_iso3" : "USA",
"country_code" : 840,
"fips" : 42101,
"county" : "Philadelphia",
"state" : "Pennsylvania",
"country" : "US",
"combined_name" : "Philadelphia, Pennsylvania, US",
"population" : 1584064,
"loc" : {
"type" : "Point",
"coordinates" : [
-75.1379,
40.0034
]
},
"date" : ISODate("2020-04-24T00:00:00Z"),
"confirmed" : 11877,
"deaths" : 449
}
- For the special document that comes from the
global
collection and represents the US entire country (the one with the trick):
{
"_id" : ObjectId("5ea49768865a48ecca6d84d1"),
"uid" : 840,
"country_iso2" : "US",
"country_iso3" : "USA",
"country_code" : 840,
"country" : "US",
"combined_name" : "US",
"population" : 329466283,
"loc" : {
"type" : "Point",
"coordinates" : [
-100,
40
]
},
"date" : ISODate("2020-04-24T00:00:00Z"),
"recovered" : 99079
}
Just pay attention: the documents which come from the US detailed source don't have a recovered
field because JHU doesn't provide this data. Only the documents (one for each date) that represents the US at a country level contain the recovered
field.
Also JHU doesn't provide the recovered count for all the countries and states in the global files.
In this collection, we have (nb_entries(time_series_covid19_confirmed_global.csv) + nb_entries(time_series_covid19_confirmed_global.csv)) * number_days
.
Dabatase covid19jhu
This database contains the raw CSV files imported with the mongoimport tool.
This database is updated by the 1-mongoimport-everything.sh
in the data-import
folder.
This script imports all the files matching these names with wildcards:
- jhu/csse_covid_19_data/csse_covid_19_daily_reports/*.csv
- jhu/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv
- jhu/csse_covid_19_data/csse_covid_19_time_series/*.csv
- jhu/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv
And it creates these collections:
- Errata
- UID_ISO_FIPS_LookUp_Table
- daily
- daily_us
- time_series_covid19_confirmed_US
- time_series_covid19_confirmed_global
- time_series_covid19_deaths_US
- time_series_covid19_deaths_global
- time_series_covid19_recovered_global
Note: These collections are not clean and the schema designs are not great to work with because they come from raw and flat CSV files but at least they contain exactly what JHU is delivering.