whosonfirst-data

Disclaimer

As of May 2019, the whosonfirst-data repository has split into per-country repositories. You can read more about that change here. While we still track all issues in this repository, the data itself will live in the per-country repositories for the foreseeable future.

Per-country repositories have the following repository naming convention:

whosonfirst-data-admin-{2-char country code}

Meaning administrative data for Mexico, for example, would live in the following repository:

whosonfirst-data-admin-mx

At the bottom of this README, you will find a full list of per-country repositories.
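As a rough illustration of the naming convention above, the repository name can be derived from a country code. This is only a sketch: the function name `admin_repo_name` is made up for this example, and the rule that the code is exactly two ASCII letters, lowercased, is an assumption based on the `mx` example rather than a documented guarantee.

```python
import re


def admin_repo_name(country_code: str) -> str:
    """Build a per-country repository name from a two-character country code."""
    code = country_code.strip().lower()
    # Assumption: codes are two ASCII letters, as in the "mx" example.
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    return f"whosonfirst-data-admin-{code}"


print(admin_repo_name("MX"))  # whosonfirst-data-admin-mx
```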

whosonfirst-data

Who's On First is a gazetteer of places. Not quite all the places in the world but a whole lot of them and, we hope, the kinds of places that we mostly share in common.

A gazetteer is a big list of places, each with a stable identifier and some number of descriptive properties about that location. An interesting way to think about a gazetteer is to consider it as the space where debate about a place is managed but not decided. We call our gazetteer "Who's On First" (or sometimes "WOF" for short).

According to Wikipedia, Who’s on First:

...is a comedy routine made famous by Abbott and Costello. The premise of the
sketch is that Abbott is identifying the players on a baseball team for
Costello, but their names and nicknames can be interpreted as non-responsive
answers to Costello's questions. For example, the first baseman is named "Who";
thus, the utterance "Who's on first" is ambiguous between the question ("Which
person is the first baseman?") and the answer ("The name of the first baseman is
'Who'"). "Who's on First?" is descended from turn-of-the-century burlesque
sketches that used plays on words and names. Examples are "The Baker Scene" (the
shop is located on Watt Street) and "Who Dyed" (the owner is named Who). In the
1930 movie Cracked Nuts, comedians Bert Wheeler and Robert Woolsey examine a map
of a mythical kingdom with dialogue like this: "What is next to Which." "What is
the name of the town next to Which?" "Yes." In English music halls (Britain's
equivalent of vaudeville theatres), comedian Will Hay performed a routine in the
early 1930s (and possibly earlier) as a schoolmaster interviewing a schoolboy
named Howe who came from Ware but now lives in Wye.

Which sort of sums up the “problem” of geo, nicely. It might be easier, perhaps, if we all understood and experienced the world as coordinate data but we don’t, so the burden of “place” and its many meanings is one we trundle along with to this day.

Our gazetteer is absolutely not finished – in terms of both data coverage and data quality – so, in the near-term, you should adjust your expectations accordingly when you approach the data. We are releasing the data now because we believe it is important not just to articulate our goals and intentions around the project but also to back them up with tangible proofs.

Learn more about the Who’s On First data model over at https://whosonfirst.org/docs/.

First Principles

The gazetteer starts from a series of first principles:

Who's On First has an opinion

It is important that Who's On First have an opinion not about any one place but rather about the nature of place itself. It is important for us to know and understand the boundaries of our project in order to know what the project is for and, critically, what the project is not.

Leave as many decisions as possible to the "edges"

The world is a complicated place and we would like the gazetteer to be a project that can support, or act as a scaffolding for, the sometimes contradictory opinions that people have about it. We aim to leave as much meaning or inference, as we can, about a place to individual users and applications. How this will manifest itself in concrete terms remains to be seen but this is a goal we have set for ourselves.

Portability

The canonical source for a place is a text file, specifically GeoJSON with a unique 64-bit numeric ID. This is because all computers speak "text files" and "numbers". Text files can be inspected or updated in any old text editor. Text files can be printed. Numbers are fast and cheap for databases to index.
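To make the "text files and numbers" point concrete, here is a minimal sketch of reading a record with nothing but a JSON parser. The record is trimmed and illustrative: the `id` and the two `edtf:` properties are taken from the New Zealand example later in this README, but the geometry is a placeholder and the helper name `load_record` is made up. Treating the ID as a non-negative value that fits in a signed 64-bit integer is also an assumption.

```python
import json

# Trimmed, illustrative record. The geometry is a placeholder,
# not real Who's On First data.
RECORD_TEXT = """
{
  "id": 85633345,
  "type": "Feature",
  "properties": {
    "edtf:cessation": "u",
    "edtf:inception": "u"
  },
  "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}
}
"""


def load_record(text: str) -> dict:
    """Parse a GeoJSON Feature and check that its ID is a 64-bit-sized integer."""
    feature = json.loads(text)
    if feature.get("type") != "Feature":
        raise ValueError("expected a GeoJSON Feature")
    wof_id = feature.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(wof_id, int) or isinstance(wof_id, bool):
        raise ValueError("expected a numeric ID")
    # Assumption: IDs are non-negative and fit in a signed 64-bit integer.
    if not 0 <= wof_id < 2**63:
        raise ValueError("ID does not fit in a signed 64-bit integer")
    return feature


record = load_record(RECORD_TEXT)
print(record["id"], sorted(record["properties"]))
```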

We use text files because our primary concerns for the data are ease of use, robustness and portability over time. On balance, the benefits of plain old text files outweigh both the costs and, in many cases, the benefits of other formats.

Google's Protocol Buffers, for example, are awesome but require that you install a whole lot of Google on your computer in order to use them. ESRI's Shapefiles are equally awesome and their ubiquity and longevity are a testament to their utility, but they too require bespoke applications for even the most trivial of updates.

That does not mean that plain text or static files are necessarily the optimal choice for delivery or distribution. We will account for that on a case-by-case basis. If we need to pre-process all the data into a smaller and nimbler format for a specific use-case then we will, but you will always be able to access the data as simple text files.

GeoJSON

We use GeoJSON as the primary exchange format for the gazetteer for two interconnected and complementary reasons:

  • It is structured data with the least amount of markup today. If someone creates another markup language with even less scaffolding we might use that instead but for now GeoJSON is a good happy medium.

  • There are lots of tools for working with GeoJSON and, importantly, for converting it into all the other formats that different people use.

Some Very Very (Very Very) Important Caveats

Who’s On First is a work in progress

This means a few things:

  1. Some (maybe even a lot) of the data will be wrong.

  2. Some things are missing. Some things are missing in a known unknown kind of way in which case they’ll be addressed shortly. Some things may still be missing in an unknown unknown kind of way in which case they’ll be addressed as the errors become apparent.

  3. Some (probably most) of the data will change in some way, if only to account for #1.

  4. We have not formalized or finalized the tools for updating all the ancestors or dependencies of a record when that record is updated. This means that in the short-term it is possible there will be inconsistencies between a record and its relations. We’ll get there.

The purpose of releasing the data now is not to sound the trumpets and herald a new dawn of perfect data but rather to give substance to everything we’ve been talking about and to have a meaningful dataset with which to prove or disprove those assumptions and to work through the practicalities of working with that data.

If you don’t have the time or the temperament (personally or institutionally) to deal with a little bit of on-going bad craziness as we work through the issues, diving into the data now is probably premature. We intend to continue working in public and discussing the project openly, so keep an eye on the blog and we’ll let you know as things improve.

Git and GitHub

Don’t get too attached to working with or managing Who’s On First data in GitHub (or Git in general). We haven’t quite figured out the best way of both distributing the Who’s On First data and accepting corrections or suggestions from the community.

Even though the nice people at GitHub continue to do excellent work at making Git easier for a broader population to use, the reality remains that Git is a significant barrier to participation for many people. Absent a more formal decision about an alternative, GitHub at least allows us to point in the general direction of:

  • An open and readily distributed dataset that people can download and work with

  • A way for people to contribute corrections (and general nuance) about a place

  • A way for us to be able to do everything above while still assuring us a measure of authority around the assertions we make about the data

  • Also a way for us to think about how and where we store an audit trail (of sorts) for updates to a place

Git and large files

We have started using git-lfs for managing large files. For example, the record for New Zealand, which contains a very very very very very detailed coastline, is fast approaching the 100MB file size limit for any individual file on GitHub.

You can see the current list of files being managed by invoking the git lfs ls-files command, like this:

$> cd /usr/local/mapzen/whosonfirst-data
$> git lfs ls-files
65ccc4825e * data/856/333/45/85633345.geojson

When you clone this repo the files (managed by git-lfs) only contain metadata, like this:

$> cat data/856/333/45/85633345.geojson
version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893
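If you are processing a fresh clone with a script, it can help to tell these metadata-only files apart from real GeoJSON before trying to parse them. The following is a minimal sketch, assuming pointer files always look like the three `key value` lines shown above. The function name `parse_lfs_pointer` is made up for this example and is not part of git-lfs.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893
"""


def parse_lfs_pointer(text: str):
    """Return the pointer's fields as a dict, or None if text is not a pointer."""
    fields = {}
    for line in text.splitlines():
        # Each pointer line is "key value"; anything else is not a pointer.
        key, sep, value = line.partition(" ")
        if not sep:
            return None
        fields[key] = value
    # Assumption: the version line starts with the spec URL shown above.
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        return None
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    fields["size"] = int(fields["size"])
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["size"])  # 71879893
```

A file for which this returns a dict still needs to be fetched and checked out before it can be read as GeoJSON.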

In order to fetch the file itself you will need to run git lfs fetch and then git lfs checkout. Because computers... but anyway, like this:

$> git lfs fetch
Fetching master
(1 of 1 files) 68.54 MB / 68.55 MB

$> cat data/856/333/45/85633345.geojson
version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893

$> git lfs checkout
(1 of 1 files) 68.55 MB / 68.55 MB

$> cat data/856/333/45/85633345.geojson
{
  "id": 85633345,
  "type": "Feature",
  "properties": {
    "edtf:cessation":"u",
    "edtf:inception":"u",
    "geom:area":29.187792061074827,
    "geom:bbox":"166.426148,-47.289992,178.577244,-33

Woosh! We're still working through the details on this so suggestions, tips and (gentle) cluebats are welcome.
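The file paths in the listings above appear to follow a pattern: the digits of the record's ID are split into groups of up to three, each group becomes a directory, and the file is named after the full ID. The sketch below reproduces that pattern. It is inferred from the single `85633345` example in this README, and the function name `id_to_relpath` is made up, so treat it as an assumption rather than a specification.

```python
def id_to_relpath(wof_id: int) -> str:
    """Map a record ID to a relative path like data/856/333/45/85633345.geojson."""
    digits = str(wof_id)
    # Split the ID's digits into groups of (up to) three characters.
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "data/" + "/".join(chunks) + f"/{digits}.geojson"


print(id_to_relpath(85633345))  # data/856/333/45/85633345.geojson
```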

Theory (or "the even-longer version")

Where appropriate we have moved the theory (and sometimes history) around decisions for specific Who's On First properties into dedicated GitHub repositories. They are:

whosonfirst-dates

whosonfirst-geometries

whosonfirst-names

whosonfirst-placetypes

whosonfirst-properties

whosonfirst-sources

whosonfirst-tests

Blog posts and related musings

All of the blog posts can be found over here: https://whosonfirst.org/blog/.

Practice

The spelunker

There is a read-only "spelunker" for viewing Who's On First online at:

https://spelunker.whosonfirst.org/

Venues

For the time being Who's On First maintains separate repositories with venues and points-of-interest. The starting point for venue data is:

https://github.com/whosonfirst-data/whosonfirst-data-venue-*

Note: The * should be replaced with country code (ex: ca) and/or country code and region code (ex: us-ny).

Repository examples:

License

Crediting Who's On First is recommended and linking back to this License is required.

Data from Who's On First. License.

The Who's On First dataset is both original work and a modification of existing open data. Some of those open data projects do require attribution. We have listed some sources below.

When we source other open data projects we make a best effort to indicate them (e.g.: 'src:geom':naturalearth) and we also include the original source's properties, prefixed with a namespace for that source.

See an up-to-date list of sources here.

Please notify us if you believe that an open data project has not been properly noted.

Our original work is generally indicated with properties prefixed with wof or is not prefixed (like name).
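One way to see which parts of a record come from where is to group its properties by prefix. This is a minimal sketch, assuming the namespace is whatever precedes the first colon in a property name. The function name `group_by_namespace` and the sample values for `name` and `wof:placetype` are hypothetical, while `src:geom`, `geom:area` and `edtf:inception` echo examples that appear elsewhere in this README.

```python
from collections import defaultdict


def group_by_namespace(properties: dict) -> dict:
    """Group property keys by the prefix before the first colon."""
    groups = defaultdict(dict)
    for key, value in properties.items():
        prefix, sep, _ = key.partition(":")
        # Unprefixed keys (like "name") are grouped under the empty string.
        namespace = prefix if sep else ""
        groups[namespace][key] = value
    return dict(groups)


properties = {
    "name": "Example Place",      # hypothetical value
    "wof:placetype": "country",   # hypothetical value
    "src:geom": "naturalearth",
    "geom:area": 29.187792061074827,
    "edtf:inception": "u",
}

grouped = group_by_namespace(properties)
print(sorted(grouped))
```

Under the convention described above, the `wof` group and the unprefixed group would hold original work, and any other prefix would be a candidate for checking attribution requirements.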

Remember, some sources require attribution, some do not. Mapzen's original work, including the format and structure that allows Who's On First to operate, is made available under the Creative Commons Zero designation, and a shout out would be lovely.

Read the full License file for more details per data source.

Caveats and "known knowns"

We've added a separate document called README.KNOWN.KNOWNS.md that lists the current state of known knowns and other gotchas you might encounter working with the Who's On First data.

See also:

Repositories
