Scott Morehouse (ESRI) - A Conceptual Model for Placename Geography in GIS


I'll try to generate some discussion here through controversy and terminology - those are always good things to get people stirred up about. My role at ESRI is directing software development, and my interest in this workshop is really coming up with a conceptual framework for the construction of a gazetteer, and the integration of gazetteer services into geographic information systems - how we can build such a conceptual framework and build software tools which implement that, and which our users can apply in their various areas. So that leads me to an interest in the generic solution or a broad solution to the problem of gazetteer services, as opposed to focusing on a particular type of gazetteer or web site or problem domain.

You can start out from an information science point of view. It's just trivial - isn't a gazetteer just the fifteen pages in the back of the atlas? Isn't that just a table with the latitude and longitude and the name, and isn't it just a SQL statement? Isn't that all there is? But then you start down this path and, oh yeah, places have more than one name, and they change in time, and where did that name come from, and so on, and suddenly the simplistic view of the gazetteer goes away and you start adding bags on the side of this gazetteer concept. I find that when I start adding bags onto the side of a concept, we kind of have the wrong concept. If you end up with a lot of special cases in your thinking, there is something wrong. Maybe this is the wrong way to think about gazetteers, or the wrong starting point.

I don't know how many of you were trained in geography, but there was a really boring class that we had to take called the "Philosophy of Geography", where we studied what is where and why, and people would write long books about this type of thing. I never used it, but I actually thought about this kind of stuff in preparing for this meeting. It's about the notion of place. There are geographers who are obsessed with place, and with the idea that place is something special. It's not just a hole that is filled by objects; the place has its own identity apart from features or things that exist in space. The point that I want to make here is that places are not things located in space. A place is a named or described aspect of space. The main point that I want to make conceptually is that places are not features. It's not like God created 40 million places out there and we just went around and named them all, and now it's just a matter of gathering that up in a database somewhere. There are an infinite number of places in space, and there are many ways that we can use to describe those places in space. That's one point about places. There are different ways to talk about places. Places can be the space occupied by a named thing - that's what we typically think about with gazetteers. Like Lake Ontario is a place - is a thing. That's a case where a place is a feature. In fact, when I looked up "place" in the Webster's Dictionary there was a great quote from Kant or someone

(Break in text between tapes)

…about the township range system, for example, and that is a syntactic context, or the same goes for postal addresses which actually have a syntactic context to them.

I think we've already talked about the fact that places have fuzzy locations, like the Rocky Mountains. It is very important that we can do things like answer queries about "where are the Rocky Mountains?" or at least guide people into that. But this is a real interesting challenge.

The other thing is the place name itself or the place description itself is fuzzy. There are alternate ways or representations of the same description - different spellings of the same place name is one example. Here's an example using different combinations of abbreviations and so forth. So here's some examples of places: Rocky Mountain National Park - that's what gazetteers now are really good at, you know, it's a well-defined administrative object. It's a feature class "National Park" or "public land entity" or whatever, and we can type this into a gazetteer or server and actually get there pretty easily. A postal description or a postal address is a place; Los Angeles, California, that's talking about a place, and it may perhaps be an administrative hierarchy, not a postal hierarchy. Zip codes - everyone knows what 90210 is - it's a place. Geographers like to say that places are more than just locations, they're sort of states of mind. So there's a 90210 place, I suppose. There are geographic coordinates in latitude and longitude, and in UTM or in plane systems. There's interesting places like river-naming conventions: the west fork or the north fork or the west fork of the Kern River is a place, or the third fork, the third north fork - that's a kind of place that again has a syntax, has a set of reference information. Bellows Falls, that's an interesting place because it's quite ambiguous, you know, is this an object of type "waterfall", or is this an administrative area, or does this just happen to be a landmark that people call Bellows Falls, even though there's no water running over it?

Township and range legal description is a very important place because all kinds of mining, legal and collection information are gathered this way. It's very important for management of land. I think Tom was saying that chainage or mileage up and down rivers is often found from landmarks. All of EPA's permitting operations and so forth are managed, using place names like this - all kinds of connections are.

"42", that's a great place. That was the answer to the purpose of the universe, I think, and it's also the FIPS code for Wyoming or West Virginia or one of those things. There are also well-defined coding standards that deal with place names. Descriptions involving things like "near" or "south west of" are very important, and then I'd have Mississippi River and Mississippi River Basin in here, just because rivers and river basins and named things like that are very complicated. The Mississippi River is not necessarily all of the lines that drain through the Mississippi Delta. People talk about the Mississippi River, maybe they're talking about navigating the main channel of the Mississippi River, and the Mississippi River Basin is quite distinct also. And there are whole placename hierarchies and descriptive location hierarchies based on basins, where the hierarchy of a drainage basin has been established. Tom also talked about traverses, place names based on traverses. He had real examples. I made this one up, you know, you shot this thing the 14th day of the expedition. You want to be able to take that and translate it into a location.

Telephone numbers are very important place names now; all of the emergency response systems are based on telephone numbers using the 911 - and now with the cellular 911 - and there are people that have huge operational gazetteer systems for registering the coordinates of telephone numbers. Any time a telephone company allocates a phone, gives you a phone number, there's a transaction going to some database somewhere with the street address for that phone number.

Interstate, this is about mileposts - Interstate 10, milepost 233.5 - that's where accidents happen or things get washed out, or signs are put on roads. Then there's cross-streets and so on.

So all of these are types of place names, or placename geographies that people use every day, and how do we think about these things? Is this the role of the gazetteer, to try to build a conceptual model for all of this, or is there a more specialized focus that we have for gazetteers, particularly in administrative names or postal names or whatever?

So one of the interesting questions that I wanted to learn out of this meeting is - how far does this community want to take this and still call it a gazetteer, and what is the role of the gazetteer?

What I wanted to do here is to introduce some terms and stir things up a little bit in terms of putting together a contextual or conceptual framework for how places get located, or how this transformation goes from named location to more explicit location.

So here's a model of it. It starts with the big blue box that might be called the gazetteer; we call it a "locator" in our system architecture. That takes as an input an address which is an object like one of these things, a descriptive set of text that describes a place, city, state, Zip, whatever. That address goes into the locator, and what you want to get out of that is a location. By location I mean the result of the locating function, and that consists of the coordinates of that address, plus the necessary contextual information you might need to interpret that. For example, you type in a street address, 380 New York Street, and you get back a latitude-longitude, and then it might also tell you what town it's in, or what Zip code it's in, or what river basin it's in, some context for that. So location is a little bit broader concept than just a coordinate. So the address or the place description goes in, there's a parsing process that takes that description and breaks it into pieces. It separates out that this part must be the township number, and this part must be the range or this part must be a street prefix and that part must be a Zip code, or whatever. It breaks it into a set of components that are particular to the type of addresses that that locator knows how to work with.
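As a minimal sketch of the model just described, assuming a simple "City, State Zip" address style: the address goes in, a parser breaks it into components, and what comes out is a location, meaning coordinates plus some context. The names Location, parse_address, locate and REFERENCE are hypothetical, and the reference record is illustrative only; this is not the ESRI implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Result of the locating function: coordinates plus context."""
    lat: float
    lon: float
    context: dict = field(default_factory=dict)


def parse_address(address: str) -> dict:
    """Break 'City, State Zip' text into named components.

    The 'City, State Zip' layout is an assumption made for this sketch.
    """
    city, _, rest = address.partition(",")
    components = {"city": city.strip()}
    for key, value in zip(("state", "zip"), rest.split()):
        components[key] = value
    return components


# Hypothetical reference data; the coordinates are rounded and illustrative.
REFERENCE = {
    ("REDLANDS", "CA"): Location(34.06, -117.18, {"county": "San Bernardino"}),
}


def locate(address: str):
    """Parse the address, then look it up in the reference data."""
    c = parse_address(address)
    return REFERENCE.get((c["city"].upper(), c.get("state", "").upper()))
```

Calling locate("Redlands, CA 92373") returns a Location whose context carries the county, which is the sense in which a location is a little broader than a coordinate.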

Those components are then standardized; by that I mean, pieces of that name are standardized, using thesauri: "Blvd" is boulevard and "Cir" is circle and "N" probably means North in the case of street standardization, street address standardization. That's based on some standard address components. Now we have a set of components, and we can use those components to actually construct a search into a reference file of named places, of named things and relationships. So in the case of street address matching where you would have a postal address, you'd go and search by Zip code and you'd search by street name and you'd search by address range and you pull those things together. As a result of a search, you end up with a list of candidates, candidate locations - that address could mean this or it could mean that, it could mean the other thing.
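The standardize-then-search step might look something like the following sketch. The thesaurus entries come from the examples above; the reference records, ZIP codes and address ranges are made up for illustration, and standardize and search are hypothetical names.

```python
# Thesaurus of street address abbreviations. "N" is only *probably* NORTH,
# so a real standardizer would need to handle that ambiguity.
STREET_THESAURUS = {
    "blvd": "BOULEVARD",
    "cir": "CIRCLE",
    "n": "NORTH",
    "st": "STREET",
}


def standardize(tokens):
    """Replace known abbreviations; upper-case everything else."""
    return [
        STREET_THESAURUS.get(t.lower().rstrip("."), t.upper())
        for t in tokens
    ]


# Made-up reference file of named things (street segments with ranges).
REFERENCE = [
    {"name": "NEW YORK STREET", "zip": "11111", "low": 300, "high": 398},
    {"name": "NEW YORK STREET", "zip": "22222", "low": 300, "high": 398},
    {"name": "NEW YORK CIRCLE", "zip": "11111", "low": 1, "high": 99},
]


def search(name, number, zip_code=None):
    """Return every reference record that could hold this address."""
    return [
        r for r in REFERENCE
        if r["name"] == name
        and r["low"] <= number <= r["high"]
        and (zip_code is None or r["zip"] == zip_code)
    ]
```

Without a Zip code, search("NEW YORK STREET", 380) returns two candidates, which is the kind of ambiguity the scoring step then has to resolve.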

Another example of a locator search would be searching a database of section corner coordinates to establish coordinates for an object based on a public land survey object. So that's this database of named things, it's closely related to the searching job.

Once we get a set of candidates - in some cases, in the simple cases, there's just one answer and it's not a big deal - but in a lot of cases there's ambiguity in descriptions and there's ambiguity in the database of named things and there's ambiguity in the addresses, there's more than one possible fit, because of misspellings or whatever. Given that set of candidates, we then assign them scores, and we choose the highest score and present that one as our best guess. In the case of automated locator services, you don't want to have the user go through a long rigmarole for each and every record they're trying to address match or locate. You'd like it to say, this is so bang on, I'm just going to do it automatically and move on to the next one.
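A rough sketch of scoring and automatic acceptance, under the assumption that plain string similarity stands in for whatever scoring a real locator uses. Python's difflib is swapped in here purely for illustration, and the threshold of 90 is an arbitrary choice.

```python
from difflib import SequenceMatcher


def score(query, candidate):
    """Similarity of two names on a 0-100 scale (case-insensitive)."""
    ratio = SequenceMatcher(None, query.upper(), candidate.upper()).ratio()
    return round(100 * ratio)


def pick(query, candidates, auto_accept=90):
    """Rank candidates and say whether the best one is 'bang on'.

    Returns (best, ranked, accepted). When accepted is False the caller
    would fall back to interactive review of the ranked list.
    """
    ranked = sorted(((score(query, c), c) for c in candidates), reverse=True)
    if not ranked:
        return None, [], False
    best_score, best = ranked[0]
    return best, ranked, best_score >= auto_accept
```

An exact match scores 100 and is accepted without user involvement; anything below the threshold is handed back with the ranked list so the user can look at the near misses.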

In other cases it is so vague that you really do have to bring up a map and let them draw on it and explore the information, but the goal is to be able to score the candidates and then pick the location. So let me give an example from the Santa Barbara gazetteer. For example, if I were to type in "Los Angeles" in the Santa Barbara gazetteer, I'd get back about 200 or 75 results. Apparently in Bolivia and in Brazil, they like to name things Los Angeles, so they're all there and they're in alphabetical order or they're in record accession number order, or whatever.

If I go through that and say, well, that is dealing with things at this level. If I were to do that same problem in that context, I would type in "Los Angeles comma California".  The parser would know there are two place name components, the Los Angeles place name and the California component, and there's an implied containment within that. It would standardize CA or whatever to be California, and then we'd search for both Los Angeles, we'd search for California and we'd say OK, there's these eight things that hit both, and then we'd score them and then we'd assign the highest score, perhaps because maybe it knows that we're dealing with the administrative style of address, it assigns the highest score to the Los Angeles city, rather than to the Los Angeles high school or gymnasium or whatever and returns that as our preferred location.
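The "Los Angeles comma California" example could be sketched roughly as follows. The three records, the abbreviation table and the type weights are all invented for illustration; the point is only that parsing out the containing place and preferring administrative features brings the city to the top.

```python
# Hypothetical abbreviation thesaurus and a tiny database of named things.
STATE_ABBREV = {"CA": "CALIFORNIA", "CALIF": "CALIFORNIA"}

PLACES = [
    {"name": "LOS ANGELES", "type": "city", "within": "CALIFORNIA"},
    {"name": "LOS ANGELES HIGH SCHOOL", "type": "school", "within": "CALIFORNIA"},
    {"name": "LOS ANGELES", "type": "populated place", "within": "BOLIVIA"},
]

# An administrative style of address prefers administrative features.
TYPE_WEIGHT = {"city": 3, "populated place": 2, "school": 1}


def locate(text):
    """Parse 'name, container', standardize, search, then rank the hits."""
    name, _, container = (part.strip().upper() for part in text.partition(","))
    container = STATE_ABBREV.get(container.rstrip("."), container)
    hits = [
        p for p in PLACES
        if p["name"].startswith(name)
        and (not container or p["within"] == container)
    ]
    # Exact name matches first, then by feature-type weight.
    return sorted(
        hits,
        key=lambda p: (p["name"] == name, TYPE_WEIGHT.get(p["type"], 0)),
        reverse=True,
    )
```

With the containing place supplied, the Bolivian record drops out and the city outranks the high school; with "Los Angeles" alone, all three records come back.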

I guess our thinking in terms of building this into the GIS framework is that we can generalize this pattern or this framework, and support a variety of different location services or gazetteer services on this, and there is a lot of specialized logic or customization that needs to take place, so we talk about there being many types of these blue boxes, many types of locators. I have a US street locator, I have a township range locator, or whatever, a 1:10 million international atlas locator, and each of those locators has some specialized logic that it uses for parsing and searching and so on. The work on gazetteer standards can exist at a couple of different levels; one is, we can talk about defining the interfaces from the blue box to the outside. The other one is, we can talk about defining these interfaces, because there really is a value in having standard databases of named things, independent of whether you're using those named things in the context of just an atlas-type gazetteer or in the context of a postal one. An example we have is that we do international geo-coding, where you type in a street address and you're supposed to figure out where that is in the world, and one of the things that it has to do is to parse French-style addresses and so on.
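One way to picture "many types of blue boxes" on a common framework is an abstract base class whose subclasses supply the specialized parsing and searching. This is only a sketch of the idea; the class and method names are hypothetical and do not reflect the actual ESRI interfaces.

```python
from abc import ABC, abstractmethod


class Locator(ABC):
    """Common framework: parse, search, score, return ranked candidates."""

    @abstractmethod
    def parse(self, address):
        """Break the address text into components."""

    @abstractmethod
    def search(self, components):
        """Return candidate locations for the components."""

    def score(self, components, candidate):
        """Default scoring: every candidate is equally good."""
        return 1.0

    def locate(self, address):
        components = self.parse(address)
        candidates = self.search(components)
        return sorted(
            candidates,
            key=lambda c: self.score(components, c),
            reverse=True,
        )


class XYLocator(Locator):
    """Trivial locator for 'x, y' text."""

    def parse(self, address):
        x, y = address.split(",")
        return {"x": float(x), "y": float(y)}

    def search(self, components):
        return [(components["x"], components["y"])]


class NameLocator(Locator):
    """Trivial placename locator backed by a name-to-coordinate table."""

    def __init__(self, names):
        self.names = names

    def parse(self, address):
        return {"name": address.strip().upper()}

    def search(self, components):
        xy = self.names.get(components["name"])
        return [xy] if xy is not None else []
```

A township-range or street locator would plug in the same way, overriding parse, search and score with its own logic while callers keep using locate.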

At some point, it goes down and it wants to deal with all these variant spellings of Cologne, Germany. It would be nice if there were a standard set of various spellings of Cologne that could then be used or embedded in a variety of different locator services. A postal locator service might use that database of named things for postal matching, and an art history locator, or a metes and bounds, or a "40 miles west of Cologne" style locator could also use that same database of named things, and that I guess is the gazetteer as thesaurus kind of notion. If you think of this thing as a fancy thesaurus for named things, and then if we look at the gazetteer as just taking these addresses and doing all of this work and giving us locations, I guess I call that the gazetteer as locating service, as a service, not as a database.
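The gazetteer-as-thesaurus idea can be sketched as a shared table of variant spellings that any locator consults. The variant list below is a small illustrative sample, and preferred_name and parse_offset are hypothetical names; the offset parser only handles the one phrasing shown.

```python
import re

# Shared database of named things: preferred name -> known variants.
VARIANTS = {
    "COLOGNE": {"COLOGNE", "KÖLN", "KOELN", "KOLN"},
}

# Reverse index from any variant to its preferred name.
INDEX = {v: preferred for preferred, vs in VARIANTS.items() for v in vs}


def preferred_name(name):
    """Map any known spelling to the preferred name, else None."""
    return INDEX.get(name.strip().upper())


_OFFSET = re.compile(
    r"(\d+(?:\.\d+)?) miles (north|south|east|west) of (.+)",
    re.IGNORECASE,
)


def parse_offset(text):
    """Parse '40 miles west of <place>' using the shared thesaurus."""
    m = _OFFSET.fullmatch(text.strip())
    if m is None:
        return None
    return float(m.group(1)), m.group(2).lower(), preferred_name(m.group(3))
```

A postal locator and an offset-style locator could both call preferred_name, so the variant spellings are maintained once rather than inside each service.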

So here are some of the questions that come out of this. I was just asking how much of the parsing and searching and scoring is really the job of the gazetteer, and how much do we say that's the job of a locator service that sits on top of a gazetteer conceptually? Another way to look at that is, are we developing gazetteer standards that are intended for the developers of location services or the developers of websites, that you type things in and want to get answers back? Or are we talking about gazetteer standards that are intended for direct use by end-users, where they take on the responsibilities of doing this parsing and scoring and so forth? So that's one kind of scope for questions about gazetteers that I have in my mind anyway.

Another question is, how much context do we need to have in names? You can go forever with building a web of concepts and context for your names database, and I've seen that trend in some of the gazetteer designs. Where do you start? To use the example SFO, which is the airport in San Francisco - is it in San Francisco? Is it in the Bay Area? Is it near the delta? These green "is in" "is near" are actually explicitly modeled in the gazetteer. Well, there are a lot of those, a lot of web of related things and an interesting question is - how much of that context do we explicitly model in the database of the gazetteer, and how much of that do we just leave to the fact that names map to coordinates and we can use coordinates to do all this?

A good example again from working with the Alexandria gazetteer is - the Alexandria gazetteer doesn't explicitly tell you what watershed places are in, but you can go to the watershed from the place in the Alexandria gazetteer - but it ends up doing a spatial search to try to make that mapping. So, as designers or developers of gazetteers, I guess my point of view to stir discussion is that we should have the minimal amount of context that is necessary in the gazetteer, and that context should be relevant to the addressing style of the gazetteer. For example, a gazetteer supporting postal addresses should provide context in the postal hierarchy, not necessarily context in the sales territory for Acme Corporation's hierarchy. But that's an interesting question.
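Deriving context from coordinates rather than storing it explicitly can be sketched with a point-in-polygon test. The two square "basins" below are invented, and the ray-casting test is a standard textbook technique used here for illustration, not a description of how the Alexandria gazetteer does its spatial search.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical watershed boundaries in arbitrary planar units.
WATERSHEDS = {
    "Basin A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Basin B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}


def watershed_of(x, y):
    """Find the containing watershed by spatial search, or None."""
    for name, boundary in WATERSHEDS.items():
        if point_in_polygon(x, y, boundary):
            return name
    return None
```

The gazetteer then only needs to map names to coordinates; "is in" relationships outside its own addressing style can be computed on demand.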

I guess this leads to - how much do we try to find a one size fits all standard in what we're doing, or do we develop a shared notion of a broad reference model, or a set of terminology or what have you and then we have specialized implementations within that context.

So there are some questions for discussion and some other things to hopefully stir things up a little bit. Thanks.

Mike Goodchild - Thanks Scott. Any questions? Fred?

Fred Broome (US Bureau of Census) - Scott, when you showed your little green box up there, and then you brought down the database of named things, I got distracted trying to make notes. Did you imply that the yellow ellipse could be external to the blue box, and it would have its own set of needs and standards and rules etc. that we could address; it really doesn't have to be inside the location service per se, as long as the interface could be standard?

Scott - Yes, that's exactly my point. That we could agree on some standards for the yellow box that are independent of address matching, and I guess to use an example in the case of address matching, which is an area that a number of us in the room are familiar with, the yellow area would be like the DIME file or the Tiger file, it used to be called the address coding guide, and that is actually an independent component of an address matching service, and we can agree upon some ways to represent and exchange that kind of data independent of the rules that we would actually use to parse and search addresses.

Doug Yanega (UC Riverside) - This reminds me of something that hasn't been mentioned, at least not that I've heard yet, and that is, you have your little arrow coming back out, which is your feedback for your user or your client, and it seems to me that this is something that has got to have an aspect that is designed or customizable by your user or your client. For example, if all I want back is a UTM thing, then I should be able to minimize the feedback that I'm getting from this program to speed up the process from my point. I don't want to get a screenful of output every time I enter an address, I want just one thing and sometimes I may want that screenful, so that's just something that strikes me has to be kept in mind for people working on this  - where the user defines what they're getting back out of the gazetteer as well, and I haven't heard anybody explicitly describe where that fits in.

Scott - Yes, I think that's very important. As I was saying, the best goal would be to just go and put in the address or the place description and you end up there, or you match your collection element to that. If you can't go there, you need to do what in geo-coding terms is called reject processing, or analyzing the near misses, and a part of that is this notion of assigning scores to matches. If it's perfect, it's just done; if it's imperfect, then you're presented with the best possible choices, but you also end up wanting to have visibility into other aspects of the system, like you may see, oh gosh, it parsed my township range into the wrong stuff, I want to just see how it interpreted what I put in and I'd like to work with it in a structured form, and do some exploring around through the relationships in the database also. But yes, there is an interesting part about how we define an interface to this blue box, or green box, that also supports the user interface requirements.

Mike Goodchild - Scott, I have one question. You imply in a couple of things you said that ESRI is actively involved in building something. Can you flesh that out at all?

Scott - Well, where this fits in is really in the context of our geographic database model. Conceptually we have locators, and a location service is a component of a geographic database, so just like a database it contains layers of features or imagery or whatever. It also contains locators, location services. A location service is essentially implemented or defined to be this blue box. It takes addresses and generates locations and has a set of interfaces that you would use in designing a user interface around that. We've implemented a number of different locator objects or location services, one that deals with U.S. address matching against a compressed CV, another that deals with U.S.-style street address matching against database records in a standardized format that are this database of named things, and we also have a locator, a very trivial placename locator that just looks for things by matching names, and we have something called a latitude-longitude locator that takes a string column, which has degree signs or whatever, and parses that, and an x-y locator, but in doing all those kinds of locators we've implemented them to a common framework, and the notion is that that framework could be extended and you could build other more specialized locators. It's a non-trivial job to do but we would plug in a township-range locator in the same way, and that's part of the development work we've been doing as part of the next generation of ARC-Info work.
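A latitude-longitude locator of the kind mentioned, one that takes strings "with degree signs or whatever", might parse its input along these lines. The accepted formats (degrees with optional minutes and seconds, followed by a hemisphere letter) are an assumption of this sketch, and parse_lat_lon is a hypothetical name.

```python
import re

# Degrees, optional minutes and seconds, then a hemisphere letter.
_DMS = re.compile(
    r"""(\d+(?:\.\d+)?)\s*°\s*          # degrees
        (?:(\d+(?:\.\d+)?)\s*['′]\s*)?  # optional minutes
        (?:(\d+(?:\.\d+)?)\s*["″]\s*)?  # optional seconds
        ([NSEW])                        # hemisphere
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_lat_lon(text):
    """Return (lat, lon) in signed decimal degrees, or raise ValueError."""
    values = {}
    for deg, minutes, seconds, hemi in _DMS.findall(text):
        value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
        hemi = hemi.upper()
        if hemi in "SW":
            value = -value  # south and west are negative
        values["lat" if hemi in "NS" else "lon"] = value
    if set(values) != {"lat", "lon"}:
        raise ValueError(f"could not find both latitude and longitude in {text!r}")
    return values["lat"], values["lon"]
```

Because the hemisphere letter decides which value is which, the same function handles "34° 30' N, 117° 15' W" and longitude-first strings.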

Linda Hill - Scott, Allen said that you were co-operating to build a gazetteer-type component for the interactive National Geographic Atlas, and that you were doing that by combining the two geographic name gazetteers. Could you say something about that?

Scott - Yes, I'm sorry to say I'm not real familiar with what's going on in the details of that project. That's being done in web time and I think they're like here. They're building something that will work. That's not being done in the context of this information framework. It's being done in another context. It's again a locator service but I don't think it's so generalized. The data sets that we're aggregating together to support this are the USGS, GDT and NIMA postal geographies and placename geographies, and we're building an interface as a web service where you pass an address and it passes back to you a structure of possible candidates, and then that's used by other elements in the website itself to either zoom maps there or stick pins in maps there, whatever.

Allen Carroll (National Geographic) - Just to add a little interesting aside to that, apparently we'll also have the ability to keep track, keep a log of the names that users don't find, so that if our gazetteer is imperfect or missing a certain category of, say, a point of interest or something rather than a town, we'll keep a log. If we see multiple entries in certain categories or certain names, we'll know we have a particular problem to fix, which is a nice little feature.
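The log of names that users don't find could be as simple as a counter keyed by category and name. This is a sketch under that assumption; MissLog and its methods are hypothetical and say nothing about how the National Geographic site actually does it.

```python
from collections import Counter


class MissLog:
    """Counts lookups that found nothing, grouped by category and name."""

    def __init__(self):
        self.misses = Counter()

    def lookup(self, gazetteer, name, category="unknown"):
        """Look up a name; record a miss when the gazetteer has no entry."""
        key = name.strip().upper()
        hit = gazetteer.get(key)
        if hit is None:
            self.misses[(category, key)] += 1
        return hit

    def recurring(self, min_count=2):
        """Names missed at least min_count times, most frequent first."""
        return [
            (key, count)
            for key, count in self.misses.most_common()
            if count >= min_count
        ]
```

Repeated misses in one category, say points of interest, then show up at the top of recurring() as a gap worth fixing.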

Dan Cole (Smithsonian) - Our problem often with our collections, like the Natural History Museum up in New York, is primarily with the older collections, the 19th Century collections, and there you're dealing with questions of just where is "near" to such and such a place, and can you logically, but granted arbitrarily, define through some sort of algorithm and within a certain amount of walking distance, what "near" means?

Another question, when we've got place names that were defined by 19th Century explorers that have never been used since, they're not even colloquial place names, they're just something that was named by somebody who was an explorer, but hasn't been used by anybody at any other time before or since or during that time period.

Scott - Well, I'll say something about that. I'm sure that's a real interesting question. One thought that came to mind is that a lot of these places existed in the context of an expedition, and you don't necessarily need to have a comprehensive placename geography of Central America for looking at canal survey work. What you need to do is actually go in and do the research for the little world that the Panama Canal people lived in, and their train stations and their camp numbers and so on, and actually build a gazetteer, perhaps specifically for a field station or for an expedition, and then build another one for another expedition or another context of place names, and perhaps things like big traversing expeditions where they wander round, spending the time to go through and do a little monograph as it were on here's the time line for that expedition as a starting-point, and then you can go and use those as sources. That might be more tractable than trying to build a universal system.

Mike Goodchild - Kate, do you want to comment on "near"?

Kate Beard (Univ. of Maine) - "Near" is a real problematic term and it's very context-dependent, and one of the things we've been looking at is - you can have qualitative descriptions of "near", to separate "near" and "far" and that is contextually defined, but you need to look at what's the context of the query, and I think in your case it's really interesting to think about the historical and technological context of what's "near" - I mean, is it a day's hike, in terms of building a context of saying, OK, "near" is a fuzzy context that we define qualitatively.

Fred Broome - Kate, are you saying that Max is suggesting that, let us say, a custom gazetteer that you might make up for an expedition as Scott stated would have a rule defining "near" for that expedition, and it might be a different rule defining "near" for somebody else's gazetteer for a different set of data?

Mike Goodchild  - There's another twist to that, which is that it depends who's asking and where they are, so an electronic gazetteer should have an idea of who's asking.

Fred Broome - To push that, if I were sitting down in my home in Augusta, Georgia, at that point Reston is near Washington DC. If I'm in Reston, Herndon is near Reston but Washington DC is further.

Mike Goodchild  - If you're inside the Beltway, the context is completely different. (Laughter)

Bob Hoffman (Natural History Museum) - I am a user and practitioner of some of these things. I have traced some of the Przhevalsky expeditions across Tibet and the Kozlov expedition, and that simply won't work, because for those important points that you're trying to find, there is no way of determining what Kozlov or Roborovsky or Przhevalsky had in mind. It might work prospectively as a standard for new expeditions going out, but for any expedition that is in the past it simply doesn't work, and to attempt arbitrarily to define "near" is just opening a whole can of worms. "Near" ought simply to remain as "is near" - near Tunaville, you know where Tunaville is or was, probably, so you want to geo-code it, fine-code it as Tunaville, with "near" as a modifying adjective and leave it at that. I don't think there ought to be any attempt to arbitrarily define these things.

Mike Goodchild - OK the last person in the session is Lee Hancock. Steve Smyth from Microsoft was hoping to be here, unfortunately he's not, but we're very grateful to him for sponsoring last night's reception, so we feel that we've benefited from his absence if not from his presence. Lee is with Go2 Systems, and this you recall is the session on the uses of gazetteers and I think Lee has an interesting perspective that we haven't heard before, in terms of the populations of users.