Linda Hill (UCSB) - Foundations of Shareable Gazetteer Data

The Alexandria Digital Library Project at the University of California, Santa Barbara, is focused on building workable prototypes of georeferenced digital libraries. This means that a primary key for every item in the ADL set of collections is a geospatial footprint. A footprint can be as simple as a latitude/longitude point, or as complex as a detailed area boundary representation. ADL is designed so that user queries can also be expressed in terms of geospatial footprints. The search and retrieval mechanisms support spatial matching of query footprints to the footprints of items in the collections, as well as the matching of other search parameters through text search software. This is done in an organized library environment with collection building; metadata creation; generic queries across dissimilar, distributed collections; middleware and client software; and user evaluations. The ADL has become an operational system that will be maintained by the Map & Imagery Laboratory in the UCSB library, and will be a component of the new California Digital Library early next year.
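As a rough illustration of what spatial matching of footprints involves, here is a minimal sketch in Python. It assumes every footprint is reduced to a bounding box (a point is a box with zero extent) and ignores boxes that cross the antimeridian. The class, function, and item names are hypothetical and are not taken from the ADL software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A footprint simplified to a lon/lat box (assumption: no antimeridian crossing)."""
    west: float
    south: float
    east: float
    north: float

    def overlaps(self, other: "BoundingBox") -> bool:
        # Two boxes overlap when their ranges intersect on both axes.
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


def point(lon: float, lat: float) -> BoundingBox:
    """A point footprint is treated as a box with zero width and height."""
    return BoundingBox(lon, lat, lon, lat)


def spatial_search(query: BoundingBox, items: dict) -> list:
    """Return the ids of items whose footprint overlaps the query footprint."""
    return [item_id for item_id, fp in items.items() if query.overlaps(fp)]


# Hypothetical collection items with illustrative coordinates.
items = {
    "aerial_photo_1": BoundingBox(-120.0, 34.3, -119.6, 34.5),
    "map_sheet_7": BoundingBox(-118.5, 33.7, -118.1, 34.1),
    "field_note_3": point(-119.85, 34.41),
}
query = BoundingBox(-119.9, 34.38, -119.7, 34.46)
matches = spatial_search(query, items)
print(matches)
```

A production system would also need other spatial relations (contains, is contained in) and true polygon geometry; overlap of boxes is only the simplest case.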

A major component of ADL from the beginning has been a gazetteer. We originally built a 5.9 million item gazetteer by merging the two U.S. gazetteers that are developed under the Board on Geographic Names - the GNIS from the USGS and the GNPS from NIMA. In the ADL environment we treated that gazetteer both as a collection that could be searched directly and as a tool for indirect spatial referencing. With indirect spatial referencing, a user can start with the idea of wanting to find information about a place, thinking of that place in terms of a place name, take the footprint found in the gazetteer, and use that footprint to search for items, such as aerial photos, that are about that place but don't say so with a place name. You can find a lot of details about that particular development in the online D-Lib Magazine in the January 1999 issue.
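Indirect spatial referencing can be sketched in a few lines of Python: look the place name up in a gazetteer, take its footprint, and use that footprint to search items that carry no place name at all. The tiny gazetteer, the item ids, and the coordinates below are illustrative assumptions (the Goleta box is approximate), not ADL data.

```python
def boxes_overlap(a, b):
    """Boxes are (west, south, east, north) tuples in decimal degrees."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical gazetteer: place name -> footprint (approximate values).
GAZETTEER = {
    "goleta": (-119.90, 34.40, -119.78, 34.47),
}

# Hypothetical collection items: they have footprints but no place names.
ITEMS = {
    "aerial_photo_42": (-119.88, 34.41, -119.80, 34.45),
    "satellite_scene_7": (-122.60, 37.60, -122.30, 37.90),
}


def items_about(place_name):
    """Step 1: name -> footprint via the gazetteer. Step 2: footprint -> items."""
    footprint = GAZETTEER.get(place_name.strip().lower())
    if footprint is None:
        return []  # unknown name: nothing to search with
    return sorted(item_id for item_id, fp in ITEMS.items()
                  if boxes_overlap(footprint, fp))


print(items_about("Goleta"))
```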

We didn't stop there, however. We took the lessons learned from developing and using the gazetteer in a digital library environment and we developed what we call a content standard - a standard way of representing information about place names. We also developed a feature type scheme so that we could categorize those place names. We have an online gazetteer service. It has been up for about 6 months or so. The next thing we are getting to with this workshop is to involve broader user communities in the issues and development of gazetteers.

Let me say just a few things about the ADL Gazetteer Content Standard. I put a copy of it in your packets so that you could refer to it. It is a general framework for describing geographic places. It is what I call metadata-like rather than thesaurus-like. I mentioned earlier that the Getty has a Thesaurus of Geographic Names. It is a wonderful product. But we did not take the thesaurus route, which is what I call an "authored document", but took the metadata route instead, which is an attribute set and a structure that anyone can use to create a gazetteer entry. If we use a common standard, then we can share information from all of the different sources of gazetteer information. Thesauri are notoriously hard to combine and share because they embed a certain contextual structure. But we can share collections of metadata - it's part of the reason for metadata standards.

Like metadata standards, the ADL Content Standard has core elements. We say that the core elements of a digital gazetteer entry are name, footprint, and type. We also specify that every piece of information in a gazetteer entry be linked back to where it came from - a provenance for each piece of data in the record. We believe that records can be built for one place from multiple sources, so each piece of data should contain within it a link back to its source.

Mike has already mentioned the importance of temporal ranges for data and for linkages. That is provided for in the Content Standard. Beyond the core, it accommodates rich descriptive information - historical information, for example. For a place name, we can describe what language it is in, what the etymology of the name is, and so forth. The content standard includes several ways to record rich descriptive information. One thing it does is accommodate various footprint representations, so for one place there can be multiple footprints. They may be of different types - a point, a bounding box, a polygonal boundary, etc. - and they may be from different sources or for different times.
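To make the shape of such an entry concrete, here is a minimal Python sketch of a record with the three core elements (name, footprint, type), a source attached to every piece of data, optional temporal ranges, and multiple footprints for one place. The class names, field names, the place, and the sources are all hypothetical; this is not the actual schema of the ADL Gazetteer Content Standard.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Element:
    """One piece of data in an entry, always carrying its source (provenance)."""
    value: Any
    source: str
    start_year: Optional[int] = None  # None means open-ended
    end_year: Optional[int] = None

    def valid_in(self, year: int) -> bool:
        return ((self.start_year is None or self.start_year <= year)
                and (self.end_year is None or year <= self.end_year))


@dataclass
class GazetteerEntry:
    names: list = field(default_factory=list)
    footprints: list = field(default_factory=list)
    types: list = field(default_factory=list)

    def has_core(self) -> bool:
        """The core elements are name, footprint, and type."""
        return bool(self.names and self.footprints and self.types)

    def names_in(self, year: int) -> list:
        return [n.value for n in self.names if n.valid_in(year)]

    def sources(self) -> list:
        all_elements = self.names + self.footprints + self.types
        return sorted({e.source for e in all_elements})


# A made-up place assembled from two made-up sources.
entry = GazetteerEntry(
    names=[Element("Millbrook", "source_A", start_year=1950),
           Element("Old Mill Town", "source_B", end_year=1949)],
    footprints=[Element(("point", (-100.0, 40.0)), "source_A"),
                Element(("bbox", (-100.1, 39.9, -99.9, 40.1)), "source_B")],
    types=[Element("populated places", "source_A")],
)
print(entry.has_core(), entry.names_in(1900), entry.sources())
```

Keeping the source on each element rather than on the record as a whole is what allows one record to be built from several gazetteers without losing track of who said what.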

There is flexibility in the use of type schemes. When I talk about the type scheme we developed, it is a type scheme we are using and I want to interest you in, but the content standard itself is designed so that any type scheme can be used.

The Content Standard also provides flexibility in specifying the relationships between places. You will see that the content standard provides for a way to say that place A is related to place B. The actual relationship type is something that is filled in. We have only used "IsPartOf" - that is, place A is part of place B - but a whole set of relationships could be expressed using this content standard.
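One simple way to picture this is as a list of (place A, relationship type, place B) statements in which the relationship type is just a filled-in value. The Python sketch below is an assumption about representation, not the standard's actual encoding; it uses "IsPartOf" and follows the chain upward to find every place that contains a given place.

```python
# Each statement says: place A <relationship> place B.
RELATIONSHIPS = [
    ("Goleta", "IsPartOf", "Santa Barbara County"),
    ("Santa Barbara County", "IsPartOf", "California"),
    ("California", "IsPartOf", "United States"),
]


def related(place, relation, relationships=RELATIONSHIPS):
    """Places directly linked from `place` by the given relationship type."""
    return [b for a, rel, b in relationships if a == place and rel == relation]


def containing_places(place, relationships=RELATIONSHIPS):
    """Follow IsPartOf links upward; the `found` list also guards against cycles."""
    found, queue = [], [place]
    while queue:
        current = queue.pop(0)
        for parent in related(current, "IsPartOf", relationships):
            if parent not in found and parent != place:
                found.append(parent)
                queue.append(parent)
    return found


print(containing_places("Goleta"))
```

Because the relationship type is data rather than structure, other types could be added to the same list without changing the functions that read it.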

This is, then, a flexible standard, one that we would like to see others adopt. We could immediately provide access to gazetteer collections developed in accordance with this standard.

Now let's talk a little bit about feature type schemes. During four years of experience with the Alexandria Digital Library, we have found that type has been a key attribute for both description and searching. Users will typically use for their queries a geographic location and a type and a format from the search parameters that we provide for them. With a common type scheme across collections, we are enabling users to do semantic searching across dissimilar collections. We have provided a pick list for types in both our catalog and in our gazetteer. I also want to say, however, that when we bring in sets of data from somewhere else, the most difficult conversion we have to do by far is the type schemes. For those who have reservations about using controlled vocabulary for these things, you are right - it is not easy, and I'm just going to give you a few examples of that.

For the Board on Geographic Names (BGN) folks, I'm not picking on you. I'm just using your schemes as examples. The U.S. gazetteers are some of the most remarkable products we have, and they are the ones we have worked with the most. This is an illustration of the nature of the type scheme problem. The NIMA GNPS is the part of the BGN that covers non-domestic place names. It has these top-level categories. There are only a few of them. They include such categories as "administrative boundary features," "hydrographic features," "area features," "populated place features," etc. So every type that they assign to a place name is in one of these classes. This next set shows examples of the specific types they have. So if you look at the airfield or airport kinds of things, they have "airfield," "airbase," "heliport," "airport," and "abandoned airfield." For the railroad station category, you see a similar pattern. This illustrates the kind of approach that exists in the NIMA part of the BGN. If you look at the GNIS set for domestic places, you will find these kinds of categories. So you can see by looking at these examples that they use two very different ways of approaching categorization, and you can appreciate the problems associated with merging them into one scheme across a merged set.

We came up with a Feature Type Thesaurus. It has six top terms. This is a hierarchical thesaurus - a tool that is used in information science. It is a mature approach to concept representation, and it is based on ISO and ANSI standards. The ANSI standard is Z39.19. It is a very powerful tool for representing concepts and types. Here is an example of what an entry looks like in this thesaurus. This is the category of "harbors." Included is a set of variant names that a user might come up with for things like "harbors." A user can use any of the Used For terms and find out that the term "harbors" has been used to represent this concept. Please don't get agitated about any particular word that you see here. That's not the point. You can disagree with me that "boatyards" are "harbors." The point is that with this structure you can have "lead-in" vocabulary or variant vocabularies for concepts, and you can have all of those available to the user. You are using them to lead the user to the word chosen to represent that type. There are also broader terms. A place of the type "harbor" is also of the type "hydrographic structures" and of the type "manmade features", and can be found that way. There are also related terms to let the user know of any other valid terms in this set that they might want to use for searching - in this example, "marinas," "breakwaters," and "piers." Scope notes can also be present to provide definition.
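The structure of such an entry can be sketched in Python using only the terms mentioned in the talk. The sketch assumes that "harbors" sits under "hydrographic structures", which in turn sits under "manmade features"; that chain, the class, and the function names are assumptions for illustration and not the layout of the actual ADL Feature Type Thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str                                        # the preferred term
    used_for: list = field(default_factory=list)     # lead-in (variant) terms
    broader: list = field(default_factory=list)      # broader terms
    related: list = field(default_factory=list)      # related terms
    scope_note: str = ""                             # optional definition


THESAURUS = {
    "harbors": ThesaurusEntry(
        "harbors",
        used_for=["boatyards"],
        broader=["hydrographic structures"],
        related=["marinas", "breakwaters", "piers"],
    ),
    "hydrographic structures": ThesaurusEntry(
        "hydrographic structures", broader=["manmade features"]),
    "manmade features": ThesaurusEntry("manmade features"),
}

# Index from each lead-in term to the preferred term it points to.
LEAD_IN = {variant: entry.term
           for entry in THESAURUS.values() for variant in entry.used_for}


def preferred_term(word):
    """Lead the user from any known word to the term chosen for the concept."""
    key = word.strip().lower()
    if key in THESAURUS:
        return key
    return LEAD_IN.get(key)  # None if the word is not in the thesaurus


def with_broader_terms(term):
    """The term plus every broader term, so a place can be found at any level."""
    result = [term]
    for parent in THESAURUS[term].broader:
        for t in with_broader_terms(parent):
            if t not in result:
                result.append(t)
    return result


print(preferred_term("Boatyards"), with_broader_terms("harbors"))
```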

The ADL Feature Type Thesaurus has over 200 valid terms that can be used to categorize place names, and close to a thousand lead-in terms that have been identified during the process of merging of gazetteer datasets.

There is a design decision to be made about the flexibility and specificity of the categories, depending on the application. I may decide that "wetlands" is sufficient for our application, but you may decide that's not specific enough for you. What we would really like to have is a thesaurus structure where we can plug more specific sets of terms into more general sets. This is something that Quinn Hart (who is here) has been thinking about a lot and perhaps he will have a chance to say something about it later. If we can work this out, it will be a way for us to have a general framework in common, but more specific terms for other applications. An advantage of using a controlled set of category terms is that we can develop computational methods that are based on a corpus of already classified information that can suggest category terms both to catalogers and to searchers.
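As one very simple picture of such a computational method, the Python sketch below lets the words in already classified place names vote for category terms, then suggests the top-voted terms for a new name. This word-voting approach is a stand-in chosen for illustration, not a method described in the talk, and the small corpus and its category terms are made up.

```python
from collections import Counter, defaultdict


def train(classified):
    """Count, for every word in a classified name, which types it appears with."""
    votes = defaultdict(Counter)
    for name, feature_type in classified:
        for token in name.lower().split():
            votes[token][feature_type] += 1
    return votes


def suggest(votes, name, limit=3):
    """Suggest category terms for a new name, strongest evidence first."""
    tally = Counter()
    for token in name.lower().split():
        tally.update(votes.get(token, {}))
    return [feature_type for feature_type, _ in tally.most_common(limit)]


# Hypothetical corpus of (place name, assigned category term) pairs.
corpus = [
    ("Pine Lake", "lakes"),
    ("Mirror Lake", "lakes"),
    ("Lake Street Station", "railroad features"),
    ("Eagle Harbor", "harbors"),
    ("Safe Harbor", "harbors"),
]
votes = train(corpus)
print(suggest(votes, "Crystal Lake"))
print(suggest(votes, "Unnamed Ridge"))
```

The suggestions are only candidates for a cataloger or searcher to accept or reject; the controlled term set is what makes the counting meaningful in the first place.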

ADL has established a gazetteer server. It is a single point service that accesses our own local collection, and it is integrated into our digital library. We need to move toward distributed gazetteer services, so that we can launch a gazetteer query to multiple gazetteer servers. This will require agreement about ways of handling distributed search and retrieval. The other component we'd really like to have is ingest systems, so that we can support the building of personal gazetteers, departmental gazetteers, and so forth, where people are creating placename records and putting them in a collection for their own purposes. Whether they go up to the authoritative level is not the question. But having a way for people to create this thing called gazetteer information in a shareable way is the objective.
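A distributed gazetteer query can be pictured as sending the same request to several servers at once and merging the answers while recording which server each answer came from. In the Python sketch below the "servers" are in-process functions standing in for remote services; the server names, record layout, and coordinates are hypothetical, and no real protocol or ADL interface is implied.

```python
from concurrent.futures import ThreadPoolExecutor


def make_server(server_name, records):
    """Build a stand-in for a remote gazetteer service with a search function."""
    def search(place_name):
        key = place_name.lower()
        # Tag each hit with the server it came from, so provenance is kept.
        return [dict(r, server=server_name) for r in records
                if r["name"].lower() == key]
    return search


def distributed_search(servers, place_name):
    """Send one query to every server in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        result_lists = list(pool.map(lambda s: s(place_name), servers))
    # pool.map keeps the order of `servers`, so the merge is deterministic.
    return [record for results in result_lists for record in results]


# Hypothetical servers: an authoritative one and a departmental one.
servers = [
    make_server("authority_gazetteer",
                [{"name": "Springfield", "footprint": (-89.65, 39.80)}]),
    make_server("department_gazetteer",
                [{"name": "Springfield", "footprint": (-72.59, 42.10)},
                 {"name": "Field Site 12", "footprint": (-119.85, 34.41)}]),
]
hits = distributed_search(servers, "Springfield")
print([(h["server"], h["footprint"]) for h in hits])
```

This only works because both stand-in servers return records in the same shape, which is the kind of agreement about distributed search and retrieval that would be needed among real gazetteer services.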

The last point is the notion of gazetteer user communities and applications, and that's really a large part of why this workshop exists and why all of you have been invited. Almost every one of you represents some unique point-of-view. You have a community behind you, and you are the representative here for that application area. We have people here who are in the business of and responsible for establishing geographic name authorities. We have entrepreneurs who are creating products that have gazetteer components. We have geospatial data standards developers. There are people here who are expert and up-to-date about what the GIS-type standards are, and how a gazetteer standard might fit into what exists or into what is being developed. We have information system researchers and developers from clearinghouses, digital libraries, bibliographic indexes, and so forth. We have researchers here who are versed in information retrieval and how retrieval effectiveness is measured, about what is needed in a user interface, about how to describe the resources for effective retrieval. GIS researchers and application developers are also here, and those of us who are thinking about policy impacts and issues and implications of spreading gazetteer use.

Thank you.