Building Georeferenced Collections
Gazetteer Services

Presented by
Linda L. Hill
Alexandria Digital Library Project
University of California, Santa Barbara
lhill@alexandria.ucsb.edu

For
Taxonomy Authority Files Workshop
Washington, D.C.
June 22-23, 1998

[Accompanying PowerPoint slides]

The Alexandria Digital Library (ADL) is one of six four-year Digital Library Initiatives funded by the federal government (NSF, DARPA, and NASA). These projects are now in their last year. The Alexandria Digital Library focuses on georeferenced collections: maps, aerial photos, remote sensing images, text, and so forth. The collections built for the testbed include a Catalog of approximately 750,000 items, a Gazetteer of nearly 6 million placenames, and other smaller collections of bibliographic records (GeoRef), volcanoes, and earthquakes. The Gazetteer contains the combined contents of the two major U.S. federal government gazetteers: the Geographic Names Information System of the U.S. Geological Survey and the GeoNames set from the National Imagery and Mapping Agency. ADL is a research project, but it has also produced an operational service, which will become a component of the newly established California Digital Library (CDL) when the CDL becomes operational by the end of 1998.
 
 

The ADL testbed system is designed in modules: the user interface client, the system middleware, and the database level (metadata and data archiving). It is designed to accommodate multiple metadata and data formats while providing middleware metadata for high-level search parameters that are mapped to the underlying metadata collections. Established metadata standards such as U.S. MARC and the FGDC Content Standard are used to represent the objects in the collections. Interoperability and integration are our focus. For the gazetteer, we have taken the approach of designing a content standard for the representation of gazetteer entries rather than specifically building authority files. The standard can be used both for authority files and for special gazetteers built for local purposes.
 
 

Gazetteers can be defined as dictionaries of named geographic places (also called features and placenames). We further specify that for digital library purposes a gazetteer entry must have the following three attributes at a minimum: a Feature Name, a Spatial Footprint (latitude/longitude coordinates for location), and a Feature Type (category).
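The three-attribute minimum can be pictured as a small record type. The following is a minimal sketch, not the ADL implementation: the class name, the field names, and the choice of a (west, south, east, north) bounding box for the footprint are assumptions made for illustration, and the coordinates are rough placeholder values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    """Hypothetical minimal entry holding the three required attributes."""
    feature_name: str
    # Footprint as (west, south, east, north) in decimal degrees (an assumption).
    footprint: Tuple[float, float, float, float]
    feature_type: str

    def __post_init__(self):
        west, south, east, north = self.footprint
        if not self.feature_name or not self.feature_type:
            raise ValueError("feature name and feature type are required")
        if not (-180 <= west <= east <= 180 and -90 <= south <= north <= 90):
            raise ValueError("footprint must be a valid lat/lon bounding box")


# A point location is stored as a degenerate box whose corners coincide.
# The coordinates are rough, illustrative values only.
entry = GazetteerEntry("Santa Barbara", (-119.7, 34.4, -119.7, 34.4), "cities")
```

Rejecting an entry that lacks any of the three attributes at construction time keeps every stored entry usable for both name-based and footprint-based searching.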
 
 

A gazetteer service is a digital library service that supports descriptive access via feature names and feature type to spatially represented information objects. Examples of the types of services that can be performed are:

An example is a query to a gazetteer service such as "Where is Philadelphia?" where the answer is a footprint on a map showing the area of Pennsylvania where Philadelphia is located.
 
 

The next query may be "What rivers are in the Philadelphia area?" For this, the service would compare the footprint of Philadelphia to the footprints of entries in the gazetteer of the type "rivers" and return a list of rivers whose footprints overlap the Philadelphia footprint.
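The footprint comparison behind such a query can be sketched as a bounding-box overlap test. This is an illustration under stated assumptions rather than the ADL service itself: the `Box` type, the `overlaps` and `features_in_area` functions, and the tiny in-memory gazetteer are all hypothetical, and the coordinates are rough placeholder values, not gazetteer data.

```python
from typing import Dict, List, NamedTuple, Tuple


class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def overlaps(a: Box, b: Box) -> bool:
    """True when two lat/lon boxes share any area or edge (ignores the antimeridian)."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


# Hypothetical mini-gazetteer: name -> (feature type, footprint).
GAZETTEER: Dict[str, Tuple[str, Box]] = {
    "Philadelphia": ("cities", Box(-75.3, 39.9, -75.0, 40.1)),
    "Schuylkill River": ("rivers", Box(-76.2, 39.9, -75.2, 40.8)),
    "Delaware River": ("rivers", Box(-75.6, 38.8, -74.7, 42.0)),
    "Sacramento River": ("rivers", Box(-122.5, 38.0, -121.5, 41.3)),
}


def features_in_area(place: str, feature_type: str) -> List[str]:
    """Names of features of one type whose footprints overlap the place's footprint."""
    _, area = GAZETTEER[place]
    return sorted(name for name, (ftype, box) in GAZETTEER.items()
                  if ftype == feature_type and name != place and overlaps(area, box))


print(features_in_area("Philadelphia", "rivers"))
# prints ['Delaware River', 'Schuylkill River']
```

A bounding box is a coarse footprint for a linear feature such as a river, so a test like this can return rivers that pass near a place without flowing through it; more detailed geometries would reduce such false matches.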
 
 

The user may now want to search the digital library catalog for datasets that are relevant to Philadelphia. The query may be "What remote-sensing images does the library have that overlap the Philadelphia area?" Again a comparison is made between the Philadelphia footprint and metadata for objects of the type "remote-sensing images" in the catalog. The return is the set of remote-sensing images that are "about" Philadelphia, most of which will not actually have the word "Philadelphia" in their metadata.
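The two-step lookup described here, placename to footprint and then footprint to catalog items, can be sketched as follows. Everything in the sketch is hypothetical: the catalog records, identifiers, titles, and coordinates are invented for illustration, and `search_by_place` and `search_by_text` are not ADL functions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (west, south, east, north)


def overlaps(a: Box, b: Box) -> bool:
    """True when two lat/lon boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical gazetteer and catalog; all values are illustrative.
GAZETTEER: Dict[str, Box] = {"Philadelphia": (-75.3, 39.9, -75.0, 40.1)}

CATALOG: List[dict] = [
    {"id": "img-001", "type": "remote-sensing images", "title": "Satellite scene 0417",
     "footprint": (-76.5, 39.0, -74.4, 41.0)},
    {"id": "img-002", "type": "remote-sensing images", "title": "Satellite scene 0932",
     "footprint": (-121.0, 33.5, -118.8, 35.5)},
    {"id": "map-001", "type": "maps", "title": "Street map of Philadelphia",
     "footprint": (-75.3, 39.9, -75.0, 40.1)},
]


def search_by_place(place: str, object_type: str) -> List[str]:
    """Step 1: placename -> footprint. Step 2: footprint -> overlapping catalog items."""
    area = GAZETTEER[place]
    return [item["id"] for item in CATALOG
            if item["type"] == object_type and overlaps(area, item["footprint"])]


def search_by_text(word: str, object_type: str) -> List[str]:
    """Plain keyword match on titles, shown for contrast."""
    return [item["id"] for item in CATALOG
            if item["type"] == object_type and word.lower() in item["title"].lower()]


print(search_by_place("Philadelphia", "remote-sensing images"))  # prints ['img-001']
print(search_by_text("Philadelphia", "remote-sensing images"))   # prints []
```

The contrast between the two results illustrates the point made above: the image is found through its footprint even though its title never mentions the city.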
 
 

Given that you want to create a gazetteer, there are four structural approaches.

If a hierarchical thesaurus model is used, there are two types of hierarchical relationships that can be represented: (1) the whole-part relationship, where Santa Barbara is part of Santa Barbara County, which is part of California, and so on (this is the most frequently applied hierarchy for placenames); and (2) the genus-species relationship, where Santa Barbara "is a" City, which is an Administrative Area (for example). The metadata model accommodates both of these ways of representing placename relationships.
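The two kinds of hierarchy can be kept side by side as separate relationships over the same names. The sketch below uses the Santa Barbara example from the text; the dictionary layout and the `chain` helper are assumptions made for illustration, not part of the metadata model.

```python
from typing import Dict, List

# Whole-part hierarchy: each place points to the place it is part of.
PART_OF: Dict[str, str] = {
    "Santa Barbara": "Santa Barbara County",
    "Santa Barbara County": "California",
}

# Genus-species hierarchy: each place or type points to its broader type.
IS_A: Dict[str, str] = {
    "Santa Barbara": "City",
    "City": "Administrative Area",
}


def chain(start: str, relation: Dict[str, str]) -> List[str]:
    """Follow one relationship upward from `start` until it runs out."""
    path, seen = [start], {start}
    while path[-1] in relation:
        nxt = relation[path[-1]]
        if nxt in seen:  # guard against accidental cycles in the data
            break
        path.append(nxt)
        seen.add(nxt)
    return path


print(chain("Santa Barbara", PART_OF))
# prints ['Santa Barbara', 'Santa Barbara County', 'California']
print(chain("Santa Barbara", IS_A))
# prints ['Santa Barbara', 'City', 'Administrative Area']
```

Keeping the relationships in separate structures lets the same feature be placed in both hierarchies without either one constraining the other.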
 
 

Based on the experience of building the initial ADL Gazetteer, the ADL team has designed a new approach based on the metadata model that we hope will lead to standard representation formats for gazetteer data and thus to integration and interoperability among gazetteer products and services. The ADL Gazetteer Content Standard is accessible through the ADL homepage: <http://www.alexandria.ucsb.edu> (Publications, Metadata in the Documents/Tools section). The relational database model for this Content Standard, developed by Qi Zheng, can be viewed through my homepage: <http://www.alexandria.ucsb.edu/~lhill>. Both of these developments were partially funded by the NASA EOSDIS Project through Hughes Information Technology Systems (now Raytheon).
 
 

The Gazetteer Content Standard has 13 sections. The contents of these sections are briefly described here. Required elements are marked with an asterisk (*); repeatable sections are marked with (R). Each section, with the exception of 1 and 4, can be attributed to a particular contributor and/or source.
 
 

1. Geographic Feature ID*: Unique identifier within the local system for the gazetteer entry.
2. Geographic Name*: The primary name designated for the feature by the local system.
3. Variant Geographic Name (R): Other names for the feature.
For both of these name sections, the following attributes are available for further specification:
4. Type of Geographic Feature* (R): Each feature is assigned one or more feature types (categories); the scheme used for the terminology is also cited.
5. Geographic Feature Code (R): If a code, such as a FIPS code, is available, it is recorded here.
6. Spatial Location* (R): First the type of geometric representation is declared (point, bounding_box, linear, complex object) and then the coordinates are given as a set of points. Each spatial representation can be further described by date range and details of the measurement, such as date, method, and accuracy.
7. Street Address (Physical Address): For those features which have a street address, it can be recorded here.
8. Non-Spatial Relationship to Other Geographic Feature (R): In this section, relationships such as "in the state of" and "is capital of" can be described and each relationship can have a particular date range.
9. Description: A short descriptive statement about the feature can be added in this section.
10. Geographic Feature Data (R): Data associated with the feature, such as elevation or population, can be recorded here.
11. Link to Related Source of Information (R): In this section, online links (e.g., URLs) can be recorded which give further information about the feature. For example, the homepage for a city.
12. Supplemental Note: A place for additional information that can't be recorded elsewhere.
13. Metadata Information: Details about the creation of the entry.
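The required (*) and repeatable (R) markings lend themselves to a simple mechanical check on an entry. The sketch below is an assumption-laden illustration: the dictionary keys, the identifier, and the feature type term are invented here and are not the element names of the Content Standard or of its relational model.

```python
from typing import Dict, List

# Sections marked with an asterisk (1, 2, 4, 6); key names are hypothetical.
REQUIRED = ["feature_id", "name", "feature_types", "spatial_locations"]

# Sections marked (R) (3, 4, 5, 6, 8, 10, 11); key names are hypothetical.
REPEATABLE = {"variant_names", "feature_types", "feature_codes", "spatial_locations",
              "relationships", "feature_data", "links"}


def validate(entry: Dict[str, object]) -> List[str]:
    """Return a sorted list of problems; an empty list means these checks pass."""
    problems = []
    for key in REQUIRED:
        if not entry.get(key):
            problems.append(f"missing required section: {key}")
    for key in REPEATABLE:
        if key in entry and not isinstance(entry[key], list):
            problems.append(f"repeatable section must be a list: {key}")
    return sorted(problems)


entry = {
    "feature_id": "example-0001",  # hypothetical local identifier
    "name": "Santa Barbara",
    "feature_types": [{"term": "cities", "scheme": "example scheme"}],
    # Rough, illustrative coordinates given as (longitude, latitude).
    "spatial_locations": [{"geometry": "point", "points": [(-119.7, 34.4)]}],
}

print(validate(entry))  # prints []
```

Storing each repeatable section as a list, even when it holds a single value, lets an entry carry several feature types or several spatial representations without a change of structure.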
 
 
In addition to the main sections of the Gazetteer Content Standard, there is an accompanying structure for recording the details about the contributors and the sources of the information.
 
 

Each of the gazetteer entries must be categorized by the type of feature it is so that groups of features of a particular type can be identified for a region and so that features with similar names can be distinguished from one another. In establishing its first gazetteer, ADL combined the category schemes used by the two federal gazetteers to come up with a class/type hierarchy. This was a very difficult job and the result was not a true thesaurus of terminology. Therefore, ADL has developed a Thesaurus of Feature Types. It was designed according to the ANSI/NISO Standard Z39.19 ("Guidelines for the Construction, Format, and Management of Monolingual Thesauri"), using MultiTes thesaurus software <http://www.concentric.net/~Multites/>. It currently has 578 terms, of which 195 are preferred terms and 383 are variant or synonymous terms that point to the preferred terms. There are six top terms that form the basic organization of the hierarchies:

Terms were drawn from existing gazetteers and related publications. The Feature Type Thesaurus can be browsed at <http://www.alexandria.ucsb.edu/~lhill/html/index.htm>. Comments and suggestions are welcome.
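The way variant terms point to preferred terms can be sketched as a simple lookup. The terms below are invented for illustration and are not taken from the ADL Thesaurus of Feature Types; the `resolve` function is likewise hypothetical.

```python
from typing import Dict, Optional

# Hypothetical thesaurus fragment; the actual ADL terms are not reproduced here.
PREFERRED = {"streams", "populated places"}

# Variant or synonymous term -> preferred term (the "USE" reference of a thesaurus).
USE: Dict[str, str] = {
    "rivers": "streams",
    "creeks": "streams",
    "cities": "populated places",
    "towns": "populated places",
}


def resolve(term: str) -> Optional[str]:
    """Map any entry term to its preferred term, or None if the term is unknown."""
    key = term.strip().lower()
    if key in PREFERRED:
        return key
    return USE.get(key)


print(resolve("Rivers"))    # prints streams
print(resolve("streams"))   # prints streams
print(resolve("glaciers"))  # prints None
```

Resolving every incoming term to a preferred term before it is stored or searched means that a query for one synonym also finds features that were categorized under another.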
 
 

The current status of the new ADL Gazetteer is that we are just getting started with the process of populating the new relational database. We have already loaded bounding boxes for 3,111 U.S. counties, 50 U.S. states, and 171 countries/continents/regions, and point locations for 1,508 volcanoes. We are developing the conversion rules needed to convert the original ADL Gazetteer to the new format - converting the categories correctly is problematic and will take some manual editing. We have several sets of additional gazetteer data waiting to be converted when we have the hardware and the person power to do it.
 
 

A Master's student in the Computer Science Department, Zheng Wang, has developed a metadata creator tool based on Java and XML and customized it for the ADL Gazetteer Content Standard. We are in the process of further developing this for integration into our digital library system. It incorporates the use of the ADL interactive map to create and display the geographic footprints for the gazetteer entries. It also accesses the Feature Type Thesaurus so that appropriate type terminology can be selected to describe the new entry.
 
 

Our plans call for continuing to populate the new ADL Gazetteer and promoting the wider use of the Gazetteer Content Standard and the Feature Type Thesaurus. Each would benefit from collaborative development by other groups who are actively building and using georeferenced collections of data and information. We are also embarking on a project to mine bibliographic records that are georeferenced with placenames, using the gazetteer to add geographic footprints (coordinate values) to those records. This has the potential to open up the vast number of georeferenced materials represented in library catalogs and online bibliographic files to spatial searching.
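The planned enrichment of bibliographic records can be sketched as a lookup of each record's placenames in the gazetteer. This is only one possible approach and not a description of the project's method: the record layout, the function names, and the decision to take the union of all matched footprints are assumptions, and the coordinates are rough placeholder values.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (west, south, east, north)

# Hypothetical gazetteer keyed by lowercased name; coordinates are illustrative.
GAZETTEER: Dict[str, Box] = {
    "california": (-124.4, 32.5, -114.1, 42.0),
    "santa barbara county": (-120.7, 33.5, -119.0, 35.1),
}


def union(boxes: List[Box]) -> Optional[Box]:
    """Smallest box covering all inputs, or None if there are none."""
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def add_footprint(record: dict) -> dict:
    """Return a copy of the record with a footprint derived from its placenames."""
    matched = [GAZETTEER[name.lower()]
               for name in record.get("placenames", [])
               if name.lower() in GAZETTEER]
    return {**record, "footprint": union(matched)}


record = {"title": "Example report", "placenames": ["Santa Barbara County"]}
print(add_footprint(record)["footprint"])
# prints (-120.7, 33.5, -119.0, 35.1)
```

A plain name match like this ignores the fact that many places share a name; in practice the feature type or the whole-part hierarchy would be needed to choose among candidate entries.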
 
 

Some of the issues that we face in connection with gazetteer development are

Gazetteers are key components of georeferenced information systems. Yet we are not aware of any other efforts to support the integration and interoperability of gazetteer data so that the results of numerous local efforts in creating this information can be shared. If we can create the means to do so, the resulting availability of gazetteer information will bring tremendous gains across the board in all types of information systems. ADL has developed both a Gazetteer Content Standard and a Thesaurus of Feature Types, which we offer freely to others in the hopes that these will encourage standardization and sharing of data. In particular, consideration of shared gazetteer files for taxonomic description should be part of the discussion of taxonomic authority files.