Linda Hill
Alexandria Digital Library
University of California, Santa Barbara
Presentation given at the Metadiversity Symposium
Accompanying slides are at http://www.alexandria.ucsb.edu/~lhill/alex-imp/metadiversity/sld001.htm (navigation buttons on the left)
Alexandria Digital Library (ADL)
http://www.alexandria.ucsb.edu
ADL is a georeferenced digital library, one of the six Digital Library Initiative projects funded by the NSF, DARPA, and NASA. The funding period has now ended and ADL is in the process of becoming an operational component of the UCSB library system and the California Digital Library that will have its public presence at the end of 1998 or the beginning of 1999. Access to ADL is currently limited to the University of California campuses because we have limited computer equipment and help staff and cannot open it up to everyone.
All kinds of information can be georeferenced with latitude and longitude coordinates. You might think primarily about maps, aerial photographs, and remote sensing images but text, specimen collections, music, people, and gazetteer entries can also be georeferenced to a place that they are about.
A look at a couple of screen shots from the ADL client will illustrate the users’ view of ADL. The first slide shows the Map Browser, the Help Window, the Search Window, and the Workspace Window open in the midst of a search. In the Map Window, the user has drawn a query area to indicate the area of the world that he or she is interested in. In the Search Window, the user is given the choice of two collections (in this example) and the user has chosen the "ADL Catalog" to search. The Workspace Window shows a portion of the results of the search. These are items whose geographic footprints overlap the user’s query area. One of the items in the list is highlighted and a thumbnail image of the data is displayed. This illustration does not show the related item footprint. We will look at another view of the interface to see that feature.
The next slide shows the same set of windows (plus the Query Status Window) for a different search. Here we can see a set of item footprints in the Map Browser that correspond to the results set that is partially shown in the Workspace Window. Again, one item in the list is highlighted and a thumbnail view of the item is shown. At the same time, the associated footprint for the item is highlighted in red in the Map Browser. The user can also click on a particular footprint in the Map Browser and view the associated item in the Workspace Window. The Search Window in this illustration shows a summary of the search that was submitted. The collection chosen was the ADL Catalog. The user specified a query area and further limited the search to certain types and to certain formats (shown as "Available-As Types"). The user can save the status of the Workspace and reload it at a later time.
The next slide shows a diagram of the ADL system architecture that supports what you see in the user interface. The system consists of three levels: the database level, the middleware level, and the client level. The middleware is the heart of the system. Multiple clients can be developed for it; the interface that I showed you is a Java client, which we call JiGi (Java Interface for Georeferenced Information). On the other side, the middleware interfaces with the database level. There can be multiple collections, held locally or remotely, and the collections can have their own structures. The databases present query and retrieval views of the collection objects to the middleware. The middleware dynamically discovers which collections are available for searching, and how they can be searched, and presents these to the user through the client. User-created queries are fanned out by the middleware to the queryable parameters of the collections through appropriate retrieval software. The results are merged and presented back to the user.
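The fan-out and merge step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ADL middleware itself: the Middleware class, its method names, the bounding-box layout (west, south, east, north), and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Bounding box as (west, south, east, north) in decimal degrees (assumed layout).
BBox = Tuple[float, float, float, float]


@dataclass
class Item:
    collection: str
    title: str
    footprint: BBox


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share any area or edge (ignores the antimeridian)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Middleware:
    """Hypothetical stand-in for the middleware level."""

    def __init__(self) -> None:
        # Maps a collection name to the search function that collection exposes.
        self._collections: Dict[str, Callable[[BBox], List[Item]]] = {}

    def register(self, name: str, search: Callable[[BBox], List[Item]]) -> None:
        self._collections[name] = search

    def available(self) -> List[str]:
        """Collections a client could offer to the user."""
        return sorted(self._collections)

    def query(self, area: BBox, chosen: List[str]) -> List[Item]:
        """Fan the query out to each chosen collection, then merge the results."""
        merged: List[Item] = []
        for name in chosen:
            merged.extend(self._collections[name](area))
        return sorted(merged, key=lambda item: item.title)


def make_collection(name: str, records: List[Tuple[str, BBox]]):
    """Wrap a list of (title, footprint) records as a searchable collection."""
    def search(area: BBox) -> List[Item]:
        return [Item(name, title, fp) for title, fp in records if overlaps(fp, area)]
    return search


mw = Middleware()
mw.register("ADL Catalog", make_collection("ADL Catalog", [
    ("Aerial photo, Goleta", (-119.9, 34.4, -119.8, 34.5)),
    ("Map, Philadelphia", (-75.3, 39.9, -75.0, 40.1)),
]))
mw.register("Gazetteer", make_collection("Gazetteer", [
    ("Goleta", (-119.83, 34.44, -119.83, 34.44)),
]))

results = mw.query((-120.0, 34.0, -119.0, 35.0), mw.available())
```

Each collection keeps its own internal structure; the only thing the sketch asks of it is a search function, which loosely mirrors the idea that databases present query and retrieval views to the middleware.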
Another view of ADL structure is the metadata view. Here we start with a data set that represents its objects through full object metadata. ADL does not specify what the metadata at this level needs to be; it can be whatever is suitable for the collection objects. In practice, creating metadata for the data is often a labor-intensive step. Object-level metadata is mapped to ADL in three ways: (1) to the middleware search buckets and some additional scan attributes; (2) to the access report; and (3) to the full metadata report.
The search buckets provide a few high-level search parameters designed to search across diverse collections. This is somewhat like the Dublin Core approach of identifying core elements for description but the ADL search buckets are designed for searching. The ADL search buckets are
The access report provides the links to the actual data set if it is online, or to the point of contact if it is offline. It also provides information about any constraints for accessing or using the data and sometimes links to related information.
The full metadata report is a report containing attribute labels and values from the object-level metadata. ADL has created a style for these that provides a common look for the metadata from the various underlying collections.
Metadata Developments: Collection-Level Metadata
The key to accommodating multiple collections, and multiple types of collections, in this search environment is collection level metadata that describes the collections and how they can be searched. The collection level metadata gives the title and the ID for the collection, the search buckets populated, and the controlled lists and controlled vocabularies associated with the collections. It also gives a collection description of two kinds:
Contextual metadata is provided by the collection owner and includes such information as the purpose and description of the collection, its frequency of update, any constraints to its use, and contact information for the responsible person.
ADL uses collection metadata for two purposes: collection registration and user documentation. Collections are made known to the ADL middleware through an XML version of the collection metadata. The middleware dynamically discovers which collections are available for presentation to the user through this method. This metadata also tells the middleware which search buckets are active for any particular collection and what the controlled domains are for the associated buckets. All mappings that are necessary for accessing the collection are contained in this XML registration version. User documentation is an HTML version of the collection metadata. This is displayed to the user on request.
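To make the registration idea concrete, here is a minimal Python sketch of reading an XML collection record and extracting the active search buckets with their controlled terms. The element names, attribute names, and bucket names are invented for illustration; the actual ADL registration schema is not given in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical registration record; element, attribute, and bucket names are
# assumptions made for this sketch, not ADL's real schema.
REGISTRATION_XML = """
<collection id="adl_catalog">
  <title>ADL Catalog</title>
  <bucket name="geographic-location" active="true"/>
  <bucket name="type" active="true">
    <term>maps</term>
    <term>aerial photographs</term>
  </bucket>
  <bucket name="date" active="false"/>
</collection>
"""


def read_registration(xml_text: str) -> dict:
    """Return the collection ID, title, and active buckets with their controlled terms."""
    root = ET.fromstring(xml_text)
    buckets = {}
    for bucket in root.findall("bucket"):
        if bucket.get("active") == "true":
            # An empty list means the bucket has no controlled vocabulary.
            buckets[bucket.get("name")] = [term.text for term in bucket.findall("term")]
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "buckets": buckets,
    }


registration = read_registration(REGISTRATION_XML)
```

A client could use the resulting dictionary to decide which search fields to show for a collection and which pick lists to populate.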
Since the variety of collections is wide – indeed it is difficult to agree on a definition of "collection" in the first place – ADL developed and implemented collection metadata to describe and register whatever collections come along. It has been very successful and provides us with a way to accommodate many more collections in the future. We therefore recommend the collection metadata approach to the biodiversity community. It is the key to accessing a variety of collections, where a "collection" is whatever someone decides to call a collection. It is a way to capture both inherent and contextual metadata for registration and user documentation purposes.
Metadata Developments: Gazetteer Metadata
The next ADL metadata development I will present is the work we have done with gazetteers. The word "gazetteers" is not familiar to everyone. They can be described as dictionaries of named geographic places. ADL further defines gazetteers to require three minimum descriptive elements for each place: (1) a name; (2) a location in latitude and longitude coordinates; and (3) a type or category. An example is
Name: Goleta
Type: populated place
Location: -119.83,34.44 (decimal degrees for longitude and latitude)
The following example illustrates the value of such a gazetteer in a digital library. A user has a "where is" type question: "Where is Philadelphia?" The system returns a footprint for Philadelphia and displays it on the map (this is a simplified example, ignoring for the moment that there is more than one Philadelphia in the world). Next the user asks "What rivers are in the Philadelphia area?" The system knows the footprint of Philadelphia and it knows the footprints of entries in the gazetteer of the type "rivers." It can match these footprints and return a list of the rivers "in" the Philadelphia area. Next the user might ask a question like "What remote sensing images are there of the Philadelphia area?" This is a search of the catalog rather than the gazetteer. The system can compare the footprint of Philadelphia to the footprints of items of the type "remote sensing images" in the catalog and return a list of those whose footprints overlap the Philadelphia area. This retrieval works not because the images are labeled with "Philadelphia" but because the match can be made on the basis of footprints. This use of footprints is known as indirect georeferencing.
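The Philadelphia scenario can be sketched in Python as a placename lookup followed by footprint overlap tests. This is a simplified illustration: footprints are reduced to bounding boxes, the coordinates are rough values chosen for the example rather than authoritative data, and the function names and catalog titles are hypothetical.

```python
from typing import List, NamedTuple, Tuple

# Bounding box as (west, south, east, north) in decimal degrees (assumed layout).
BBox = Tuple[float, float, float, float]


class Place(NamedTuple):
    name: str
    type: str
    footprint: BBox


class CatalogItem(NamedTuple):
    title: str
    type: str
    footprint: BBox


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share any area or edge (ignores the antimeridian)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Illustrative entries only; the coordinates are approximate.
GAZETTEER = [
    Place("Philadelphia", "populated place", (-75.28, 39.87, -74.96, 40.14)),
    Place("Schuylkill River", "rivers", (-76.30, 39.88, -75.18, 40.80)),
    Place("Santa Ynez River", "rivers", (-120.60, 34.50, -119.50, 34.70)),
]

CATALOG = [
    CatalogItem("Scene A", "remote sensing images", (-76.0, 39.0, -74.0, 41.0)),
    CatalogItem("Scene B", "remote sensing images", (-121.0, 33.0, -119.0, 35.0)),
    CatalogItem("Street map", "maps", (-75.3, 39.9, -75.0, 40.1)),
]


def where_is(name: str) -> BBox:
    """Answer a "where is" question with the footprint of the first matching place."""
    for place in GAZETTEER:
        if place.name == name:
            return place.footprint
    raise KeyError(name)


def places_of_type_in(area: BBox, feature_type: str) -> List[str]:
    """Gazetteer search: named places of one type whose footprints overlap the area."""
    return [p.name for p in GAZETTEER
            if p.type == feature_type and overlaps(p.footprint, area)]


def items_of_type_in(area: BBox, item_type: str) -> List[str]:
    """Catalog search: items of one type whose footprints overlap the area."""
    return [i.title for i in CATALOG
            if i.type == item_type and overlaps(i.footprint, area)]


philadelphia = where_is("Philadelphia")
rivers = places_of_type_in(philadelphia, "rivers")
images = items_of_type_in(philadelphia, "remote sensing images")
```

Note that "Scene A" is found even though its title never mentions Philadelphia; the placename only supplies the footprint that drives the spatial match.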
In building ADL, we developed a 6-million-entry gazetteer by combining the two large U.S. federal gazetteers from the U.S. Geological Survey (USGS) and the National Imagery and Mapping Agency (NIMA). In the process, we found out firsthand how difficult it is to combine gazetteer data from different sources, and that there is no shared concept of how gazetteer information is represented. We therefore developed a Gazetteer Content Standard (GCS) and a Feature Type Thesaurus (FTT) to provide type categories. We are in the process of implementing them.
GCS provides for the representation of names and variant names for places and information about these names: the source or authority of the name, the language, etymology, pronunciation, dates when the name was/is used, and more. Each name is assigned one or more type categories. If the place has a feature code (e.g., a FIPS code), it can be included. The location of the place can be given by a point, bounding box, or polygonal coordinate description. Features can be related to one another; e.g., one place "IsPartOf" another. Data such as elevation or population can be given for a place and links can be made to other sources of information, such as a city’s homepage. Temporal ranges can be given for the names themselves, the footprints, the data, and the relationships. Each entry, and each part of each entry, can be attributed to a contributor and to a source.
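One way to picture a GCS-style entry is as a small record structure. The Python sketch below is an assumption-laden illustration, not the Gazetteer Content Standard itself: the class names, field names, and the choice to encode footprints as a list of coordinate pairs are all hypothetical, and only a few of the elements described above are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PlaceName:
    text: str
    language: Optional[str] = None
    source: Optional[str] = None      # authority for the name
    used_from: Optional[int] = None   # first year the name was in use
    used_to: Optional[int] = None     # last year, or None if still current


@dataclass
class Relationship:
    kind: str    # e.g. "IsPartOf"
    target: str  # name or identifier of the related place


@dataclass
class GazetteerEntry:
    names: List[PlaceName]
    types: List[str]
    # (longitude, latitude) pairs: one pair is a point, two are bounding-box
    # corners, three or more are polygon vertices (assumed encoding).
    footprint: List[Tuple[float, float]]
    feature_code: Optional[str] = None
    relationships: List[Relationship] = field(default_factory=list)
    contributor: Optional[str] = None

    def is_minimal(self) -> bool:
        """Check the three required elements: a name, a location, and a type."""
        return bool(self.names) and bool(self.footprint) and bool(self.types)


goleta = GazetteerEntry(
    names=[PlaceName("Goleta")],
    types=["populated place"],
    footprint=[(-119.83, 34.44)],
    relationships=[Relationship("IsPartOf", "Santa Barbara County")],
)
```

The is_minimal check corresponds to the three minimum descriptive elements ADL requires of a gazetteer entry; everything else in the record is optional enrichment.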
There is no common set of feature types for gazetteers, and making different categorization schemes work together is one of the most difficult parts of combining data from various sources into a new gazetteer. We have developed a thesaurus of feature types which we are applying to our gazetteers and which we hope will be adopted by others. It is based on the Z39.19 standard for hierarchical thesauri designed for information retrieval. It includes broader term/narrower term relationships, synonymous terms, and related terms.
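A hierarchical thesaurus helps retrieval because a search on a broad type can be expanded to its narrower types, and a synonym can be redirected to its preferred term. The Python sketch below shows that expansion under stated assumptions: the terms and relationships are made up for the example and are not copied from the Feature Type Thesaurus, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Illustrative terms only; not taken from the ADL Feature Type Thesaurus.
NARROWER: Dict[str, List[str]] = {
    "hydrographic features": ["streams", "lakes"],
    "streams": ["rivers", "creeks"],
}

# Synonymous (non-preferred) terms mapped to the preferred term.
USE: Dict[str, str] = {"brooks": "creeks"}


def preferred(term: str) -> str:
    """Replace a non-preferred term with its preferred equivalent."""
    return USE.get(term, term)


def expand(term: str) -> Set[str]:
    """Return the preferred term plus every narrower term beneath it."""
    start = preferred(term)
    found = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for child in NARROWER.get(current, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


stream_types = expand("streams")
```

A gazetteer search for "streams" would then match entries typed as "rivers" or "creeks" as well, which is how a shared thesaurus lets differently categorized sources answer the same query.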
Both the Gazetteer Content Standard and the Feature Type Thesaurus are available through my homepage: http://www.alexandria.ucsb.edu/~lhill.
We are currently in the process of converting a current version of the NIMA gazetteer to the new Content Standard using the terms from the Feature Type Thesaurus. We already have sets of bounding boxes for countries and U.S. counties loaded as well as a set of volcano sites. We have various other sets waiting for conversion, including the GNIS from USGS. We are looking for sets of gazetteer information that include polygon or bounding box footprints to load. We are working on extracting polygon footprints for places from digital map products.
Georeferencing is an identification key that can be applied to all types of information – not all information, but to all types of information including placenames. Georeferencing is a "natural bridge" across information types because latitude and longitude referencing is universally understood. A spatially referenced gazetteer is a powerful component of a georeference system because it adds the dimension of indirect spatial referencing through the use of placenames.
We therefore recommend to the biodiversity community that standard practices for gazetteer development and use be adopted so that geographical site descriptions developed by one subgroup of the community can be shared and used by other subgroups and with other information operations as well.