4 CURRENT STATUS OF TESTBED DEVELOPMENT

Alexandria Digital Library: ANNUAL REPORT

We now describe the main aspects of the development of our testbed system in terms of the current version of the WP. We first describe the system in relation to its general, four-component architecture; we then describe each of these four components in greater detail; and finally, we describe our activities in terms of two cross-cutting technologies, namely image processing and parallel processing. This description involves a mixture of purely developmental activities and research activities, since these two sets of activities are essentially inseparable.

4.1 Architecture of the Testbed System

The current testbed system, a revised version of our initial WP, is based on the four-component architecture illustrated in Figure 1, which derives from the four major components of a traditional library. This simple and robust architecture has been of value for both the RP and WP systems.

The catalog component includes metadata and search engines that permit users to identify holdings of interest. The storage component contains the digital holdings, organized into collections. The user interface supports graphic and text-based access to the other ADL components and services. Librarians use the ingest component to store new holdings, extract metadata from the holdings, and add metadata to the catalog.

The WP architecture is a special case of the general architecture, with differing languages and protocols at the component interfaces. Figure 1 also illustrates the languages and protocols employed in the WP. Unlike the RP version of the ADL, the WP storage and catalog components are distributed.

4.2 The Catalog Component

The catalog component of ADL permits users to map their information requirements into the most appropriate set of information in the collections of ADL. While a traditional library cataloging system (author-title-subject) provides a basic model for a DL's catalog component, it is inadequate for geographically-referenced holdings such as maps and images. Catalogs for geographically-referenced information must additionally support access to holdings in terms of their representations, their spatial "footprints," and their contents.

DL technology greatly increases our ability to extract, store, and search new classes of metadata about library holdings. A major thrust of ADL activity is thus to extend current models of catalogs and metadata. Also, to support catalog interoperability, the ADL is employing standards to represent and exchange catalog information.

To meet these criteria we developed a catalog schema for the RP using elements from the USMARC and Federal Geographic Data Committee (FGDC) metadata standards. We then expanded the schema for the WP to include metadata supporting simple content-based queries.

4.2.1 Basic Metadata: USMARC and FGDC Standards

The basic metadata for geographically-referenced information in both the RP and WP systems combines elements from the USMARC and FGDC metadata standards.

Since the late 1960s, USMARC has been a national standard for database descriptions of library holdings. It includes fields for cataloging analog geographic data, as well as open-ended "local-use" fields that can be made to accommodate digital data. As of 1995, the USMARC standard contains the full FGDC standard and, unlike FGDC, includes thesauri (e.g. for subject headings).

USMARC stores all metadata for a given holding in a single record with four components: leader, record directory, control fields, and variable fields. This "flat" structure, while not optimal for a relational database, is useful for specifying metadata input/output functions and for exchanging metadata records between different DLs.
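The directory-driven layout described above can be sketched as follows. This is not the ADL implementation, only an illustration of how a MARC-style record directory is decoded: a 24-byte leader is followed by 12-byte directory entries (3-byte tag, 4-byte field length, 5-byte start offset), terminated by a field separator. The toy record below is hypothetical.

```python
# Sketch (not the ADL implementation): decoding the record directory
# of a MARC-style flat record. Leader = 24 bytes; each directory entry
# = 12 bytes (3-byte tag, 4-byte length, 5-byte offset); the directory
# ends at a field-separator character (0x1E).

FIELD_SEP = "\x1e"

def parse_directory(record):
    """Return a list of (tag, length, offset) tuples from the directory."""
    directory = record[24:record.index(FIELD_SEP, 24)]
    entries = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return entries

# A toy directory describing two fields: tag 245 (title), 20 bytes at
# offset 0, and tag 034 (cartographic data), 15 bytes at offset 20.
toy = " " * 24 + "245002000000" + "034001500020" + FIELD_SEP
print(parse_directory(toy))  # [('245', 20, 0), ('034', 15, 20)]
```

Because every field is located through the directory, a generic import/export routine needs no knowledge of individual field semantics, which is what makes the flat structure convenient for exchanging records between DLs.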

The FGDC promotes the coordinated development, use, sharing, and dissemination of surveying, mapping, and related spatial data. Its metadata standard for digital geospatial data has been mandated for use by all U.S. Federal agencies.

Relative to USMARC, the FGDC standard only provides definitions for a small number of fields, and their logical relations in a hierarchical structure. While these fields are adequate for cataloging digital geospatial data, they do not accommodate analog spatial materials. Moreover, the FGDC standard does not specify a particular format or structure for metadata representation, resulting in a variety of implementations and a lack of generic import/export functions.

By combining the FGDC and USMARC standards, the ADL has been able to catalog all forms of spatial data thus far encountered, including remote-sensing imagery, digitized maps, digital raster and vector datasets, text, videos, and remote WWW servers. The metadata schema for the WP has approximately 400 fields, including all FGDC fields and selected USMARC fields. To create the schema, we converted the FGDC production rules and USMARC record hierarchy into a single normalized entity-relationship (ER) data model, from which the physical database schemata are generated automatically by CASE tools.

4.2.2 The ADL Gazetteer

The WP catalog incorporates two major extensions to the combined FGDC-USMARC metadata model, both supporting forms of content-based search. The first extension allows digital image holdings to be searched for occurrences of preselected image features, such as textures, and is more fully described below.

The second extension allows ADL holdings to be retrieved based on the relationship between the footprints of the holdings and the footprints of named geographic features (such as cities, rivers, and mountains.) Lists of such features and their footprints are commonly called gazetteers. In addition to feature names and footprints, gazetteers include varying detail about the type of each feature, often organized as a class hierarchy (e.g., hydrographic_feature.watercourse.ephemeral_stream.)
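A feature-type hierarchy of the dotted form shown above lends itself to simple prefix matching: a query for a broad class should retrieve any feature whose class lies below it. The sketch below illustrates this; the feature names and all classes except the ephemeral_stream example from the text are hypothetical.

```python
# Sketch (hypothetical feature names): matching gazetteer feature
# classes organized as a dotted hierarchy, e.g.
# hydrographic_feature.watercourse.ephemeral_stream. A query for a
# broad class matches any feature whose class is a descendant of it.

def class_matches(feature_class, query_class):
    """True if feature_class equals query_class or lies below it."""
    return (feature_class == query_class
            or feature_class.startswith(query_class + "."))

features = {
    "Mission Creek": "hydrographic_feature.watercourse.ephemeral_stream",
    "Lake Cachuma": "hydrographic_feature.reservoir",
    "San Marcos Pass": "physiographic_feature.pass",
}

hits = [n for n, c in features.items()
        if class_matches(c, "hydrographic_feature")]
print(sorted(hits))  # ['Lake Cachuma', 'Mission Creek']
```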

The ADL gazetteer is a hybrid of two large standard digital gazetteers, maintained respectively by the USGS Geographic Names Information System (GNIS) and the U.S. Board on Geographic Names (BGN). The GNIS gazetteer contains about 1.8M names of US features, organized hierarchically into 15 classes of features, while the BGN gazetteer contains approximately 4.5M names of land and undersea features. The ADL gazetteer is a union of the GNIS and BGN names and footprints, and an intersection of their feature classes.

The ADL gazetteer is maintained in the ADL catalog database, but is also available externally for search by the Excalibur (ConQuest) semantic network text-retrieval engine. We have found the ConQuest version of the gazetteer useful for "fuzzy" gazetteer searches, where a user may not know the precise spelling or configuration of a gazetteer feature (e.g., "Santa Barbara Airport" versus "Santa Barbara Municipal Airport").
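The kind of "fuzzy" name lookup just described can be illustrated with ordinary string similarity. The ADL uses the ConQuest semantic-network engine for this; the standard-library difflib matcher below is only a stand-in to show the idea, and the gazetteer entries are illustrative.

```python
# Sketch of a "fuzzy" gazetteer name search using simple string
# similarity. (The ADL uses the ConQuest engine; difflib is only a
# stand-in to illustrate the idea.)

import difflib

gazetteer = [
    "Santa Barbara Municipal Airport",
    "Santa Barbara Mission",
    "Goleta Point",
]

def fuzzy_lookup(query, names, cutoff=0.6):
    """Return gazetteer names ranked by similarity to the query."""
    scored = [(difflib.SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
              for n in names]
    return [n for score, n in sorted(scored, reverse=True) if score >= cutoff]

print(fuzzy_lookup("Santa Barbara Airport", gazetteer))
```

A query for "Santa Barbara Airport" ranks "Santa Barbara Municipal Airport" first even though the strings are not identical, which is exactly the behavior a user with an imprecise name needs.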

There are two significant research issues associated with the use of gazetteers in the ADL. First, different gazetteers use different terminologies and hierarchies to describe the same features. So far we have been able to construct "crosswalks" between gazetteers by consulting their reference documents, but this will not be possible with (for example) historical placename lists.

A second issue is the nature of a named feature's footprint. For example, what is the footprint of "Santa Barbara" - is it the city limits or the benchmark at City Hall or the county boundary? Existing gazetteers often provide only point locations for both point and area features. It is often unclear how the points are chosen, and whether they are centroids, corners, or some arbitrarily chosen point. To complicate matters, there is the issue of how to deal with useful named features that have only "fuzzy" footprints (e.g., "Southern California", "Sierra Nevada"). In cases like these, ambiguity and fuzziness are inherent in a person's notion of the spatial extent of a feature, and they are particularly difficult to specify.

4.2.3 Other Catalog Issues

As the ADL catalog grows, spatial indexing methods play an increasingly important role in supporting footprint queries. We are investigating various methods for indexing multidimensional hierarchical data such as footprints (see the paper by Kothuri and Singh, 1995). In particular, we have extended B-trees to "IB-trees", which accommodate objects that span a range of values (intervals) rather than single values (points) in the data space. IB-trees decompose d-dimensional data objects into d intervals, one per dimension, and index the intervals in each dimension separately.
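The decomposition step can be sketched as follows. This is not the IB-tree itself: linear scans over per-dimension interval tables stand in for the B-tree-derived structures, but the decomposition of a 2-D footprint into an x-interval and a y-interval, with the query intersecting the per-dimension candidate sets, is the idea described above.

```python
# Sketch of the IB-tree decomposition: each 2-D footprint
# (xmin, xmax, ymin, ymax) is split into an x-interval and a
# y-interval, indexed separately; a query intersects the candidate
# sets from each dimension. Linear scans stand in for the actual
# B-tree-derived index structures.

def overlaps(lo1, hi1, lo2, hi2):
    return lo1 <= hi2 and lo2 <= hi1

class IntervalIndex2D:
    def __init__(self):
        self.x = {}  # object id -> (xmin, xmax)
        self.y = {}  # object id -> (ymin, ymax)

    def insert(self, oid, xmin, xmax, ymin, ymax):
        self.x[oid] = (xmin, xmax)
        self.y[oid] = (ymin, ymax)

    def query(self, xmin, xmax, ymin, ymax):
        xs = {i for i, (lo, hi) in self.x.items() if overlaps(lo, hi, xmin, xmax)}
        ys = {i for i, (lo, hi) in self.y.items() if overlaps(lo, hi, ymin, ymax)}
        return xs & ys  # candidates must overlap in every dimension

idx = IntervalIndex2D()
idx.insert("mapA", 0, 10, 0, 10)
idx.insert("mapB", 20, 30, 0, 10)
print(sorted(idx.query(0, 5, 0, 5)))  # ['mapA']
```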

Although the primary external interface to the ADL is the WWW, the ADL catalog also supports a Z39.50 interface. Z39.50 is the standard online protocol for traditional library catalogs, and is also the current standard search protocol for the National Spatial Data Infrastructure (NSDI). The NSDI is being coordinated by the FGDC as a collection of Z39.50 servers supporting queries against FGDC-compliant metadata.



4.3 Collections

To date, the development of ADL collections has focused on demonstrating the functionality of the library; hence the collections are still relatively small.

We have loaded the following data sets:

We are currently loading a multi-gigabyte data set of ecological information from the Sierra Nevada Project (SNEP).

4.4 The User Interface

The ADL's user interface enables:

The RP's UI, based on the GIS software package ArcView, supported the first three of these functions. In the remainder of this section we address issues that arose in migrating the ADL to the WWW environment, and in adding significant functionality beyond the RP.

4.4.1 User Interface Issues

The ADL WP must operate within the following WWW limitations:


No WWW browser that we know of supports vector data display, nor does HTML make any explicit provision for vector data. This is a serious issue, given the large fraction of vector data in the ADL collections. Vector data input, such as defining a geographic search region, is possible only with great difficulty. A natural procedure would be to draw a polygon on a base map by either clicking on multiple points or clicking and dragging over a desired region, but these actions are not supported by current WWW browsers, which immediately send an HTTP request after a user input event (e.g., mouse click.)

HTTP's statelessness hinders browsing and searching. By default, once a server responds to a client's HTTP request, neither the client nor the server retains any state or "memory" of the transaction (other than perhaps logging the URL involved). This makes it difficult to implement such essential features as per-user configurations and iteratively-refined searches. To simulate a stateful connection (e.g., a "session"), information must be explicitly maintained by either the client (in parameters stored in the URL or "hidden" HTML form variables) or the server (in unique user identifiers and a session database.)
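The server-side half of the session mechanism described above can be sketched briefly: the server hands the client an opaque handle, the client returns it with each request (in a URL parameter or hidden form variable), and the server uses it to restore per-user state. The function and variable names below are illustrative, not those of the WP.

```python
# Sketch of simulating a stateful "session" over stateless HTTP.
# The server keeps state keyed by an opaque handle that the client
# echoes back with each request. Names are illustrative only.

import secrets

sessions = {}

def open_session():
    """Create a new session and return its opaque handle."""
    handle = secrets.token_hex(8)
    sessions[handle] = {"config": {}, "last_query": None}
    return handle

def handle_request(handle, query):
    """Each request carries the handle; the server restores its state."""
    state = sessions[handle]
    state["last_query"] = query  # remembered for iterative refinement
    return f"results for {query!r} (session {handle})"

h = open_session()
handle_request(h, "theme=geology")
handle_request(h, "theme=geology AND date>1990")
print(sessions[h]["last_query"])  # theme=geology AND date>1990
```

Because the server remembers the previous query, a follow-up request can refine it rather than restate it, which is precisely what statelessness otherwise prevents.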

To tailor its services to different levels of users, a UI should be user-customizable, and be able to save a particular configuration for use in future sessions. Additionally a user must be able to retrieve a particular data item or metadata record. Since the WWW is part of the Internet, simple bulk retrieval via FTP is straightforward to implement. As noted above, however, the holdings in the ADL are often extremely large, so methods that allow for the extraction and progressive transfer of relatively small increments of data holdings are also required.

4.4.2 User Interface Implementation

Conceptually, the WP UI is a collection of HTML "pages" implementing three major search capabilities:

These pages are supplemented by control/configuration and help/glossary links. The UI is designed around a state-transition model, with each state representing a WWW form or page, some of which include partial or complete query results. The HTML code for the WP UI is generated dynamically by approximately 15K lines of Tcl code running in a NaviServer HTTP server.

The primary function of both the map browser and the gazetteer pages is to allow the user to define spatial extents or regions for catalog searches. The map browser allows these search regions to be defined explicitly (by zooming and panning a base map), while the gazetteer defines them implicitly (as the footprints corresponding to place names and feature types.) Figure 2 shows a screen dump of the map browser.

The visible portion of the map browser's base map (the display window) is the default search footprint (the query window), but this relationship can be modified (e.g., the user may specify a subset of the display window, or may direct that the display window be completely ignored.) The base map is also the background on which the gazetteer and catalog query result footprints are drawn. The base map images are dynamically generated by a Common Gateway Interface (CGI) application based on the Xerox PARC Map Viewer [http://www.parc.xerox.com/map/], which we have modified to support generic labeling, fast panning, and graphic overlays.

Figure 2: The map browser component of the interface.

Figure 3: The browse graphic display returned from a query.

Gazetteer queries may interact with the map browser. For example, if the current map browser query window contains the USA but not Europe, then a gazetteer query with the place name set to "Paris" (and the query window enabled) will return Paris, Texas but not Paris, France. The map browser, in turn, may be directed to reset the query window to the minimum bounding geographic rectangle for the gazetteer query results.
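The Paris example above amounts to filtering a place-name match against the current query window. The sketch below illustrates this with point footprints and hypothetical coordinates; real gazetteer footprints may of course be rectangles or polygons.

```python
# Sketch of a gazetteer query filtered by the map browser's query
# window, as in the "Paris" example. Footprints are single (lon, lat)
# points for simplicity; the coordinates are approximate.

gazetteer = {
    "Paris, Texas": (-95.6, 33.7),
    "Paris, France": (2.35, 48.9),
}

def in_window(pt, window):
    west, south, east, north = window
    lon, lat = pt
    return west <= lon <= east and south <= lat <= north

def gazetteer_query(name, window):
    """Place-name matches restricted to the current query window."""
    return {k: v for k, v in gazetteer.items()
            if name in k and in_window(v, window)}

usa_window = (-125.0, 24.0, -66.0, 50.0)  # roughly the continental USA
print(sorted(gazetteer_query("Paris", usa_window)))  # ['Paris, Texas']
```

Resetting the query window to the results' minimum bounding rectangle, as the text describes, is then just a min/max over the returned footprints.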

Query windows resulting from gazetteer-map browser interactions are ultimately passed to the catalog page for incorporation into catalog queries. In addition to geographic footprints, the catalog page allows the user to search against any of the metadata fields (such as theme, time, or author) in the ADL catalog, expressed as textual or numeric values.

Catalog queries are assembled from user input into a generic conjunctive normal form (CNF) representation, and then translated into the specific query language (currently SQL) of the catalog DBMS. Query results are converted to HTML tables, with hyperlinks to browse images and online holdings. Query results are presented incrementally, with a subset of the metadata fields displayed initially and complete fields subsequently displayed for user-selected holdings. The format and fields used in the query results are completely user-configurable.
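The CNF-to-SQL translation step can be sketched as follows. The table and field names are hypothetical; the point is only that a list of AND-ed clauses, each a list of OR-ed (field, operator, value) terms, maps mechanically onto a parameterized WHERE clause.

```python
# Sketch of translating a generic CNF query representation into SQL,
# as the text describes. Table and field names are hypothetical.

def cnf_to_sql(table, cnf):
    """cnf: list of clauses; each clause is a list of
    (field, op, value) disjuncts. Returns (sql, parameters)."""
    clauses, params = [], []
    for clause in cnf:
        terms = []
        for field, op, value in clause:
            terms.append(f"{field} {op} ?")  # placeholder per term
            params.append(value)
        clauses.append("(" + " OR ".join(terms) + ")")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params

sql, params = cnf_to_sql("holdings", [
    [("theme", "=", "geology"), ("theme", "=", "hydrology")],  # theme clause
    [("pub_year", ">=", 1990)],                                # time clause
])
print(sql)
# SELECT * FROM holdings WHERE (theme = ? OR theme = ?) AND (pub_year >= ?)
```

Keeping the intermediate representation generic means the same UI code could target a different catalog DBMS by swapping only the final translation step.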

Queries may also return the footprints of ADL holdings, which may be displayed on the map browser base map. Unfortunately, it is common for many more footprints to be returned from a catalog query than can be shown intelligibly on the map browser's relatively small display. When footprints of multiple data holdings are displayed on the same map, it is difficult to distinguish which footprint is associated with which item. We continue to experiment with heuristics and visual aids (such as clustering and labeling) for disambiguating "crowded" footprints. Figure 3 shows examples of the browse graphics that may be returned as the partial results of a query.

The WP UI stores all user configuration parameters, query statements, and current query result sets in a separate (from the catalog) database maintained by the NaviServer HTTP server. This state information may also be stored on request on the client side in "hidden" HTML form variables. This allows a user to save an ADL "session" by using the browser's save-page feature. The session may be restored by reloading the saved page. Otherwise, state maintenance is handled entirely by the server, with only a minimal opaque handle used on the client side to identify the current session.

4.5 Image Processing

We are applying image processing technologies to a range of ADL issues. Image processing has implications for efficient storage, access, and retrieval of DL holdings.

Bandwidth and/or storage limitations often make it impractical to retrieve a large image from a DL as a single item. Furthermore, different users may desire different levels of image resolution. A general solution to these problems is to maintain hierarchical, multi-scale representations of image data. Our particular solution is to employ wavelet transforms.

Wavelets have been widely used in many image-processing applications, including compression, enhancement, reconstruction, and image analysis. Fast algorithms exist for computing the forward and inverse wavelet transforms, and desired intermediate levels can be easily reconstructed. The transformed images (wavelet coefficients) also map naturally into hierarchical storage structures.

We are also applying image processing techniques to the general problem of content-based access to DL holdings. Our current implementation uses texture as a basis for describing and cataloging the content of a library of images.

4.5.1 Browsing and Progressive Delivery

A useful property of the wavelet decomposition is that the lowest-resolution components may be used as "thumbnail" images for browsing. Experience with thumbnails in the RP convinced us that they are invaluable for browsing through large numbers of images and making rapid "go/no-go" evaluations. With wavelets we can support a richer browsing model in which users may zoom in on a given region until they have reached an acceptable level of detail. Wavelet transformations support the progressive delivery of these images, with rapid delivery of both the low-resolution browse images and the incremental higher-resolution components.
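The multi-scale idea can be illustrated with the simplest wavelet, the 1-D Haar transform: the averages at each level serve as the low-resolution "thumbnail", and the stored details reconstruct the next level exactly, so each transmitted component refines the image rather than replacing it. This sketch is an illustration of the principle, not the ADL's wavelet code (which operates on 2-D images with more sophisticated filters).

```python
# Sketch of progressive delivery with a 1-D Haar transform: the
# averages are the browse "thumbnail"; sending the detail component
# lets the client reconstruct full resolution incrementally.

def haar_step(signal):
    """One Haar level: (averages, details). Length must be even."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def haar_inverse(avg, det):
    """Recombine averages and details into the full-resolution signal."""
    out = []
    for a, d in zip(avg, det):
        out += [a + d, a - d]
    return out

signal = [9, 7, 3, 5]
avg, det = haar_step(signal)   # avg = [8.0, 4.0] -- the "thumbnail"
print(avg)
print(haar_inverse(avg, det))  # [9.0, 7.0, 3.0, 5.0] -- full resolution
```

Applying haar_step recursively to the averages yields the full hierarchy: a client that stops after the first component has a usable thumbnail, and each further component roughly doubles the resolution.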

Current WWW browsers cannot display wavelet data directly. The WP gets around this restriction with a customized "helper application" invoked by the client browser whenever it receives an image of MIME type "wavelet". The helper application retains the previously downloaded components, so that the WP UI need only transmit the "next" component in response to a request for higher-resolution data.

The helper application is not our long-term preferred solution for wavelet display, since it requires us to make a locally-developed executable program available for any possible ADL client hardware/software environment. A better solution, which we are currently pursuing, is to develop an inverse wavelet transform as an "applet" in a portable language such as Java, that can be downloaded into a standard WWW browser such as Netscape.

4.5.2 Texture-Based Retrieval

Content-based retrieval is critical for accessing large collections of digital images. The ADL is investigating the use of texture as a basis for content-based search (see the paper by Ma and Manjunath, 1995), initially by adding catalog indices based on image texture features. Specifically, texture information is extracted from images as they are ingested, using banks of Gabor (modulated Gaussian) filters. This is roughly equivalent to extracting lines, edges, and bars from the images, at different scales and orientations. Simple statistical moments (e.g., mean and standard deviation) of the filtered outputs are then used for matching and indexing images.
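The moment-based matching step can be sketched in miniature. A real system applies a bank of Gabor filters at several scales and orientations; here a crude 1-D difference filter stands in for the filter bank, and (mean, standard deviation) of its response form the feature vector, with Euclidean distance used for matching. All pixel values and template names are illustrative.

```python
# Sketch of texture indexing by statistical moments of filtered
# outputs. A simple high-pass difference filter stands in for the
# Gabor filter bank; (mean, std) of the response is the feature.

import math

def texture_features(row):
    """Mean and std of a crude filter response over a 1-D pixel row."""
    response = [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]
    mean = sum(response) / len(response)
    var = sum((r - mean) ** 2 for r in response) / len(response)
    return (mean, math.sqrt(var))

def distance(f1, f2):
    return math.dist(f1, f2)  # Euclidean distance in feature space

smooth = [10, 10, 11, 10, 10, 11]     # e.g., open field
busy = [10, 200, 15, 190, 5, 210]     # e.g., parking lot
query = [12, 180, 20, 200, 10, 190]   # user-selected region

templates = {"smooth": texture_features(smooth),
             "busy": texture_features(busy)}
q = texture_features(query)
best = min(templates, key=lambda name: distance(templates[name], q))
print(best)  # busy
```

The query region's feature vector lands near the "busy" template, so holdings containing that template would be retrieved, mirroring the parking-lot example below.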

The WP catalog includes a database of texture templates which can be matched against actual textures extracted from ADL collection holdings. One example of the class of accesses enabled by this information is initiating a search by choosing an image region. The region's texture will be used to retrieve matching texture templates, which in turn reference the ADL holdings in which they occur.

Figure 4 shows an example of browsing aerial photographs using subsampled (reduced-resolution) versions of the images (on the left) and retrieval using query patterns. Users can browse through the thumbnails (top left) of the air photos, each of which is about 5K x 5K pixels. A more detailed (1/10th-resolution) image is shown on the right. Subregions can be selected from this larger image and similar-looking patterns retrieved. The figure shows an example in which a parking lot is selected, along with the dictionary code word and the retrieved patterns.


Figure 4: The image browsing tool for large aerial photos.

4.6 Parallel Processing

The Alexandria Project is investigating parallel computation (see the paper by Andresen, Egecioglu, Ibarra, and Poulakidas, 1995) to address various performance issues, including multiprocessor servers, parallel I/O, and parallel wavelet transforms, both forward (for image ingest) and inverse (for efficient browsing of multi-scale images.)

We have developed a prototype parallel HTTP server containing a set of collaborative processing units, each of which is capable of handling a user request. The distinguishing feature of the server is resource optimization based on close collaboration of multiple processing units. Each processing unit is a workstation (e.g. SUN SPARC or a Meiko CS-2 node) linked to a local disk. The disks are NFS-mounted to all processing units. Resource constraints affecting the performance of the server are:

Actively monitoring the CPU, disk I/O, and network loads of the system's resource units, and then dynamically scheduling incoming HTTP requests to the most appropriate node, keeps the server's performance relatively insensitive to request load while allowing it to scale with additional resources. In simulations, the total round-trip response time is improved significantly by the use of multiple processing units, and does not change significantly as the request rate increases, even into the range of 5 to 30 million requests per week.
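The scheduling decision described above can be sketched as follows: each request goes to the unit with the lowest combined monitored load, rather than to the next unit in round-robin order. The node names, load values, and equal weighting of CPU and I/O load are illustrative, not the ADL server's actual cost model.

```python
# Sketch of load-based request scheduling: pick the processing unit
# with the lowest combined monitored load. Weights and values are
# illustrative only, not the ADL server's cost model.

class Node:
    def __init__(self, name):
        self.name = name
        self.cpu = 0.0  # monitored CPU load (0..1)
        self.io = 0.0   # monitored disk-I/O load (0..1)

    def load(self):
        return 0.5 * self.cpu + 0.5 * self.io

def schedule(nodes):
    """Route the next HTTP request to the least-loaded node."""
    return min(nodes, key=Node.load)

nodes = [Node("cs2-0"), Node("cs2-1"), Node("cs2-2")]
nodes[0].cpu, nodes[0].io = 0.9, 0.4
nodes[1].cpu, nodes[1].io = 0.2, 0.1
nodes[2].cpu, nodes[2].io = 0.6, 0.7

print(schedule(nodes).name)  # cs2-1
```

Round-robin would have sent the next request to whichever node was next in sequence, regardless of its load; the load-based choice is what yields the 20% to 50% improvement reported below.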

We have observed similar speedups using a multi-node server when varying the size of retrieved image files (typical ADL holdings.) Since the computational and I/O demands of requests to the ADL vary dramatically for large images and complex metadata, the load-balancing approach offers a 20% to 50% improvement in performance over a simple round-robin approach.

4.7 Computing Support for the Testbed: Hardware and Communications

Our current computing support has been adequate for our recent research and development needs. As we move into the next phase of the Alexandria Project, however, it is clear that we need to resolve two sets of needs relating respectively to storage and to servers, as well as considering another upgrade of the communications facilities at UCSB.

We first describe our current equipment and then describe our equipment needs.

4.7.1 Current Computing Equipment

We have installed and are using 4 DEC Alpha workstations and 2 microcomputers. The Alphas were obtained with major discounts from our partner, DEC. These machines, together with existing machines, should be adequate for most of our development needs over the next 6 months. Our current computing equipment inventory includes:

  1. Alpha server;
  2. Alpha Stations;
  3. Laptops;
  4. Pentium PCs;
  5. Mac Laptop;
  6. Sparcserver 1000E.

4.7.2 Current Networking Support

Our current networking facilities include:

  1. an internal 10 Mbps Ethernet network;
  2. a current connection to the FDDI campus backbone via FOIRL (Fiber Optic Inter-Repeater Link), essentially Ethernet over fiber;
  3. an upgrade in progress to a direct FDDI connection to the campus FDDI backbone;
  4. a campus backbone connected to the outside world via a T1 line.

4.7.3 Current Storage Support

Our current storage facilities include:

  1.  28GB RAID;
  2.  104GB Optical Juke Box;
  3.  64GB of disk;
  4.  2 4mm tape drives;
  5.  1 20GB DLT;
  6.  1 8mm tape drive.

We also have an allocation of at least 1 TB of tertiary storage at San Diego Supercomputer Center.

4.7.4 Current "Other" Hardware and Software Support

Our other hardware and software facilities include:

  1.  2 scanners;
  2.  OCR (Optical Character Recognition) Software;
  3.  GIS and Image processing packages;
  4.  Oracle, Illustra, Sybase, O2;
  5.  various Web servers;
  6.  Access to MIL equipment.

4.7.5 Equipment and Facilities Needs

As noted above, mass storage and high-performance servers are two critical areas in which we need to acquire further hardware support.

Need for High-Performance Servers

In order to have a credible presence on the WWW as a testbed library, it is critical that ADL have adequate server performance. This will become an increasingly important issue as the size of our collections grows, as users discover the value of spatially-referenced information, and as we provide more processing capabilities for accessed information. We believe that inadequate service would greatly harm the future of the Alexandria Project, and we have therefore decided to create a truly public presence with our testbed only when we have sufficiently powerful servers.

We are approaching this problem with a two-fold strategy. The first is to employ networks of workstations as parallel computing devices, an area in which we have made significant technical progress. The second is to acquire high-performance, off-the-shelf servers. In this regard, we are negotiating with hardware corporations in an attempt to acquire such devices before July 1, 1996.

The Need for A Mass Storage System

An item of some importance for the Project is a mass storage system. The system would provide much-needed storage for large image and digitized-map items. It would also provide an important research testbed for a variety of important issues, such as data placement.

The University has committed funds for a mass storage system for ADL, including $50K from the Library and $30K from the Office of the Vice-Chancellor.

We submitted a proposal in August 1995 to NSF's CISE program for a mass storage device that would have provided ADL with 2TB of storage. The University contribution was the match for a total system cost of $200K.

We now provide a summary of the request to indicate the nature of our needs and our justification:

We request help in acquiring a mass storage device. The device is essential for research in designing, implementing, and testing a digital library (DL) that supports massive amounts of spatially-indexed information.

The equipment is an AML/J robotic archive, with an Archive Management Unit, a 10-cartridge Insert/Eject Unit, and two DLT 4000 tape drives, modified to operate in an automated library environment. The archive is configured to store 100 DLT 4000 cartridges. Each DLT 4000 tape drive has a sustained transfer rate of 1.5 MB/sec, and each cartridge stores up to 20GB.

The two principal projects that the equipment will support are the Alexandria Digital Library (ADL) Project and the Computational Modeling Systems (CMS) Project. The goal of the ADL Project is to build a distributed DL for spatially-indexed materials. The goal of the CMS Project is to provide distributed computational support for data-intensive modeling activities in the Earth sciences. The two projects are distinct but complementary.

The device will be used to support the "real" research activities of several applied research projects, including the applied part of CMS. The usage patterns of these projects will provide critical information that will be used to formulate and resolve issues arising from the use of mass storage technology in DL contexts.

We believe that only by having access to a "real", end-to-end DL, with major collections of digital materials supported by appropriate mass storage, will we be able to both formulate and resolve many of the major research issues underlying DL operations.

Unfortunately, our request for funding was not successful.

We are currently looking for alternative sources of funding that will allow us to purchase a mass storage system. We have set a target of summer 1996 for the acquisition of such a system.

The Need for Higher-Speed Communications

While the communications links to ADL have been upgraded with the help of the University to a T1 link during the past year, the sizes of the information items that users will access and download from ADL undoubtedly will require much greater communications bandwidth.

There are two initiatives currently underway that could help with this issue. The first is the submission of a proposal to NSF that is being coordinated for DLI members and others by William Arms. The second is a proposal seeking major infrastructure support from Oak Ridge National Laboratory that would benefit UCSB as a partner. The infrastructure includes very high-speed communications links and high-performance computational support. ADL is a party to this proposal, and the most immediate benefit would be a high-speed communications link.


Last modified on 1996-02-27 at 18:19 GMT by the Alexandria Web Team