4 CURRENT STATUS OF TESTBED DEVELOPMENT
We now describe the main aspects of the development of our
testbed system in terms of the current version of the WP. We describe
the system first in relation to its general, four-component architecture;
we then describe in greater detail each of these four components;
and finally, we describe our activities in terms of two cross-cutting
technologies, namely image processing and parallel processing.
We note that this description involves a mixture of purely developmental
activities and research activities, since these two sets of activities
are essentially inseparable.
4.1 Architecture of the Testbed System
The current testbed system, which is a revised version of our
initial WP, is based on the four-component architecture illustrated
in Figure , which derives from the four major
components of a traditional library. This simple and robust architecture
has been of value for both the RP and WP systems.
The catalog component includes metadata and search
engines that permit users to identify holdings of interest. The
storage component contains the digital holdings, organized
into collections. The user interface supports graphic and
text-based access to the other ADL components and services. Librarians
use the ingest component to store new holdings, extract
metadata from the holdings, and add metadata to the catalog.
The WP architecture is a special case of the general architecture,
with differing languages and protocols at the component interfaces.
Figure also illustrates the languages and
protocols employed in the WP. Unlike the RP version of the ADL,
the WP storage and catalog components are distributed.
4.2 The Catalog Component
The catalog component of ADL permits users to map their information
requirements into the most appropriate set of information in the
collections of ADL. While a traditional library cataloging system
(author-title-subject) provides a basic model for a DL's catalog
component, it is inadequate for geographically-referenced holdings
such as maps and images. Catalogs for geographically-referenced
information must additionally support access to holdings in terms
of their representations, their spatial "footprints,"
and their contents.
DL technology greatly increases our ability to extract, store,
and search new classes of metadata about library holdings. A major
thrust of ADL activity is thus to extend current models of catalogs
and metadata. Also, to support catalog interoperability, the ADL
is employing standards to represent and exchange catalog information.
To meet these requirements, we developed a catalog schema for
the RP using elements from the USMARC and Federal Geographic Data
Committee (FGDC) metadata standards. We then expanded the schema
for the WP to include metadata supporting simple content-based
queries.
4.2.1 Basic Metadata: USMARC and FGDC Standards
The basic metadata for geographically-referenced information
in both the RP and WP systems combines elements from the USMARC
and Federal Geographic Data Committee (FGDC) metadata standards.
Since the late 1960s, USMARC has been a national standard
for database descriptions of library holdings. It includes fields
for cataloging analog geographic data, as well as open-ended "local-use"
fields that can be made to accommodate digital data. As of 1995,
the USMARC standard contains the full FGDC standard and, unlike
FGDC, includes thesauri (e.g., for subject headings).
USMARC stores all metadata for a given holding in a single record with four components: leader, record directory, control fields, and variable fields. This "flat" structure, while not optimal for a relational database, is useful for specifying metadata input/output functions and for exchanging metadata records between different DLs.
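To make this flat structure concrete, the following sketch (in Python, with invented field tags and values rather than actual ADL catalog content) models a USMARC record as a leader plus tagged control and variable fields, with the record directory enumerating the tags present:

# Minimal sketch of the "flat" USMARC record structure: a leader,
# tagged control fields, and tagged variable fields; the directory
# lists the tags present. Tags and values are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class USMARCRecord:
    leader: str                          # fixed-length record header
    control_fields: Dict[str, str] = field(default_factory=dict)
    variable_fields: Dict[str, List[str]] = field(default_factory=dict)

    def directory(self) -> List[str]:
        # The record directory simply enumerates the tags present.
        return sorted(self.control_fields) + sorted(self.variable_fields)

record = USMARCRecord(leader="00000nem a2200000 a 4500")
record.control_fields["008"] = "950101s1995    cau           eng  "
record.variable_fields["245"] = ["Hypothetical digitized map of Santa Barbara"]
print(record.directory())                # ['008', '245']

Because the structure is self-contained and flat, a record can be serialized and exchanged between DLs without reference to any particular database schema.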
The FGDC promotes the coordinated development, use, sharing,
and dissemination of surveying, mapping, and related spatial data.
Its metadata standard for digital geospatial data has been mandated
for use by all U.S. Federal agencies.
Relative to USMARC, the FGDC standard only provides definitions
for a small number of fields, and their logical relations in a
hierarchical structure. While these fields are adequate for cataloging
digital geospatial data, they do not accommodate analog spatial
materials. Moreover, the FGDC standard does not specify a particular
format or structure for metadata representation, resulting in
a variety of implementations and a lack of generic import/export
functions.
By combining the FGDC and USMARC standards, the ADL has been
able to catalog all forms of spatial data thus far encountered,
including remote-sensing imagery, digitized maps, digital raster
and vector datasets, text, videos, and remote WWW servers. The
metadata schema for the WP has approximately 400 fields, including
all FGDC fields and selected USMARC fields. To create the schema,
we converted the FGDC production rules and USMARC record hierarchy
into a single normalized entity-relationship (ER) data model,
from which the physical database schemata are generated automatically
by CASE tools.
4.2.2 The ADL Gazetteer
The WP catalog incorporates two major extensions to the combined
FGDC-USMARC metadata model, both supporting forms of content-based
search. The first extension allows digital image holdings to be
searched for occurrences of preselected image features, such as
textures, and is more fully described below.
The second extension allows ADL holdings to be retrieved
based on the relationship between the footprints of the holdings
and the footprints of named geographic features (such as cities,
rivers, and mountains). Lists of such features and their footprints
are commonly called gazetteers. In addition to feature
names and footprints, gazetteers include varying detail about
the type of each feature, often organized as a class hierarchy
(e.g., hydrographic_feature.watercourse.ephemeral_stream).
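The hierarchical feature typing can be illustrated with a short sketch (in Python; the entries, classes, and footprints below are invented): a query for a feature class matches all of its subclasses by prefix.

# Sketch of gazetteer entries with dotted feature-class hierarchies;
# a class-prefix query matches all subclasses. Data are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GazetteerEntry:
    name: str
    feature_class: str                   # dotted class hierarchy path
    footprint: Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

entries = [
    GazetteerEntry("Mission Creek", "hydrographic_feature.watercourse.ephemeral_stream",
                   (-119.75, 34.40, -119.68, 34.46)),
    GazetteerEntry("Lake Cachuma", "hydrographic_feature.lake",
                   (-119.99, 34.55, -119.90, 34.60)),
]

def by_class_prefix(prefix: str) -> List[GazetteerEntry]:
    # "hydrographic_feature.watercourse" matches ephemeral_stream entries too.
    return [e for e in entries if e.feature_class.startswith(prefix)]

print([e.name for e in by_class_prefix("hydrographic_feature.watercourse")])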
The ADL gazetteer is a hybrid of two large standard digital
gazetteers, maintained respectively by the USGS Geographic Names
Information System (GNIS) and the Board on Geographic Names (BGN).
The GNIS gazetteer contains about 1.8M names of US features, organized
hierarchically into 15 classes of features, while the BGN gazetteer
contains approximately 4.5M names of land and undersea features.
The ADL gazetteer is a union of the GNIS and BGN names and footprints,
and an intersection of their feature classes.
The ADL gazetteer is maintained in the ADL catalog database,
but is also available externally for search by the Excalibur (ConQuest)
semantic network text-retrieval engine. We have found the ConQuest
version of the gazetteer useful for "fuzzy" gazetteer
searches, where a user may not know the precise spelling or configuration
of a gazetteer feature (e.g., "Santa Barbara Airport"
versus "Santa Barbara Municipal Airport").
There are two significant research issues associated with
the use of gazetteers in the ADL. First, different gazetteers
use different terminologies and hierarchies to describe the same
features. So far we have been able to construct our "crosswalks"
between gazetteers by consulting their reference documents, but
this will not be possible with (for example) historical placename
lists.
A second issue is the nature of a named feature's footprint.
For example, what is the footprint of "Santa Barbara":
is it the city limits, the benchmark at City Hall, or
the county boundary? Existing gazetteers often provide only point
locations for both point and area features. It is often unclear
how the points are chosen, and whether they are centroids, corners,
or some arbitrarily chosen point. To complicate matters, there
is the issue of how to deal with useful named features that have
only "fuzzy" footprints (e.g., "Southern California",
"Sierra Nevada"). In cases like these, ambiguity and
fuzziness are inherent in a person's notion of the spatial extent
of a feature, and they are particularly difficult to specify.
4.2.3 Other Catalog Issues
As the ADL catalog grows, spatial indexing methods play an
increasingly important role in supporting footprint queries. We
are investigating various methods for indexing multidimensional
hierarchical data, such as footprints (see the paper by Kothuri
and Singh, 1995). In particular, we have extended B-trees to "IB-trees",
which accommodate objects that span a range of values (intervals)
rather than single values (points) in the data space. IB-trees
decompose d-dimensional data objects into d intervals, one per
dimension, and index the intervals in each dimension separately.
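The decomposition idea (though not the IB-tree's balancing itself) can be sketched as follows (in Python; the holdings and footprints are invented, and a flat list stands in for the per-dimension tree structure): each 2-D footprint splits into a longitude interval and a latitude interval, each dimension is searched separately, and a query intersects the per-dimension candidate sets.

# Illustrative decomposition behind the IB-tree idea: split each 2-D
# footprint into one interval per dimension, search the dimensions
# separately, and intersect the candidate sets. (A real IB-tree keeps
# each dimension in a balanced tree; flat lists are used for clarity.)
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

footprints: Dict[str, Box] = {
    "map_A": (-120.0, 34.0, -119.0, 35.0),   # hypothetical holdings
    "map_B": (-118.0, 33.0, -117.0, 34.0),
}

def overlaps(lo1, hi1, lo2, hi2) -> bool:
    return lo1 <= hi2 and lo2 <= hi1

def query(window: Box) -> List[str]:
    # Dimension 1: longitude intervals.
    lon_hits = {k for k, (x0, _, x1, _) in footprints.items()
                if overlaps(x0, x1, window[0], window[2])}
    # Dimension 2: latitude intervals.
    lat_hits = {k for k, (_, y0, _, y1) in footprints.items()
                if overlaps(y0, y1, window[1], window[3])}
    return sorted(lon_hits & lat_hits)        # intersect candidate sets

print(query((-119.5, 34.5, -119.2, 34.8)))    # ['map_A']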
Although the primary external interface to the ADL is the
WWW, the ADL catalog also supports a Z39.50 interface. Z39.50
is the standard online protocol for traditional library catalogs,
and is also the current standard search protocol for the National
Spatial Data Infrastructure (NSDI). The NSDI is being coordinated
by the FGDC as a collection of Z39.50 servers supporting queries
against FGDC-compliant metadata.
4.3 Collections
The focus of ADL to date has been on developing collections sufficient to demonstrate the functionality of the library; the collections are hence still relatively small.
We have loaded the following data sets:
We are currently loading a multi-gigabyte data set of ecological
information from the Sierra Nevada Project (SNEP).
4.4 The User Interface
The ADL's user interface enables:
The RP's UI, based on the GIS software package ArcView, supported
the first three of these functions. In the remainder of this section
we address issues that arose in migrating the ADL to the WWW environment,
and in adding significant functionality beyond the RP.
4.4.1 User Interface Issues
The ADL WP must operate within the following WWW limitations:
No WWW browser that we know of supports vector data display,
nor does HTML make any explicit provision for vector data. This
is a serious issue, given the large fraction of vector data in
the ADL collections. Vector data input, such as defining a geographic
search region, is possible only with great difficulty. A natural
procedure would be to draw a polygon on a base map by either
clicking on multiple points or clicking and dragging over a desired
region, but these actions are not supported by current WWW browsers,
which immediately send an HTTP request after a user input event
(e.g., a mouse click).
HTTP's statelessness hinders browsing and searching. By default,
once a server responds to a client's HTTP request, neither the
client nor the server retains any state or "memory"
of the transaction (other than perhaps logging the URL involved).
This makes it difficult to implement such essential features as
per-user configurations and iteratively-refined searches. To simulate
a stateful connection (e.g., a "session"), information
must be explicitly maintained by either the client (in parameters
stored in the URL or "hidden" HTML form variables) or
the server (in unique user identifiers and a session database).
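A minimal sketch of the server-side alternative (in Python; the handle scheme and stored fields are our own illustrative assumptions) keeps all state in a session table and gives the client only an opaque identifier to return with each request:

# Sketch of simulating a stateful "session" over stateless HTTP:
# the server keeps all state in a session table and hands the client
# only an opaque identifier. Field names and token scheme are assumptions.
import secrets

sessions = {}   # session handle -> per-user state

def open_session() -> str:
    handle = secrets.token_hex(8)            # opaque handle sent to the client
    sessions[handle] = {"config": {}, "last_query": None, "results": []}
    return handle

def refine_query(handle: str, new_clause: str) -> None:
    # Iterative refinement: each request builds on the stored query.
    state = sessions[handle]
    prev = state["last_query"]
    state["last_query"] = new_clause if prev is None else f"({prev}) AND ({new_clause})"

h = open_session()
refine_query(h, "theme = 'hydrography'")
refine_query(h, "date >= 1990")
print(sessions[h]["last_query"])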
To tailor its services to different levels of users, a UI
should be user-customizable, and be able to save a particular
configuration for use in future sessions. Additionally, a user
must be able to retrieve a particular data item or metadata record.
Since the WWW is part of the Internet, simple bulk retrieval via
FTP is straightforward to implement. As noted above, however,
the holdings in the ADL are often extremely large, so methods
that allow for the extraction and progressive transfer of relatively
small increments of data holdings are also required.
4.4.2 User Interface Implementation
Conceptually, the WP UI is a collection of HTML "pages"
implementing three major search capabilities:
The pages also include control/configuration and help/glossary links.
The UI is designed around a state transition model with each state
representing a WWW form or page, some of which include partial
or complete query results. The HTML code for the WP UI is generated
dynamically by approximately 15K lines of Tcl code running in
a NaviServer HTTP server.
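The state-transition organization can be sketched as follows (in Python rather than Tcl; the states and transitions are simplified stand-ins for the actual WP pages): each state names a page generator, and the transition table constrains which page may follow.

# Sketch of a state-transition UI: each state names a page generator,
# and transitions constrain which page may follow. The states below
# are simplified stand-ins for the actual WP forms and result pages.
def map_browser_page(ctx): return "<html>...map browser...</html>"
def gazetteer_page(ctx):   return "<html>...gazetteer form...</html>"
def catalog_page(ctx):     return "<html>...catalog query form...</html>"
def results_page(ctx):     return "<html>...query results...</html>"

PAGES = {"map": map_browser_page, "gazetteer": gazetteer_page,
         "catalog": catalog_page, "results": results_page}
TRANSITIONS = {"map": {"gazetteer", "catalog"},
               "gazetteer": {"map", "catalog"},
               "catalog": {"results"},
               "results": {"map", "catalog"}}

def next_page(current: str, requested: str, ctx: dict) -> str:
    if requested not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {requested}")
    return PAGES[requested](ctx)

print(next_page("map", "catalog", {}))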
The primary function of both the map browser and the gazetteer pages is to allow the user to define spatial extents or regions for catalog searches. The map browser allows these search regions to be defined explicitly (by zooming and panning a base map), while the gazetteer defines them implicitly (as the footprints corresponding to place names and feature types). Figure shows a screen dump of the map browser.
The visible portion of the map browser's base map (the display
window) is the default search footprint (the query window),
but this relationship can be modified (e.g., the user may specify
a subset of the display window, or may direct that the display
window be completely ignored). The base map is also the background
on which the gazetteer and catalog query result footprints are
drawn. The base map images are dynamically generated by a Common
Gateway Interface (CGI) application based on the Xerox PARC Map
Viewer [http://www.parc.xerox.com/map/], which we have modified
to support generic labeling, fast panning, and graphic overlays.
Gazetteer queries may interact with the map browser. For
example, if the current map browser query window contains the
USA but not Europe, then a gazetteer query with the place name
set to "Paris" (and the query window enabled) will return
Paris, Texas but not Paris, France. The map browser, in turn,
may be directed to reset the query window to the minimum bounding
geographic rectangle for the gazetteer query results.
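The Paris example can be expressed directly (in Python; the coordinates are approximate and purely illustrative): name matches are intersected with the enabled query window, and the minimum bounding rectangle of the results can be handed back to the map browser.

# Sketch of the gazetteer/map-browser interaction: name matches are
# filtered by the current query window, and the results' minimum
# bounding rectangle can reset the map browser. Footprints are rough.
gazetteer = {
    "Paris, Texas":  (-95.6, 33.6, -95.5, 33.7),
    "Paris, France": (  2.2, 48.8,   2.5, 48.9),
}

def in_window(box, window):
    return (box[0] <= window[2] and window[0] <= box[2] and
            box[1] <= window[3] and window[1] <= box[3])

def lookup(name: str, window):
    return {k: v for k, v in gazetteer.items()
            if name in k and in_window(v, window)}

usa_window = (-125.0, 24.0, -66.0, 50.0)      # contains the USA, not Europe
hits = lookup("Paris", usa_window)
print(list(hits))                              # ['Paris, Texas']

# Minimum bounding rectangle of the results, for the map browser.
boxes = list(hits.values())
mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
       max(b[2] for b in boxes), max(b[3] for b in boxes))
print(mbr)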
Query windows resulting from gazetteer-map browser interactions
are ultimately passed to the catalog page for incorporation into
catalog queries. In addition to geographic footprints, the catalog
page allows the user to search against any of the metadata fields
(such as theme, time, or author) in the ADL catalog, expressed
as textual or numeric values.
Catalog queries are assembled from user input into a generic
conjunctive normal form (CNF) representation, and then translated
to the specific query language (currently SQL) of the catalog
DBMS. Query results are converted to HTML tables, with hyperlinks
to browse images and online holdings. Query results are presented
incrementally, with a subset of the metadata fields displayed
initially and complete fields subsequently displayed for user-selected
holdings. The format and fields used in the query results are
completely user-configurable.
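The translation step can be sketched as follows (in Python; the clause representation and field names are illustrative assumptions, not the WP's internal form): a CNF query is a list of OR-clauses joined by AND, and each clause maps to a parenthesized disjunction in a parameterized WHERE clause.

# Sketch of translating a generic CNF query into a parameterized SQL
# WHERE clause. Field names and the triple format are assumptions.
cnf = [
    [("theme", "=", "hydrography"), ("theme", "=", "topography")],  # OR-clause
    [("year", ">=", 1990)],                                          # AND ...
]

def cnf_to_sql(cnf):
    params, clauses = [], []
    for or_clause in cnf:
        terms = []
        for field_name, op, value in or_clause:
            terms.append(f"{field_name} {op} %s")
            params.append(value)
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses), params

where, params = cnf_to_sql(cnf)
print(where)    # (theme = %s OR theme = %s) AND (year >= %s)
print(params)   # ['hydrography', 'topography', 1990]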
Queries may also return the footprints of ADL holdings, which may be displayed on the map browser base map. Unfortunately, it is common for many more footprints to be returned from a catalog query than can be shown intelligibly on the map browser's relatively small display. When footprints of multiple data holdings are displayed on the same map, it is difficult to distinguish which footprint is associated with which item. We continue to experiment with heuristics and visual aids (such as clustering and labeling) for disambiguating "crowded" footprints. In Figure we show examples of the browse graphics that may be returned as the partial results of a query.
The WP UI stores all user configuration parameters, query
statements, and current query result sets in a database separate
from the catalog, maintained by the NaviServer HTTP server.
This state information may also be stored on request on the client
side in "hidden" HTML form variables. This allows a
user to save an ADL "session" by using the browser's
save-page feature. The session may be restored by reloading the
saved page. Otherwise, state maintenance is handled entirely by
the server, with only a minimal opaque handle used on the client
side to identify the current session.
4.5 Image Processing
We are applying image processing technologies to a range of ADL issues. Image processing has implications for efficient storage, access, and retrieval of DL holdings.
Bandwidth and/or storage limitations often make it impractical
to retrieve a large image from a DL as a single item. Furthermore,
different users may desire different levels of image resolution.
A general solution to these problems is to maintain hierarchical,
multi-scale representations of image data. Our particular solution
is to employ wavelet transforms.
Wavelets have been widely used in many image-processing applications,
including compression, enhancement, reconstruction, and image
analysis. Fast algorithms exist for computing the forward and
inverse wavelet transforms, and desired intermediate levels can
be easily reconstructed. The transformed images (wavelet coefficients)
also map naturally into hierarchical storage structures.
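As a concrete illustration of the multi-scale idea, the following sketch (in Python, using the simple Haar transform for clarity; the WP's actual wavelet filters may differ) performs one level of 2-D decomposition, yielding a half-resolution approximation plus detail bands, and then decomposes the approximation again:

# Illustrative one-level 2-D Haar decomposition: the image splits into
# a half-resolution approximation plus horizontal/vertical/diagonal
# detail bands. Repeating on the approximation yields the multi-scale
# hierarchy. (Haar is used here for clarity only.)
import numpy as np

def haar2d(img: np.ndarray):
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    approx = (a + b + c + d) / 4.0        # low-resolution "thumbnail"
    horiz  = (a + b - c - d) / 4.0        # detail bands needed to
    vert   = (a - b + c - d) / 4.0        # reconstruct full resolution
    diag   = (a - b - c + d) / 4.0
    return approx, (horiz, vert, diag)

img = np.random.rand(512, 512)
level1, details1 = haar2d(img)            # 256 x 256 approximation
level2, details2 = haar2d(level1)         # 128 x 128 approximation
print(level1.shape, level2.shape)         # (256, 256) (128, 128)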
We are also applying image processing techniques to the general
problem of content-based access to DL holdings. Our current implementation
uses texture as a basis for describing and cataloging the
content of a library of images.
4.5.1 Browsing and Progressive Delivery
A useful property of the wavelet decomposition is that the
lowest-resolution components may be used as "thumbnail"
images for browsing. Experience with thumbnails in the RP convinced
us that they are invaluable for browsing through large numbers
of images and making rapid "go/no-go" evaluations. With
wavelets we can support a richer browsing model in which users
may zoom in on a given region until they have reached an acceptable
level of detail. Wavelet transformations support the progressive
delivery of these images, with rapid delivery of both the low-resolution
browse images and the incremental higher-resolution components.
Current WWW browsers cannot display wavelet data directly. The WP gets around this restriction with a customized "helper application" invoked by the client browser whenever it receives an image of MIME type "wavelet". The helper application retains the previously downloaded components, so that the WP UI need only transmit the "next" component in response to a request for higher-resolution data.
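The protocol can be sketched as follows (in Python; the component layout and request interface are simplified assumptions): the server stores the ordered wavelet components for a holding, and the helper application requests only the component beyond those it already holds.

# Sketch of progressive delivery: the server stores the ordered
# wavelet components for a holding, and the client-side helper asks
# only for the "next" component. The layout here is an assumption.
components = {"image42": ["thumbnail", "detail_1", "detail_2", "detail_3"]}

def next_component(holding: str, have: int):
    """Return component number `have` (0-based), or None if complete."""
    parts = components[holding]
    return parts[have] if have < len(parts) else None

# The helper application retains what it has downloaded so far:
downloaded = []
while (part := next_component("image42", len(downloaded))) is not None:
    downloaded.append(part)    # each step reconstructs a sharper image
    print("received", part)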
The helper application is not our long-term preferred solution
for wavelet display, since it requires us to make a locally-developed
executable program available for any possible ADL client hardware/software
environment. A better solution, which we are currently pursuing,
is to develop an inverse wavelet transform as an "applet"
in a portable language such as Java that can be downloaded into
a standard WWW browser such as Netscape.
4.5.2 Texture-Based Retrieval
Content-based retrieval is critical for accessing large collections
of digital images. The ADL is investigating the use of texture
as a basis for content-based search (see the paper by Ma and Manjunath,
1995), initially by adding catalog indices based on image texture
features. Specifically, texture information is extracted from
images as they are ingested, using banks of Gabor (modulated Gaussian)
filters. This is roughly equivalent to extracting lines, edges,
and bars from the images, at different scales and orientations.
Simple statistical moments (e.g., mean and standard deviation)
of the filtered outputs are then used for matching and indexing
images.
The WP catalog includes a database of texture templates, which
can be matched against actual textures extracted from ADL collection
holdings. One example of the class of accesses enabled by this
information is initiating a search by choosing an image region.
The region's texture will be used to retrieve matching texture
templates, which in turn reference the ADL holdings in which they
occur.
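The extraction and matching steps can be sketched as follows (in Python; the filter parameters, feature vector, and distance measure are illustrative choices rather than the ADL's exact configuration): each region is filtered by a small bank of Gabor (cosine-modulated Gaussian) kernels, and the mean and standard deviation of each response form the feature vector used for matching.

# Sketch of texture features from a small Gabor filter bank, plus
# nearest-template matching. Parameters and data are illustrative.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, sigma=2.0, wavelength=4.0, size=15):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))  # Gaussian envelope
    carrier = np.cos(2 * np.pi * xr / wavelength)         # modulation
    return envelope * carrier

def texture_features(region: np.ndarray) -> np.ndarray:
    feats = []
    for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):  # 4 orientations
        response = convolve2d(region, gabor_kernel(theta), mode="same")
        feats += [response.mean(), response.std()]          # simple moments
    return np.array(feats)

# Matching: nearest template by Euclidean distance in feature space.
region = np.random.rand(64, 64)
templates = {"parking_lot": np.random.rand(64, 64),
             "orchard": np.random.rand(64, 64)}
q = texture_features(region)
best = min(templates,
           key=lambda k: np.linalg.norm(q - texture_features(templates[k])))
print(best)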
Figure is an example of browsing aerial
photographs using subsampled (reduced resolution) versions of
the image (on the left) and retrieval using query patterns. Users
can browse through the thumbnails (top left) of the air photos,
each of which is about 5K x 5K pixels. A more detailed (1/10th resolution)
image is shown on the right. Subregions can be selected from this
larger image and similar looking patterns can be retrieved. The
figure shows an example in which a parking lot is selected along
with the dictionary code word and the retrieved patterns.
4.6 Parallel Processing
The Alexandria Project is investigating parallel computation
(see the paper by Andresen, Egecioglu, Ibarra, and Poulakidas,
1995) to address various performance issues, including multiprocessor
servers, parallel I/O, and parallel wavelet transforms, both forward
(for image ingest) and inverse (for efficient browsing of multi-scale
images).
We have developed a prototype parallel HTTP server containing
a set of collaborative processing units, each of which is capable
of handling a user request. The distinguishing feature of the
server is resource optimization based on close collaboration of
multiple processing units. Each processing unit is a workstation
(e.g., a SUN SPARC or a Meiko CS-2 node) linked to a local disk.
The disks are NFS-mounted to all processing units. Resource constraints
affecting the performance of the server are:
Actively monitoring the CPU, disk I/O, and network loads
of the system resource units, and then dynamically scheduling
incoming HTTP requests to the appropriate node, keeps the server's
performance relatively insensitive to request load, while allowing
it to scale upwards with additional resources. In simulations,
the round-trip total response time in seconds is improved significantly
with the use of multiple processing units, and does not change
significantly when the request rate increases, even into the range
of 5 to 30 million requests per week.
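The scheduling policy can be sketched as follows (in Python; the load metrics and weights are illustrative): each node reports its monitored loads, and an incoming request is routed to the node with the lowest weighted combination.

# Sketch of load-based request scheduling: each node reports CPU,
# disk, and network load, and an incoming request goes to the node
# with the lowest combined load. Metrics and weights are illustrative.
nodes = {
    "node1": {"cpu": 0.80, "disk": 0.30, "net": 0.20},
    "node2": {"cpu": 0.20, "disk": 0.10, "net": 0.40},
    "node3": {"cpu": 0.50, "disk": 0.60, "net": 0.10},
}

def combined_load(load, w_cpu=0.5, w_disk=0.3, w_net=0.2):
    return w_cpu * load["cpu"] + w_disk * load["disk"] + w_net * load["net"]

def schedule(request_id: str) -> str:
    # Dynamic choice based on monitored load, rather than round-robin.
    target = min(nodes, key=lambda n: combined_load(nodes[n]))
    print(f"request {request_id} -> {target}")
    return target

schedule("GET /catalog?...")   # node2 under the loads above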
We have observed similar speedups using a multi-node server
when varying the size of retrieved image files (typical ADL holdings).
Since the computational and I/O demands of requests to the ADL
vary dramatically for large images and complex metadata, the load-balancing
approach offers a 20% to 50% improvement in performance over a
simple round-robin approach.
4.7 Computing Support for the Testbed: Hardware and Communications
Our current computing support has been adequate for our recent
research and development needs. As we move into the next phase
of the Alexandria Project, however, it is clear that we need to
resolve two sets of needs relating respectively to storage and
to servers, as well as considering another upgrade of the communications
facilities at UCSB.
We first describe our current equipment and then describe
our equipment needs.
4.7.1 Current Computing Equipment
We have installed and are using 4 DEC Alpha workstations
and 2 microcomputers. The Alphas were obtained with major discounts
from our partner, DEC. These machines, together with existing
machines, should be adequate for most of our development needs
over the next 6 months. Our current computing equipment inventory
includes:
4.7.2 Current Networking Support
Our current networking facilities include:
4.7.3 Current Storage Support
Our current storage facilities include:
We also have an allocation of at least 1 TB of tertiary storage
at San Diego Supercomputer Center.
4.7.4 Current "Other" Hardware and Software Support
Our other hardware facilities include
4.7.5 Equipment and Facilities Needs
As noted above, mass storage and high-performance servers
are two critical areas in which we need to acquire further hardware
support.
Need for High-Performance Servers
In order to have a credible presence on the WWW as a testbed
library, it is critical that ADL have adequate server performance.
This will become an increasingly important issue as the size of
our collections grows, as users discover the value of spatially-referenced
information, and as we provide more processing capabilities for
accessed information. We believe that inadequate service will
greatly harm the future of the Alexandria Project and have therefore
decided that we will only create a truly public presence with
our testbed if we have sufficiently powerful servers.
We are approaching this problem with a two-fold strategy.
The first is to employ networks of workstations as parallel computing
devices. We have made significant progress in the development
of the appropriate technology. The second is to acquire high-performance,
off-the-shelf servers. In this regard, we are negotiating with
hardware corporations in an attempt to acquire such devices before
July 1, 1996.
The Need for A Mass Storage System
An item of some importance for the Project is a mass storage
system. The system would provide us with much needed storage for
large image and digitized-map items. It would also provide an
important research testbed for a variety of issues, such
as data placement.
The University has committed funds for a mass storage system
for ADL, including $50K from the Library and $30K from the office
of the Vice-Chancellor.
We submitted a proposal in August 1995 to NSF's CISE program
for a mass storage device that would have provided ADL with 2TB
of storage. The University contribution was to serve as matching
funds toward a total system cost of $200K.
We now provide the summary of the request to indicate the
nature of our needs and our justification:
We request help in acquiring a mass storage device. The
device is essential for research in designing, implementing, and
testing a digital library (DL) that supports massive amounts of
spatially-indexed information.
The equipment is an AML/J robotic archive, with an Archive
Management Unit, a 10-cartridge Insert/Eject Unit, and two DLT
4000 tape drives, modified to operate in an automated library
environment. The archive is configured to store 100 DLT 4000 cassettes.
Each DLT 4000 tape drive has a sustained transfer rate of 1.5
MB/sec, and each cartridge stores up to 20GB.
The two principal projects whose research the equipment will
support are the Alexandria Digital Library
(ADL) Project and the Computational Modeling Systems (CMS) Project.
The goal of the ADL Project is to build a distributed DL for spatially-indexed
materials. The goal of the CMS Project is to provide distributed
computational support for data intensive modeling activities in
the Earth sciences. The two projects are distinct but complementary.
The device will be used to support the "real"
research activities of several applied research projects, including
the applied part of CMS. The usage patterns of these projects
will provide critical information that will be used to formulate
and resolve issues arising from the use of mass storage technology
in DL contexts.
We believe that only by having access to a "real",
end-to-end DL, with major collections of digital materials supported
by appropriate mass storage, will we be able to both formulate
and resolve many of the major research issues underlying DL operations.
Unfortunately, our request for funding was not successful.
We are currently looking for alternative forms of funding
that will allow us to purchase a mass storage system. We have
set a target of summer 1996 as a time for the acquisition of such
a system.
The Need for Higher-Speed Communications
While the communications links to ADL have been upgraded
with the help of the University to a T1 link during the past year,
the sizes of the information items that users will access and
download from ADL will undoubtedly require much greater communications
bandwidth.
There are two initiatives currently underway that could help
with this issue. The first is the submission of a proposal to
NSF that is being coordinated for DLI members and others by William
Arms. The second is a proposal, led by Oak Ridge National Laboratory,
that seeks major infrastructure support and would
benefit UCSB as a partner. The infrastructure includes very high-speed
communications links and high-performance computational support.
ADL is a party to this proposal, and the most immediate benefit
would be a high-speed communications link.