5 RESEARCH ACTIVITIES AND PROGRESS
As noted previously, it is virtually impossible to separate in a meaningful manner the progress in our research activities from the progress of our development activities, since the two sets of activities are highly coupled. In particular, much of the output of research efforts has taken the form of implementations in the various testbed systems.
While some of the results of our research were described above
in relation to testbed development, we now summarize our main
research results in terms of the activities of each of the research
and development teams.
Before describing our research activities, it is important
to note that a major contribution of our research activities
in the first full year of the Project was the identification of
important research issues. This output of our research is laid
out in our plans for research and development in the Annual Program
Plan (see below).
5.1 LIBRARY TEAM
Membership: M. Goodchild (leader), Carver, Geffner,
Gottsegen, Kemp, Kothuri, Larsgaard, Simpson, Smith
The Library Team is responsible for investigating a variety
of issues relating to the nature of the ADL collection and to
its characterization in terms of metadata in the catalog component.
These issues include important problems relating to the integration
of spatially referenced information objects into ADL. Subteams
of the Library Team are investigating issues relating to the design
and construction of an "Alexandria Atlas" and issues
relating to the nature and representation of metadata, as well
as catalog interoperability.
User requirement specifications
The Team examined the specification of user requirements
for a digital library providing access to spatial materials. An initial
concern was ensuring that the WP (Web Presence) preserved as much of the
functionality of the RP (Rapid Prototype) as possible, while extending
it significantly. As well
as a specification of user requirements for a gazetteer, the Team
investigated other requirements concerning user search, and in
particular, the expansion and narrowing of search in relation
to the definition of themes. This research was stimulated by work
in other DLI projects to investigate sources of information that
can be used to build networks and spaces that capture the relationships
between themes and feature types, and that might support search
expansion. Expansion and narrowing were also investigated in a
geographic context based on geographic hierarchies.
Research on gazetteers for ADL
An important requirement determined by the Team concerned the
addition of a second method for specifying the locational component
of a query: namely through the use of a gazetteer and named places.
Hence the specification of user requirements included those for
the functionality and design of a gazetteer (defined as an index
connecting names of features to geographic locations). A survey
of existing digital gazetteers was undertaken as a basis for constructing
a prototype digital gazetteer for the Web Presence. Basic issues
researched included: defining the gazetteer; examining the suitability
of existing sources; determining appropriate extensions of the
concept of a gazetteer in a DL context; models of feature extent
that can be used to drive base-map displays and queries; and hierarchical
structures that can be coded into digital gazetteers to enhance
functionality. A draft specification for a gazetteer implementation
was developed.
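To make the gazetteer concept concrete, the following minimal sketch
(in Python, with illustrative field names and a single hypothetical
entry, not the ADL schema) shows a name index that carries a
bounding-box extent and a hierarchical parent link of the kind
discussed above.

# Minimal gazetteer sketch: names -> footprints, with hierarchy.
# All names and fields are illustrative, not the ADL design.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GazetteerEntry:
    name: str
    feature_class: str            # e.g. "populated place", "river"
    footprint: tuple              # (min_lon, min_lat, max_lon, max_lat)
    parent: Optional[str] = None  # hierarchical containment, e.g. state

entries = {
    "Santa Barbara": GazetteerEntry(
        "Santa Barbara", "populated place",
        (-119.85, 34.39, -119.64, 34.46), parent="California"),
}

def locate(name):
    """Translate a place name into a footprint usable as a spatial query."""
    e = entries.get(name)
    return e.footprint if e else None

print(locate("Santa Barbara"))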
Several major sources of digital gazetteers were found to be deficient
in two areas: lack of information on the physical extents of features
that can be used to help define the area of search; and lack of
features that are important for querying but not well enough defined
for traditional purposes. These "fuzzy" features include
informally defined geographic areas such as neighborhoods and
regions. The Team investigated a range of compromises that would
allow the Web Presence prototype to offer functionality in these
areas.
Fuzzy footprints
In relation to the research on gazetteers, the Team investigated
the concept of fuzzy footprints. In particular, the concept was
investigated by Dan Montello from the perspective of methods for
eliciting geometric definitions of fuzzy regions from human subjects;
methods for storing geometric definitions in digital form; and
methods for executing queries based on fuzzy regions.
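As a purely illustrative sketch of one storage-and-query option, not
a description of Montello's methods, a fuzzy region elicited from
subjects might be stored as a grid of membership grades and queried
at a chosen membership threshold:

# Hypothetical 5x5 membership grid for an informal region ("downtown");
# values in [0, 1] express degree of belonging to the fuzzy region.
import numpy as np

membership = np.array([
    [0.0, 0.2, 0.3, 0.2, 0.0],
    [0.2, 0.6, 0.8, 0.6, 0.2],
    [0.3, 0.8, 1.0, 0.8, 0.3],
    [0.2, 0.6, 0.8, 0.6, 0.2],
    [0.0, 0.2, 0.3, 0.2, 0.0],
])

def in_region(cell, alpha=0.5):
    """Is a grid cell inside the region at membership level alpha?"""
    return membership[cell] >= alpha

print(in_region((2, 2)), in_region((0, 0)))  # True False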
Survey of standards and protocols
A survey of standards and protocols impacting Alexandria was undertaken.
A document detailing such standards and protocols is being revised
on a continuing basis.
Survey of Web applications for geospatial data
A comprehensive survey of existing Web applications in the general
area of geospatial data was developed.
User/librarian discourse analysis
The Team investigated methods for capturing knowledge about the
discourse between user and librarian, as a source of information
on likely queries for the Information Systems Team, and as a basis
for user interface design for the Interface Design and Evaluation
Team. A specification document was developed.
5.1.1 METADATA AND CATALOG INTEROPERABILITY SUBTEAM
Membership: Smith (leader), Geffner, Gottsegen, Gritton,
Hill, Larsgaard
A relatively new subteam of the Library Team has been investigating
two issues of major importance for the ADL catalog. The first
concerns the construction of fundamental models of catalog metadata
and the second concerns the development of models for catalog
interoperability in terms of metadata exchange.
Knowledge representation and a general model of metadata
The subteam has conducted a survey of knowledge representation
languages to determine their suitability for representing and
exchanging metadata in DL catalogs. The subteam has constructed
a general model of metadata based upon the knowledge representation
schema of representational structures, and has produced
an initial catalog design based upon this model. The initial investigations
of this subteam led to a successful proposal to the Central Imagery
Office (CIO) to examine both general models for
metadata and the semantic exchange of metadata in greater depth over
the next three years.
5.1.2 ALEXANDRIA ATLAS SUBTEAM
Membership: Carver (leader), Frew, Goodchild, Kemp,
Larsgaard, Simpson, Smith
Another relatively new subteam of the Library Team has been
investigating the design and functionality of an "atlas"
that would support graphical/geographical access to library materials,
in a manner that greatly generalizes our current map browsers.
Design and construction of an "Alexandria Atlas"
The Team has defined the requirements and functionality of an
"atlas" that would act as a graphical/geographical interface
supporting direct access to a large variety of materials by geographical
reference. As well as identifying requirements, the team has acquired
selected datasets, identified high-level tools, and begun drafting
an "electronic atlas requirements" document.
The datasets already acquired are: the Digital Chart of the World,
the Vector Shoreline Dataset, the Digital Line Graphs for the
United States, and the three-arc-second Digital Elevation Models
for the United States. Other datasets are being sought. The atlas
gazetteer is close to completion and has more than 6.5 million
entries. A live link between the map interface and the gazetteer
will be constructed as part of the implementation. Extents from
the above vector datasets will also be added to the gazetteer
database so that gazetteer features may be accessed directly from the
map interface. The design for this component is underway. A draft
requirement for atlas functionality is near completion. Work assignments
will be made when this document is complete.
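A hedged sketch of the planned live link follows: the map interface
passes its current view rectangle to the gazetteer, which returns
the features whose stored extents intersect it. All names and
coordinates below are illustrative.

# Live link sketch: map view rectangle -> intersecting gazetteer features.
def intersects(a, b):
    """Axis-aligned bounding-box intersection test.
    Boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

gazetteer = [
    ("Santa Barbara", (-119.85, 34.39, -119.64, 34.46)),
    ("Lake Tahoe",    (-120.20, 38.90, -119.90, 39.25)),
]

def features_in_view(view_box):
    return [name for name, extent in gazetteer if intersects(extent, view_box)]

print(features_in_view((-120.0, 34.0, -119.0, 35.0)))  # ['Santa Barbara']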
5.2 INTERFACE DESIGN AND EVALUATION TEAM
Membership: Montello (co-leader, UCSB), Buttenfield
(co-leader, Colorado), Carver, Dillon, Dolin, Green, Kumler, Larsen,
Larsgaard, Nishikawa, Rae, Simpson
The main function of the Team is to investigate issues relating
to the design of the user interface, the functionality of the
system, and the evaluation of the system from the users' point
of view.
The original Interface Design Team and the User Evaluation
Team were combined to form the Interface Design and Evaluation
Team during the preceding year. Although they operated as
two separate teams for over half of the year, we have combined the
descriptions of their research activities into a single account.
Survey of Commercial GUI development packages
The Team conducted a detailed survey and comparative study of
many commercial GUI development packages. The purpose of the study
was to: formulate a set of evaluation criteria; determine suitable
metrics for evaluation; evaluate a number of GUI development packages
to determine their strengths and weaknesses and their suitability for
adoption in the Alexandria project; and to provide a recommendation
and justification of the development platforms. A report was prepared.
User interface requirements
The Team investigated the issue of user interface requirements
in four categories (conceptual, functional, operational, and
developmental) and formulated an ADL interface requirements document.
The document is based on the requirements and functionality of the RP,
which it includes as core functionality. The WP design was checked
against the requirements document to ensure that all functionality
requirements were met.
Concepts and constructs for user-defined browsing
The team has investigated and developed initial concepts and constructs
for user-defined browsing activities. Unlike many existing ad
hoc approaches, a framework has been designed that unifies both
information retrieval (queries) and presentation (browsing as
a special case) functionalities. In particular, the framework
supports incremental query construction, together with automated
assistance in specifying some query parameters. The framework
is being integrated with the constructs developed in the data
modeling aspect into a concrete data model, suitable for Alexandria
collections and operations, as well as a more general class of
applications involving spatial data.
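As a minimal sketch of what incremental query construction with
assisted parameter specification could look like (the class and
defaults below are our own illustrative choices, not the actual
framework):

# Incremental query construction: constraints accumulate one at a
# time, with assisted defaults for parameters the user leaves open.
class IncrementalQuery:
    DEFAULTS = {"max_results": 50, "order_by": "relevance"}  # assisted defaults

    def __init__(self):
        self.constraints = dict(self.DEFAULTS)

    def refine(self, **constraint):
        """Add or narrow one constraint; returns self for chaining."""
        self.constraints.update(constraint)
        return self

q = (IncrementalQuery()
     .refine(theme="hydrography")
     .refine(footprint=(-120.0, 34.0, -119.0, 35.0)))
print(q.constraints)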
Concepts for an ADL OO data model
The team has investigated a conceptual data model based on an
object-oriented paradigm for the Alexandria collections. It integrates
metadata and data, and is conformant with the hierarchical structure
of many collections. To accommodate any diversity in the underlying
coordinate system of spatial data collections (e.g., non-geospatial
data), the model allows a partition of objects into a set of "worlds,"
each supporting its own coordinate system. Initial designs of
the associated query language feature flexibility, ease of use,
and high expressive power. The query language supports incremental
query construction, together with automated assistance in specifying
some query parameters.
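The following fragment illustrates, under our own naming assumptions
rather than the actual ADL schema, how the partition of objects into
"worlds" with per-world coordinate systems might be expressed:

# "Worlds" sketch: each world carries its own coordinate system, and
# every object belongs to exactly one world.
from dataclasses import dataclass, field

@dataclass
class World:
    name: str
    coordinate_system: str        # e.g. "geographic (lon/lat)", "pixel"
    objects: list = field(default_factory=list)

@dataclass
class InfoObject:
    title: str
    footprint: tuple              # interpreted in the world's coordinates
    metadata: dict

earth = World("Earth", "geographic (lon/lat)")
scan = World("HerbariumSheet-17", "pixel")   # a non-geospatial world
earth.objects.append(InfoObject("DEM tile", (-120, 34, -119, 35),
                                {"scale": "3 arc-second"}))
scan.objects.append(InfoObject("specimen label", (10, 10, 200, 60), {}))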
Model of a "virtual library"
The Team constructed a "virtual" library model that
supports uniform access to (possibly foreign) library collections
on the Web and permits spawning of advanced UI features as needed.
The virtual library model includes provisions for the construction
of personal catalogs that refer to an individual's items of greatest
interest.
Redesign of Original User Evaluation Plan
Following the February site visit at UCSB, reviewer comments indicated
concern about the intended user evaluation program. Specific points
were raised about creating a separate version of the Rapid Prototype
to test user interface modules. The site evaluation team suggested
direct user testing of the UNIX version be undertaken. After discussion,
we accepted this position, discarding the original approach. Much
of the spring activity involved an entire revamping of the user
evaluation plan, and expansion of the User Evaluation Team at
UCSB. We feel that the redesign provides a much more robust user
evaluation program.
The revised user evaluation effort is a multi-phased approach
integrating and expanding upon library requirements research and
on quantitative paradigms commonly applied in other disciplines
(for example, Education, Human-Computer Interaction, and Information
Science). These paradigms include real-time transaction logging
of system use, online user surveys and tutorials for the Alexandria
software, and embedding user-activated buttons into the Alexandria
testbed to annotate specific events in user sessions. All three
types of tools have been successfully pre-tested in Buffalo and
the more informative tools are being implemented in the Web version.
At UCSB, an ethnographic study of library patron behavior has
begun at the Map and Imagery Library (MIL). Details on these efforts
are described below. New members have joined the User Evaluation
Team including two key personnel at Santa Barbara, and three key
personnel at the University of Colorado-Boulder (CU).
In December, User Evaluation Team Leader Barbara Buttenfield moved
from Buffalo to Colorado, taking the Buffalo subcontract of the
Alexandria Project with her. CU has provided temporary lab space
while construction of permanent lab space is completed. CU has
also met and exceeded the match originally offered by SUNY-Buffalo,
including a reduced teaching load. Colorado granted Dr. Buttenfield
the first semester on leave with full pay to ensure that project
momentum is not hampered by the move. Other matching funds from
CU include two graduate research assistant positions for the duration
of the Project, laboratory space and equipment. The CU Library
is providing partial release time for the Map Librarian to work
on Alexandria, and will purchase a small amount of computer equipment
to support Alexandria testing sites in one or more Libraries on
campus.
Interactive Transaction Logging
The Buffalo team coded interactive transaction logging functions
and initially embedded them in the UNIX Rapid Prototype running locally
at Buffalo and at UCSB. Following several months of subject pre-testing,
the transaction logs have been refined, and we have begun to implement
them in the Web testbed this winter. The logs record and timestamp
the sequence of specific icons and tools called by the user. The
transaction log is not tied to screen pixels but to system commands
and objects on the screen, thus capturing a higher level of user
behavior than proposed a year ago. We additionally log which Alexandria
windows are active, and record the names of archived image files
as they are opened. We created a brief tutorial to guide new users
through the library, and can inspect user logs to determine whether
the tutorial is being used and what the patterns of use and of use
error are. For example, we determined that many users are
confused in using the selection pad. The transaction logs show
users "clicking twice" on thumbnail icons, applying
the Macintosh metaphor (clicking twice on an icon to open the
file associated with that icon). The user interface utilized a
separate menu tool for opening files, and transaction logs showed
users consistently repeating the double click sequence in lieu
of the correct menu tool. Thus we discovered a specific place
to streamline the interface design.
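A small sketch of the kind of record such a transaction log produces
(event names and fields here are illustrative, not the Buffalo team's
actual format):

# Transaction log sketch: timestamped system-level events (tool
# invoked, window activated, file opened), not raw screen pixels.
import time, json

LOG = []

def log_event(kind, target, **extra):
    LOG.append({"t": time.time(), "kind": kind, "target": target, **extra})

log_event("tool", "selection_pad")
log_event("tool", "selection_pad")            # the "double click" pattern
log_event("window", "map_browser", state="active")
log_event("file", "ortho_34119.tif", action="open")

print(json.dumps(LOG, indent=1))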
Results of early pretests indicated that users would utilize cognitive
affect buttons in the Alexandria menu. We embedded three such
buttons initially: a Good button, a Bad button, and a NotePad
button. Users were instructed in the tutorial to click on the
affect buttons when they particularly liked or disliked something
about the interface, and (optionally) to insert a comment annotating
their opinion. Use of the affect buttons is included in the transaction
log; thus we can monitor where in a sequence of events a user
is delighted, confused, or frustrated by the system interface,
network response, data holdings, and so forth. The cognitive affect
buttons are currently embedded in the Alexandria Web testbed.
Ethnographic Studies of Library Patron Behavior
Judith Green has begun an ethnographic analysis of library users.
We see the work as a formative evaluation intended to inform the
other teams as they revise and redefine the interface and the system.
Three analyses have been undertaken.
First, the team has analyzed the demonstration protocol. We identified
issues of accessibility to the language and the content of the
information currently on the web. We have identified "insider"
language, used by members of the ADL culture, that is not accessible
to "outside" audiences of actual and potential ADL users.
We did this by giving a group of potential users the
material and asking them to identify terms that were problematic
or strange to them, concepts or phrasing that they found troublesome
as readers, and information that they needed. This information
will be given to the development team. The study, called ADL-Speak,
has ten participants. We
have reported some of our findings to the users group already
and they have used them as scenarios to begin to revise the front
matter to make it more user friendly. The outcome of this study
was discussion of and agreement on the need to include "hot
buttons" to allow people access to a glossary and the need
for information that will provide an overview of capacity for
the user.
Second, the team is collecting data on users with different levels
of expertise, transcribing their comments as they think aloud
about what they are doing during their search, and creating
tapes that illustrate the problems. The tapes will provide feedback
to the user group. We transcribe the tapes to identify problems,
successes, and strategies needed in building an accessible interface.
The first tape was a two-hour session with someone who has knowledge
of the web and web searching but not knowledge of ADL. This tape
provided insight into problems of access, needed areas of information,
and issues of needed tutorials. This tape will provide input into
the development of the tutorials. We are currently collecting,
transcribing, and analyzing data on knowledgeable users: members
of the development team and a reference librarian. These tapes
will allow us to identify insider knowledge and what sophisticated
users understand, expect, and do as they search. The contrast
across user groups will provide a basis for identifying insider
knowledge so that we can make the library content accessible to
a range of users.
The third study concerns the reference librarians and how they conduct actual reference interviews. The librarians are taping actual reference interviews for us. These will be transcribed and analyzed, and the findings used to inform the development of the interface and the work of the library access group (UIE). This phase is just beginning.
We plan to bring in a broad range of users, from grade 5 through
adults, in order to develop an interface and system that supports
access for this full range.
Preparation of a CD-ROM version of the Rapid Prototype
A different research activity has been supported by efforts at
ESRI, one of our corporate partners. ESRI ported a subset version
of the Alexandria Rapid Prototype over to a Windows platform,
to capture the user audience lacking access to UNIX (as in many
public libraries and elementary schools). The Windows version
was burned onto CD-ROM in the Fall. The Buffalo Team designed a
Windows version of the UNIX tutorial and a questionnaire, which
were included on the CD-ROM. Unfortunately, no interactive logging
was included in the CD-ROM version, as several ARC VIEW commands
available on UNIX are not available on the Windows port. UCSB
and Buffalo collaborated to compile a list of names for roughly
2500 copies of the CD to be distributed across the country. The
CDs went out in late Fall, and to date we have received a few
dozen responses by regular mail, and many users have sent comments
and questions to the Buffalo electronic mail account.
5.3 INFORMATION SYSTEMS TEAM
Membership: El Abbadi (leader), Agrawal, Frew, Kothuri,
Prakhabar, Singh, Smith, Su, Wu
Taxonomy of user queries
The Team investigated a taxonomy of user queries.
Survey and evaluations of data models
The Team investigated and evaluated the suitability of a variety
of existing data models in the relational and object-oriented
paradigms that provide support for spatial data. The Team also
examined associated query languages. An investigation was completed
on a generalized relational model which uses geometrical constraints
to provide a finite representation for infinite sets of points
in space (lines, regions, polyhedra). In evaluating query languages
for this model, the Team completed a study of the use of aggregation
operators to compute areas and volumes.
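To illustrate the constraint idea, a triangular region (an infinite
point set) can be represented finitely by three linear inequalities,
with membership decided by evaluating them; this minimal sketch is
ours, not the model under investigation:

# A region as linear constraints a*x + b*y <= c: here the triangle
# x >= 0, y >= 0, x + y <= 1, a finite encoding of infinitely many points.
REGION = [
    (-1.0, 0.0, 0.0),
    (0.0, -1.0, 0.0),
    (1.0, 1.0, 1.0),
]

def contains(x, y, constraints=REGION):
    return all(a * x + b * y <= c for a, b, c in constraints)

print(contains(0.25, 0.25), contains(2.0, 2.0))  # True False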
Suitability of O2 OODBMS
The Team investigated the suitability of the O2 OODBMS as a platform
in which these data structures can be implemented. The goal of the
investigation is to examine the suitability of implementing the
Alexandria catalog in O2 and of integrating it with the planned
Web server for Alexandria.
The reason for this approach is twofold. First, the metadata standards
are intrinsically amenable to object-oriented modeling. Second, an OODBMS implementation
of the Alexandria catalog will provide the opportunity to deal
with multiple as well as heterogeneous servers. The Team has completed
a preliminary object-oriented modeling of the FGDC and USMARC metadata
standards for spatial data and investigated the implementation
of the OO schema in O2.
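As a loose illustration of what an object-oriented rendering of a few
FGDC-style elements might look like (class and attribute names are
hypothetical, and no O2-specific syntax is shown):

# Hypothetical OO rendering of a few FGDC-style metadata elements.
from dataclasses import dataclass

@dataclass
class SpatialDomain:
    west: float
    east: float
    north: float
    south: float

@dataclass
class Identification:
    title: str
    originator: str
    abstract: str

@dataclass
class FGDCRecord:
    identification: Identification
    spatial_domain: SpatialDomain

rec = FGDCRecord(
    Identification("DEM, Goleta quadrangle", "USGS", "3 arc-second elevations"),
    SpatialDomain(west=-120.0, east=-119.875, north=34.5, south=34.375))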
Unifying browsing and retrieval
The Team investigated various concepts relating to user-defined
browsing activities. The goal was to find a framework that unifies
information retrieval (queries) and presentation (with browsing as
a special case) and that supports incremental, assisted query
specification. The framework was integrated with the constructs
developed in the data modeling investigations into a concrete
data model, suitable for Alexandria collections and operations,
as well as for a more general class of applications involving spatial
data. This model supports the notion that browsing is simply a
display of totally ordered elements.
Search of gazetteer
The Team investigated the gazetteer in relation to the issue
of providing rapid access to collection items that contain named
instances of specific classes of features. The gazetteer for the
WP was initially implemented in an RDBMS (SYBASE). Although the
translation from exact feature names to geographic locations is
fast, SYBASE provided limited functionality ("like"
predicates and "soundex" function) to deal with fuzziness
in feature names. But this limited functionality is either too
slow ("like" predicates) or returns too much unnecessary
information ("soundex"). To effectively deal with fuzziness
in query specification, the gazetteer is now based on the text-processing
package ConQuest. The initial experience of the Team is that this
provides much better performance.
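The trade-off can be illustrated in miniature: exact lookup in a
hash index is fast but brittle, while fuzzy matching tolerates
misspellings at extra cost. The sketch below uses the Python
standard library's difflib as a stand-in for ConQuest-style text
processing:

# Exact vs. fuzzy name lookup; difflib stands in for ConQuest here.
import difflib

names = {"Santa Barbara": 1, "Santa Maria": 2, "San Bernardino": 3}

def exact(name):
    return names.get(name)                       # fast, no fuzziness

def fuzzy(name, cutoff=0.8):
    return difflib.get_close_matches(name, list(names), n=3, cutoff=cutoff)

print(exact("Santa Barbra"))    # None: exact match misses the typo
print(fuzzy("Santa Barbra"))    # ['Santa Barbara']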
Content based retrieval
The Team investigated mechanisms to facilitate content-based
retrieval in image databases. In particular, evaluations were
made of two approaches to reduce the dimensionality of multidimensional
data: Fourier Transform and Singular Value Decomposition. Feature
extraction is used to summarize image content in terms of multidimensional
vectors. Unfortunately, the dimensionality of these vectors is
typically quite large, ranging from 24 to 120. None of the existing
index structures (e.g., R-trees and their variants) can cope with
this dimensionality for both point queries and range queries, and
in the image data domain similarity search (range queries) often
becomes necessary.
One criterion for the goodness of a content-based retrieval method is
that there should be no false dismissals, while the candidate set is
kept small to minimize false hits. Fourier transforms and Singular
Value Decompositions are being used to reduce the dimensionality of
the image vectors from, say, 24 to perhaps 4, 6, or 8. The
reduced-dimension data, indexed using R*-trees and clustering, will
be used for fast retrieval.
Exhaustive tests were performed to determine the most suitable
technique to implement in the WP.
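The underlying filter-and-refine logic can be sketched as follows.
Because orthogonal projection never increases Euclidean distance, a
range query evaluated in the reduced space can produce no false
dismissals; false hits are then removed by re-checking candidates at
full dimensionality. The data, dimensions, and SVD-based projection
below are illustrative:

# Filter-and-refine with an SVD projection from 24-d to 8-d.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 24))            # stand-in feature vectors
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:8].T                                   # 24 -> 8 projection

def range_query(q, radius):
    # Filter in reduced space (no false dismissals possible) ...
    cand = np.where(np.linalg.norm((X - q) @ P, axis=1) <= radius)[0]
    # ... then refine at full dimensionality to drop false hits.
    return [i for i in cand if np.linalg.norm(X[i] - q) <= radius]

print(len(range_query(X[0], 4.0)))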
Indexing methods for spatially-indexed data
In the area of search structures, the Team investigated various
multi-dimensional index structures such as R-trees, R*-trees,
R+-trees, and BV-trees, and completed a preliminary qualitative
analysis of these search structures. Although hierarchical structures
are prevalent in spatial data domains, the issue of indexing for
such nested data has received little attention in the database
and indexing community. Several issues in this regard have been
investigated while designing index structures for hierarchical
data. B-trees and related structures can only index unidimensional
"point" data. The Team extended B-trees (to IB-trees)
to handle data objects that span a range of values rather than
single-valued points in the data space.
Two different approaches were investigated for indexing multidimensional
hierarchical data. The first decomposes the d-dimensional data
objects into d intervals, one per dimension, and indexes the
intervals in each dimension separately. The second approach organizes
all data objects at the same level together using standard spatial
indexing schema.
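The first approach can be sketched in a few lines: each box
contributes one interval per dimension, per-dimension candidate sets
are found independently, and a containment (stabbing) query
intersects them. The data and structure below are illustrative, not
the Team's implementation:

# Per-dimension interval decomposition for containment queries.
boxes = {  # id -> ((x_lo, x_hi), (y_lo, y_hi))
    "A": ((0, 10), (0, 10)),
    "B": ((2, 4), (20, 30)),
}

def containing(point):
    """Boxes whose interval contains the point in every dimension."""
    per_dim = [{b for b, iv in boxes.items()
                if iv[d][0] <= point[d] <= iv[d][1]}
               for d in range(len(point))]
    return set.intersection(*per_dim)

print(containing((3, 5)))   # {'A'}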
The Team investigated experimentally the new indexing scheme that
it designed, called "Level-Based Interval B-trees".
Such trees are well suited for containment queries. This is the
first index structure with logarithmic worst-case bounds (in the level
of nesting and the size of the data) for single-dimensional interval data.
In experiments, it has proved to be up to 10 times more efficient
than existing index structures. The proposed index structure also
generalizes very easily to higher dimensions, and it is possible
to get a good speedup on parallel machines. This was shown through
experiments on the Meiko parallel machine. The inherent simplicity
of the design allowed it to be more efficient than a parallel
implementation of R*-trees.
Content-based placement for "wavelets" on secondary
storage
The Team investigated content-based image placement and browsing,
evaluating several strategies for storing
wavelet coefficients on multiple parallel disks so that thumbnail
browsing as well as image reconstruction can be done efficiently.
These strategies can be classified into two broad classes depending
on whether or not the content of the images is used in the placement
of the image coefficients. The simulation results indicate that
if content based retrieval is used to access the images, then
this information should also be used for the placement of images
on disk. In particular, when content-based placement is used to
store image coefficients on disk, performance improvements of
up to 40% are achieved using as few as four disks.
5.4 IMAGE PROCESSING TEAM
Membership: B. Manjunath (leader), Y. Ma, S. Mitra, N. Stroebel, Y. Wang
The Image Processing Team is responsible for investigating
issues concerning the representation, storage, and access of image
related data. The Team also aids the development team in adapting
their research findings and recommendations for the testbed system.
Particular foci of activity for the Team are wavelet decompositions
for storage, manipulation and transmission of images, and access
of images by content.
Image browser
The Team developed and implemented a stand-alone image browser.
It was primarily designed to demonstrate the functionality of
progressive and selective image reconstruction at multiple resolutions.
In addition, different image enhancement methods for zooming and
interpolation were investigated. A more efficient browser which
can be used on an arbitrarily-sized image (or image segment) was
developed for the WP.
An investigation of the browsing tool indicated that fast system
response is more important than accurate image reconstruction
at the intermediate levels. The accuracy of the intermediate representation
depends both on the particular image data as well as the choice
of wavelet filters. A basic problem under investigation is to
quantify the optimality of a given representation.
Optimal wavelets
Although many good wavelet filters for our application have already
been found and tested, the choice of an "optimal" wavelet
remains difficult. As a basic problem, there is a need to establish
the criteria defining optimality in the context of the Alexandria
Project. The team has investigated the performance of an optimal
uniform mean square quantizer in representing all wavelet coefficients
to ensure that the disk space necessary for storing a wavelet-based
multiresolution representation does not exceed that of the original
image. In addition, popular wavelet filters have been compared
with respect to their reconstruction performance and computational
complexity. Based on this work the Team has concluded that, for
the ADL application, the Haar wavelet filters offer an appropriate
compromise between reconstruction performance and computational
effort. Extension of the previous quantization scheme to incorporate
lossless reconstruction is an ongoing activity.
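For concreteness, a one-level 2-D Haar decomposition can be written
in a few lines; repeatedly applying the same step to the low-low band
yields the multiresolution pyramid used for progressive display.
This is a textbook formulation, not the Team's implementation:

# One-level 2-D Haar decomposition into four subbands.
import numpy as np

def haar2d(img):
    a = img.astype(float)
    lo = (a[:, 0::2] + a[:, 1::2]) / 2          # average columns
    hi = (a[:, 0::2] - a[:, 1::2]) / 2          # detail columns
    ll = (lo[0::2, :] + lo[1::2, :]) / 2        # then average rows
    lh = (lo[0::2, :] - lo[1::2, :]) / 2
    hl = (hi[0::2, :] + hi[1::2, :]) / 2
    hh = (hi[0::2, :] - hi[1::2, :]) / 2
    return ll, lh, hl, hh                       # ll: half-resolution approximation

img = np.arange(16).reshape(4, 4)
ll, lh, hl, hh = haar2d(img)
print(ll)   # 2x2 coarse approximation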
Storage of wavelets
While designing the uniform quantization scheme, the Team discovered
that one could store an encoded error image instead of the first-level
wavelet coefficients. The Team therefore investigated the advantages
of this storage modification: it permits perfect reconstruction of
the original image within the original storage limits, without
affecting progressive data transmission.
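The idea can be sketched as follows (the quantizer below is a lossy
stand-in for the actual coefficient pipeline): keep the lossy
reconstruction for progressive transmission, and store the losslessly
encoded difference to the original in place of the first-level detail
coefficients.

# Error-image storage sketch: lossy approximation + stored residual
# recover the original exactly.
import numpy as np

def quantize(x, step=4):  # lossy stand-in for the coefficient quantizer
    return np.round(x / step) * step

original = np.arange(64, dtype=float).reshape(8, 8)
approx = quantize(original)          # what the lossy pyramid reconstructs
error = original - approx            # stored (losslessly) instead of the
                                     # first-level detail coefficients
restored = approx + error
assert np.array_equal(restored, original)   # perfect reconstruction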
Texture features for browsing and retrieval
Research on content-based retrieval in image databases has focused
on using image properties, such as color, texture, histogram,
and shape, for searching through images. The Alexandria Project
has made considerable progress in developing algorithms for texture
based search. These algorithms are being used in implementing
content based search in the web prototype using image texture
as the measure of content.
The Team has investigated and developed an effective texture feature
extraction scheme. The scheme is based on the multiresolution
Gabor wavelet decomposition. Simple statistical moments, such
as the mean and standard deviation of the filtered outputs, can
then be used as indices to search the database. The Team has compared
the performance of different texture features in terms of the
retrieval accuracy and efficiency. These evaluations were performed
using the Brodatz texture album, with over 100 different textures.
Gabor filters demonstrably offer the best performance among the
multiresolution texture features that have been compared (i.e.,
the tree-structured wavelet transform, the conventional orthogonal
and bi-orthogonal transforms, and the multiresolution autoregressive
model). In order to reduce the image processing time, the Team
developed an adaptive filtering scheme that reduces the image
processing computations while maintaining retrieval accuracy.
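A compressed sketch of the feature-extraction step follows;
the filter-bank parameters are illustrative, not the Team's tuned
multiresolution design, and scipy provides the convolution:

# Gabor texture features: mean and standard deviation of each
# filtered output form the index vector used to search the database.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, freq, sigma=2.0, size=11):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * freq * xr)

def texture_features(img, thetas=(0, np.pi/4, np.pi/2), freqs=(0.1, 0.3)):
    feats = []
    for t in thetas:
        for f in freqs:
            r = convolve2d(img, gabor_kernel(t, f), mode="valid")
            feats += [r.mean(), r.std()]        # 2 numbers per filter
    return np.array(feats)                      # the search index vector

img = np.random.default_rng(1).random((64, 64))
print(texture_features(img).shape)              # (12,)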
In relation to the catalog component, methods for indexing library
items using these wavelet based texture features are being investigated.
The Team conducted extensive experiments on the entire Brodatz
texture album and is using the developed methodology in searching
satellite image data. In collaboration with the database researchers,
the team is addressing issues related to indexing and search in
the feature space.
For the WP, the Team has created a design for a database of aerial
photographs which can be searched using texture templates. At
the time of ingest, these images are analyzed and texture information
is extracted. A small set of texture templates is created which
represents the different textures that may occur in these photographs.
At the time of a user-initiated search, the user can choose a region
of interest and search the database based on the texture information
within the region.
The Team is currently investigating the use of neural network
based learning algorithms for unsupervised clustering and for
learning suitable distance metrics for image comparisons. Future
research emphases will be on integrating different visual cues
(such as texture, shape, color, etc.) for image retrieval.
5.5 PERFORMANCE AND PARALLEL PROCESSING TEAM
Membership: Yang (leader), Andresen, Egecioglu, Ibarra,
Poulakidas, Srinivasan, Zheng
It is clear that the success of DLs in general, and of ADL
in particular, is heavily dependent on high-performance computing.
It is our belief that parallel processing, particularly in the
form of networks of workstations, has an important role to play
in achieving this high performance. The responsibility of the
Performance and Parallel Processing Team is to identify and investigate
aspects of ADL that will benefit from high-performance computing
on multi-computers. In particular, the Team is investigating various
performance issues arising from the ADL environment in terms of
both space and time complexities. It is also developing algorithms
and software techniques for high performance digital libraries.
A scalable WWW server on multicomputers
The Team has investigated issues involved in developing a scalable
WWW server on a cluster of workstations and parallel machines,
using the Hypertext Transfer Protocol (HTTP). The main objective
is to improve the processing capabilities of the ADL server by
utilizing the power of multicomputers to match the demands of
simultaneous access requests from the WWW.
The team has developed and implemented a system called SWEB on
a distributed memory machine, the Meiko CS-2, and networked SUN
and DEC workstations. Each processing unit is a workstation linked
to a local disk. The disks are NFS-mounted to all processing units.
Scalability of the server is achieved through effective resource
utilization: the system actively monitors the run-time CPU, disk I/O,
and network loads of the system's resource units, and dynamically
schedules user HTTP requests to an appropriate workstation for
efficient processing.
The distinguishing feature of the scheduling scheme is that it
considers the aggregate impact of multiple resource load
factors (e.g. CPU, I/O channels and interconnection network) on
the choice of processor assignment. Previous work typically considered
one resource load factor in the scheduling scheme.
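A minimal sketch of such aggregate-cost scheduling follows; the
linear cost model and weights are our own illustrative assumptions,
not SWEB's actual rule:

# Aggregate-load scheduling: combine CPU, disk-I/O, and network load
# into one cost per node and route the request to the cheapest node.
def pick_node(nodes, w_cpu=0.5, w_io=0.3, w_net=0.2):
    def cost(n):
        return w_cpu * n["cpu"] + w_io * n["io"] + w_net * n["net"]
    return min(nodes, key=cost)

nodes = [
    {"name": "ws1", "cpu": 0.9, "io": 0.2, "net": 0.1},
    {"name": "ws2", "cpu": 0.3, "io": 0.4, "net": 0.2},
]
print(pick_node(nodes)["name"])   # ws2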
The team has conducted extensive experiments to examine the overall
performance of this system and to test several performance factors
that affect scalability. Among the issues examined, for
example, was how many requests per second could be processed in
delivering regular files, such as image thumbnails or text and
also accessing subregions of compressed wavelet data. The team
also studied the improvement of response time and drop ratios
when the number of server nodes is varied.
Experiments with SWEB indicate that the system provides sustained
round-trip performance when the number of requests reaches 5 to
30 million per week. These results have been compared with those
of other approaches. NCSA, for example, has built a multi-workstation
HTTP server based on round-robin domain name resolution to assign
requests to workstations. The round-robin technique is effective
when HTTP requests access HTML information in chunks of relatively
uniform size. For ADL, however, the computational and I/O demands
of requests may vary dramatically because of the large images
and metadata files of variable sizes, and the round-robin approach
cannot effectively utilize resources. The round-robin approach
has been compared to our load-balancing approach for processing
different ADL-related requests and a 20% to 50% improvement in
performance has been observed.
Fast subregion retrieval, image compression, and parallel wavelet
transforms
The team has investigated parallel wavelet transforms and related
I/O storage schemes, as well as parallel scheduling techniques
for supporting parallel image processing.
The forward and reverse transforms have been coded and tested,
yielding superlinear speedup on large images.
Experimental results arising from an implementation of a prototype
of parallel wavelet transformations (forward and reverse) with
support of parallel I/O facilities indicated that the storage
scheme has a significant influence on the design of the algorithm.
Hence a storage and compression scheme was developed in which
compression techniques used by the EPIC group (MIT) are combined with
quadtrees. The EPIC group uses a variation of run-length and
Huffman encoding methods to compress the quantized coefficient
matrices created by the forward wavelet transformation.
The reason for using a hybrid coding technique based on quad-tree
and Huffman coding methods is not only to achieve effective image
data compression to save disk space but also to minimize the time
spent in retrieving subregions. The new scheme supports decompression/retrieval
of image subregions in multi-resolution data, because subregion
access is required for browsing large images and it is impossible
to view an entire image on a single screen. Thus the hybrid code
involves trade-offs between the compression ratio and retrieval
times.
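The benefit for subregion retrieval can be sketched as follows:
with coefficients stored in quadtree tiles, a query decodes only the
tiles its window overlaps rather than the whole image. The tiling
parameters are illustrative, and the run-length/Huffman stage is
elided:

# Quadtree-tile addressing: find the leaf tiles a subregion touches.
def tiles_for(region, tile=128):
    """Quadtree-leaf tiles (as grid coordinates) overlapped by a region."""
    x0, y0, x1, y1 = region
    return [(i, j)
            for i in range(x0 // tile, (x1 - 1) // tile + 1)
            for j in range(y0 // tile, (y1 - 1) // tile + 1)]

# A 150x150 window touches 4 of the 64 tiles of a 1024x1024 image:
print(tiles_for((100, 100, 250, 250)))  # [(0,0), (0,1), (1,0), (1,1)]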
The team conducted experiments with sample satellite images from
the ADL collection. Current results indicate that a 70-90% space
reduction can be achieved for quantized image coefficient
data, while the time for accessing a subregion is less than a few
seconds using a SPARC 5 with a SCSI-2 disk. These methods have
been incorporated with the quadtree representation so that a
subregion of an image can be reconstructed efficiently while still
sustaining a good compression ratio. The various code components
are being combined to build an implementation of the storage and
compression scheme and to develop a scheme for scheduling parallel
I/O accesses and wavelet transforms.