PLANNED ACTIVITY
The Team will design a distributed architecture for Alexandria
in which the data
may be stored across several remote sites, and in which a query may be
routed through several constituent catalogs until it is finally processed.
Goals for this design include the ability to partition a catalog among
several servers for load balancing and load distribution, as well as
the ability to replicate frequently accessed catalog items to reduce bottlenecks.
To further reduce server load, the design will provide caching of
queries, portions of the catalog, and other intermediate information
at the client site. Furthermore, the Team will explore various options for
designing an ``intelligent store'', which handles not only data, but
also methods that can be executed on the data at the storage site. The goal
is to reduce unnecessary communication and computation overhead at the
catalog/client sites.
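As a rough illustration of the intended division of labor (a sketch only; the class and method names below are hypothetical and not part of the actual design), an intelligent store might execute registered methods next to the data and return only the small result:

    # Hypothetical sketch: the store holds data and a registry of methods that
    # are executed at the storage site, so only the result crosses the network.
    class IntelligentStore:
        def __init__(self):
            self._items = {}     # item_id -> raw data (e.g., image bytes, metadata)
            self._methods = {}   # method name -> callable run at the store

        def put(self, item_id, data):
            self._items[item_id] = data

        def register_method(self, name, func):
            self._methods[name] = func          # executed here, not at the client

        def invoke(self, name, item_id, *args):
            return self._methods[name](self._items[item_id], *args)

    # Example: summarize a large item at the store instead of shipping it.
    store = IntelligentStore()
    store.put("map-042", b"\x00" * 10_000_000)                   # stand-in for a large image
    store.register_method("size_in_kb", lambda data: len(data) // 1024)
    print(store.invoke("size_in_kb", "map-042"))                 # only the small result returns

Only the method result travels back to the caller, which is where the hoped-for saving in communication and computation comes from.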
Typical user activities will be taken into account
in the design of the distributed architecture, in particular
the fact that much of a user's time is spent browsing for data
before the long-duration retrieval of full images of
successfully located items. The design must therefore
support this usage pattern by
providing many small intermediate results during browsing before
initiating a single large data transfer.
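A minimal sketch of this access pattern (the functions and catalog layout below are illustrative assumptions, not the testbed interface) delivers many small pages of lightweight hits before one bulk transfer:

    # Sketch of the browse-then-retrieve pattern: many small pages of metadata
    # or thumbnails are served before one long-duration full-image retrieval.
    def browse(catalog, matches, page_size=20):
        hits = [item for item in catalog if matches(item)]
        for start in range(0, len(hits), page_size):
            yield hits[start:start + page_size]      # small intermediate result

    def retrieve_full(item):
        return item["full_image_path"]               # placeholder for the one bulk fetch

    catalog = [{"title": f"sheet {i}", "full_image_path": f"/imgs/{i}.tif"} for i in range(100)]
    for page in browse(catalog, lambda item: "1" in item["title"]):
        pass                                         # the user scans thumbnails here
    data = retrieve_full(catalog[17])                # single large data result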
The architectural design for a distributed catalog system
will be passed to the Testbed Team for implementation.
ACTUAL ACTIVITY
The Team has prototyped a messaging layer and dynamic object
model for the Data Store component of the Distributed Catalog.
The messaging layer of this project addresses the requirements
of the Distributed Catalog architecture, namely, distributed
processing, interoperability, and dynamic Alexandria objects.
A paper describing this design and implementation has been
submitted for publication.
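The submitted paper describes the actual design; purely to indicate the flavor of such a layer (all names and the wire format below are hypothetical), a dispatcher for self-describing messages about dynamically typed objects could be sketched as follows:

    # Illustrative sketch of a messaging layer: self-describing messages are
    # routed to handlers keyed by object type and operation.
    import json

    class MessageBus:
        def __init__(self):
            self._handlers = {}   # (object_type, operation) -> handler

        def register(self, object_type, operation, handler):
            self._handlers[(object_type, operation)] = handler

        def dispatch(self, raw_message):
            msg = json.loads(raw_message)            # interoperable wire format
            handler = self._handlers[(msg["type"], msg["op"])]
            return handler(msg["payload"])

    bus = MessageBus()
    bus.register("gazetteer-entry", "lookup",
                 lambda payload: {"name": payload["name"], "found": True})
    reply = bus.dispatch(json.dumps({"type": "gazetteer-entry", "op": "lookup",
                                     "payload": {"name": "Goleta"}}))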
PLANNED ACTIVITY
The Team is continuing to experiment with Level-based Interval B-trees (LIB trees), which were proposed earlier. So far, the Team has concentrated on uniformly generated data and has used the index structure in memory. The index structure is now being placed on disk, using real data from the Alexandria Gazetteer, and its performance is being compared with that of existing index structures such as R*-trees. As part of this experiment, the size of the cache (for holding the index structure) is being varied and its impact on performance observed.
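The experiments themselves use the real index code and gazetteer data; as a toy illustration of the cache-size sweep (the synthetic page trace and the LRU policy are assumptions, not the actual setup), hit rates for several cache sizes could be measured like this:

    # Toy illustration of varying the cache size: replay a page-access trace
    # through an LRU cache of each size and record the resulting hit rate.
    from collections import OrderedDict
    import random

    def hit_rate(trace, cache_pages):
        cache, hits = OrderedDict(), 0
        for page in trace:
            if page in cache:
                hits += 1
                cache.move_to_end(page)
            else:
                cache[page] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)        # evict the least recently used page
        return hits / len(trace)

    random.seed(0)
    trace = [random.randint(0, 999) for _ in range(20_000)]   # stand-in for index-page accesses
    for size in (16, 64, 256, 1024):
        print(size, round(hit_rate(trace, size), 3))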
By February 1997, it is expected that look-ahead and
page replacement strategies for caching index structures
will be investigated, with the
goal of balancing the computation with the I/O
so that both may proceed in parallel.
This will be done for LIB trees, R*-trees, and possibly
other indexing schemes. After due investigation, these
will be incorporated into the Alexandria system.
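As a sketch of what overlapping computation with I/O might look like in this setting (the prefetch depth and the simulated page reads are assumptions), a look-ahead reader can fetch upcoming pages on a background thread while the current page is processed:

    # Sketch of look-ahead: a background thread reads the next index pages while
    # the main thread processes the current one, so I/O and computation overlap.
    import threading, queue, time

    def read_page(page_id):
        time.sleep(0.01)                             # stand-in for a disk read
        return f"contents of page {page_id}"

    def prefetching_reader(page_ids, lookahead=4):
        buffered = queue.Queue(maxsize=lookahead)
        def producer():
            for pid in page_ids:
                buffered.put(read_page(pid))         # blocks once `lookahead` pages are buffered
            buffered.put(None)                       # end-of-stream marker
        threading.Thread(target=producer, daemon=True).start()
        while (page := buffered.get()) is not None:
            yield page

    for page in prefetching_reader(range(20)):
        time.sleep(0.01)                             # stand-in for processing the page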
Finally, the Team will investigate the performance of
the index structures on parallel systems, with parallel
disk I/O. Issues such as segmentation of an index
structure, its placement on different disks, and balancing
of the computation and I/O on different nodes will
become important here.
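A minimal sketch of the segmentation and parallel-I/O issue (the round-robin placement and one-worker-per-disk model are assumptions, not the planned design) might look like this:

    # Sketch of declustering index pages across several disks and issuing the
    # reads in parallel, one worker per disk.
    from concurrent.futures import ThreadPoolExecutor
    import time

    NUM_DISKS = 4

    def disk_of(page_id):
        return page_id % NUM_DISKS                   # simple round-robin segmentation

    def read_from_disk(disk_id, page_id):
        time.sleep(0.01)                             # stand-in for an I/O request to that disk
        return (disk_id, page_id)

    def parallel_fetch(page_ids):
        with ThreadPoolExecutor(max_workers=NUM_DISKS) as pool:
            futures = [pool.submit(read_from_disk, disk_of(p), p) for p in page_ids]
            return [f.result() for f in futures]

    print(parallel_fetch(range(12)))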
ACTUAL ACTIVITY
The Team refined the idea of LIB structures into promotion-based
indexing, carrying out on-disk experiments with gazetteer
and catalog data, and demonstrated
improvements over existing index structures.
The results could not be incorporated into Illustra
because Illustra does not
yet permit the addition of an external index structure.
Parallel processing was used to improve indexing
performance and the results were reported in [1].
Besides parallel query processing, some preliminary
experiments were also conducted
on parallel creation/updating of multidimensional indices.
In terms of new results, a lower bound
was established for linear-space, tree-structured index structures
that hold n d-dimensional objects.
PLANNED ACTIVITY
Content-based retrieval of images often involves searching a large number of
multi-dimensional vectors for similar vectors. The team has
been exploring
various dimension-reducing techniques based on mathematical transforms,
including the Fourier transform,
discrete cosine transforms, and singular value decompositions.
The Team is
currently evaluating their recall and precision results with respect to sets
of real images in the Alexandria collection. A next step is to
incorporate these results in the construction of
an index structure that exploits the
properties of the transforms.
In the long term, it is hoped that these results will be
incorporated into the distributed Alexandria architecture to
support efficient browsing.
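As a minimal numpy sketch of the kind of reduction being evaluated (the random feature matrix and the number of retained dimensions are assumptions; the actual evaluation uses feature vectors derived from real images), a DFT or SVD representation can be truncated as follows:

    # Minimal sketch of dimension reduction of feature vectors via DFT and SVD.
    import numpy as np

    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 60))            # 500 vectors, 60 features each
    k = 8                                            # retained dimensions

    # DFT: keep only the first k Fourier coefficients of each vector.
    dft_reduced = np.fft.fft(features, axis=1)[:, :k]

    # SVD: project onto the top-k singular directions of the collection.
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    svd_reduced = features @ Vt[:k].T                # shape (500, k)

    print(dft_reduced.shape, svd_reduced.shape)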
ACTUAL ACTIVITY
The Team has completed evaluating two different types of
dimension-reducing techniques on image feature vectors. A common
problem in content-based retrieval is the ``curse of
dimensionality'': to obtain good retrieval results, a large number
of features is required in each feature vector, yet these high-dimensional
feature vectors are difficult to index.
To address this problem, the Team applied the Discrete Fourier Transform
and the Singular Value Decomposition to feature vectors generated from
an album of texture patterns called the Brodatz album. After performing
each transformation, the vectors were truncated to several lengths to
measure standard retrieval performance: recall and precision. The
Team has found that although SVD processing provides a better approximation
of the feature vectors, and thus better retrieval performance overall, DFT
processing yields better results for queries for the 20 nearest neighbors.
This is a range that is suitable for browsing the Alexandria catalog.
It is hoped that this result can be incorporated into the Alexandria
architecture for browsing the catalog.
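To make the evaluation concrete, the following hedged sketch (synthetic vectors rather than the Brodatz features) measures the quality of a 20-nearest-neighbor query in the reduced space against the full-dimensional ranking; since the retrieved and relevant sets have the same size here, precision and recall coincide:

    # Sketch of evaluating 20-nearest-neighbor retrieval after dimension reduction:
    # overlap of the reduced-space neighbors with the full-space ("true") neighbors.
    import numpy as np

    def knn(database, query_vec, k):
        dists = np.linalg.norm(database - query_vec, axis=1)
        return set(np.argsort(dists)[:k])

    rng = np.random.default_rng(1)
    full = rng.normal(size=(500, 60))
    U, S, Vt = np.linalg.svd(full, full_matrices=False)
    reduced = full @ Vt[:8].T                        # truncated (SVD-based) vectors

    k = 20
    scores = []
    for q in range(50):                              # use the first 50 vectors as queries
        truth = knn(full, full[q], k)                # neighbors in the original space
        found = knn(reduced, reduced[q], k)          # neighbors after reduction
        scores.append(len(truth & found) / k)
    print(round(float(np.mean(scores)), 3))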
PLANNED ACTIVITY
The Team is currently evaluating different methods of placing the wavelet decomposed components of images on magnetic disks. The different strategies explore declustering the data in different ways on and across the disks. The current setup consists of a client-server application whereby the clients make requests for image thumbnails or the complete images, which are handled by the server. The server makes use of a disk simulator (which simulates the HP97560 disks) to determine the response times for various placement strategies.
The current objective is to expand on the work already done in this direction last year (1995). The earlier work viewed the whole population of clients in an aggregated manner, simulating them as a single client process, and the images were decomposed by only one level. In order to better simulate the expected scenario, multiple clients must be introduced. The Team plans to extend this work by using multiple clients and multiple levels of resolution.
At present the only simulated part of the whole setup is the disks. Once placement strategies have been tested using the disk simulator, the next step is to move to real disks in the Unix domain. The performance of these strategies in a real setting will serve as a reality check on the results obtained earlier. Furthermore, although the current model does incorporate a real network, the effects of the network have not yet been studied aggressively. It is therefore planned to study these effects in more detail once real disks are in use.
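One possible declustering rule, shown purely as an illustration (the disk count, subband layout, and shifted round-robin placement are assumptions, not the strategies under test), spreads the subbands of each image across the disks so that a thumbnail touches one disk while a full image draws on all of them:

    # Sketch of declustering wavelet subbands: thumbnail requests need only the
    # coarse subband (one disk); full-image requests are spread over all disks.
    NUM_DISKS = 4
    SUBBANDS_PER_IMAGE = 4            # one coarse band plus three detail bands

    def placement(image_id, subband):
        return (image_id + subband) % NUM_DISKS      # shifted round-robin placement

    def disks_for_request(image_id, thumbnail_only):
        if thumbnail_only:
            return {placement(image_id, 0)}          # only the coarse subband is read
        return {placement(image_id, s) for s in range(SUBBANDS_PER_IMAGE)}

    print(disks_for_request(7, thumbnail_only=True))    # a single disk
    print(disks_for_request(7, thumbnail_only=False))   # load spread over all disks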
Due to the size and number of images that will be handled by Alexandria,
it will be necessary to store a major part of the data on tertiary storage;
the data on magnetic disks will act as a cache for the tertiary storage.
Therefore, the longer term plans include designing models for
tertiary storage and incorporating these into the placement model.
New placement strategies for data that
will reside on tertiary storage will be devised and tested. It is expected
that actual tertiary systems will be eventually incorporated into the study.
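As a rough sketch of the disk-as-cache arrangement (object sizes, capacity, and the LRU staging policy below are assumptions for illustration), requested objects would be staged from tertiary storage onto disk and evicted when the disk budget is exceeded:

    # Sketch of magnetic disks acting as a cache for tertiary storage: objects are
    # staged in on demand and least-recently-used objects are evicted over budget.
    from collections import OrderedDict

    class DiskCache:
        def __init__(self, capacity_mb, fetch_from_tertiary):
            self.capacity_mb = capacity_mb
            self.fetch = fetch_from_tertiary         # slow path: read from tape/optical
            self.resident = OrderedDict()            # object_id -> size_mb, LRU order
            self.used_mb = 0

        def get(self, object_id, size_mb):
            if object_id in self.resident:
                self.resident.move_to_end(object_id)     # disk hit
                return "hit"
            self.fetch(object_id)                        # stage in from tertiary storage
            self.resident[object_id] = size_mb
            self.used_mb += size_mb
            while self.used_mb > self.capacity_mb:       # evict until within budget
                _, evicted_size = self.resident.popitem(last=False)
                self.used_mb -= evicted_size
            return "miss"

    cache = DiskCache(capacity_mb=100, fetch_from_tertiary=lambda oid: None)
    print([cache.get(i % 6, size_mb=30) for i in range(10)])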
ACTUAL ACTIVITY
Simulations were performed for both single-client and multiple-client settings, and the results have been reported in a technical report (TRCS96-22). The Team ported and extended the placement policies to Unix disks. Network effects were naturally included in the system simulations. A technical report describing the results of the placement experiments has been prepared and submitted for publication. The inclusion of tertiary storage in this system was not studied; instead, the Team addressed several problems specific to tertiary storage systems.
PLANNED ACTIVITY
The development of the testbed catalog system has already
revealed a serious issue concerning
the degradation of its performance as the database
grows to a realistically-sized
collection of information for the catalog and gazetteer.
The overall objective of the research is to improve
the performance of the catalog system. The Team will achieve this goal
through a careful re-examination of the current database schema, the
analysis of queries (classifications and frequencies), and possibly an
improvement of the process of query translation (from web interface to
SQL). The work will involve two stages, corresponding approximately to
the short and medium term aspects of the schedule shown below.
In the first stage, the Team will focus on
studying and analyzing the existing schema and implementation and on developing
initial approaches for further investigation. In the second stage, the Team
plans to verify and improve the initial optimization methods through
experiments, possibly involving a small suite
of DBMSs that are candidates for full operational use.
Optimization alternatives will be produced for implementation
by the end of the year.
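As one small, hedged example of the kind of query analysis envisaged (the log entries and the classification rules below are invented for illustration), logged catalog queries could be bucketed by type and counted by frequency:

    # Sketch of classifying logged catalog queries and tallying their frequencies,
    # as input to the schema and query-translation re-examination.
    from collections import Counter

    def classify(query_text):
        q = query_text.lower()
        if "bbox" in q or "within" in q:
            return "spatial"
        if "date" in q or "between" in q:
            return "temporal"
        if "like" in q:
            return "textual"
        return "other"

    log = [
        "SELECT * FROM catalog WHERE bbox WITHIN (-120, 34, -119, 35)",
        "SELECT title FROM catalog WHERE title LIKE '%Goleta%'",
        "SELECT * FROM catalog WHERE acq_date BETWEEN 1990 AND 1996",
    ]
    print(Counter(classify(q) for q in log).most_common())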
ACTUAL ACTIVITY
This task has yet to be completed.
ADDITIONAL ACTIVITIES
We note that several new tasks not listed in the previous plan were undertaken. In particular, the research initiative concerning Pharos and Resource Discovery originated during the past year.