next up previous
Next: IMAGE PROCESSING TEAM Up: COMPARISON OF ACTUAL Previous: INTERFACE DESIGN AND

INFORMATION SYSTEMS TEAM

IS1
RESEARCH AND DEVELOPMENT TASK: Distributed Catalog architecture

PLANNED ACTIVITY

The Team will design a distributed architecture for Alexandria in which the data may be stored across several remote sites, and in which a query may be routed through several constituent catalogs until it is finally processed. Goals for this design include the ability to partition a catalog among several different servers for load balancing and load distribution, as well as replicating frequently accessed catalog items to reduce bottlenecks. To further improve load utilization, the design will provide caching of queries, parts of the catalog as well as other intermediate information at the client site. Furthermore, the Team will explore various options for designing an ``intelligent store'', which handles not only data, but also methods that can be executed on the data at the storage site. The goal is to reduce unnecessary communication and computation overhead at the catalog/client sites. Typical user activities will be taken into account in the design of the distributed architecture and, in particular, the fact that much of a user's time is spent browsing for data before the occurrence of long duration full image retrievals of successfully-located items. The design must therefore support this usage pattern by providing many small intermediate results (during browsing), before initiating a single large data result. The architectural design for a distributed catalog system will be passed to the Testbed Team for implementation.

ACTUAL ACTIVITY

The team has prototyped a messaging layer and dynamic object model for the Data Store component of the Distributed Catalog. The messaging layer of this project addresses the requirements of the Distributed Catalog architecture, namely, distributed processing, interoperability, and dynamic Alexandria objects. A paper describing this design and implementation has been submitted for publication.

IS2
RESEARCH TASK: Multidimensional indexing

PLANNED ACTIVITY

The Team is continuing to experiment with Level-based Interval B-trees (LIB trees), which were proposed earlier. So far, the Team has concentrated on uniformly generated data and used the index structure in memory. The index structure is now being placed on a disk, using real data from the Alexandria Gazetter, and comparing its performance with existing index structures such as R*-trees. As part of this experiment, the size of the cache (for holding the index structure) is being varied and its impact upon performance is being observed.

By February 1997, it is expected that look-ahead and page replacement strategies for caching index-structures will be investigated, with the goal of balancing the computation with the I/O so that both may proceed in parallel. This will be done for LIB-trees, R* trees, and possibly other indexing schemes. After due investigation, these will be incorporated into the Alexandria system. Finally, the Team will investigate the performance of the index structures on parallel systems, with parallel disk I/O. Issues such as segmentation of an index structure, its placement on different disks, and balancing of the computation and I/O on different nodes will become important here.

ACTUAL ACTIVITY

The Team improved the idea of LIB structures to promotion-based indexing, carrying out experiments on disk for gazetteer and catalog. It also demonstrated improvements over existing index structures. The results could not be incorporated into Illustra because Illustra does not yet permit the addition of an external index structure. Parallel processing was used to improve indexing performance and the results were reported in [1]. Besides parallel query processing, some preliminary experiments were also conducted on parallel creation/updating of multidimensional indices. In terms of new results, a lower bound of was established for linear-space, tree-structured index structures that hold n d-dimensional objects.

IS3
RESEARCH TASK: Content based indexing / browsing

PLANNED ACTIVITY

Content-based retrieval of images often involves searching a large number of multi-dimensional vectors for similar vectors. The team has been exploring various dimension-reducing techniques based on mathematical transforms, including the Fourier transform, discrete cosine transforms, and single value decompositions. The Team is currently evaluating their recall and previous results with respect to sets of real images in the Alexandria collection. A next step is to incorporate these results in the construction of an index structure that exploits the properties of the transforms. In the long term, it is hoped that these results will be incorporated into the distributed Alexandria architecture to support efficient browsing.

ACTUAL ACTIVITY

The Team has completed evaluating 2 different types of dimension-reducing techniques upon image feature vectors. One usual problem associated with content-based retrieval is called the ``curse of dimensionality''. In order to obtain good retrieval results, large number of features are required in a feature vector. These large-dimension feature vectors, however, are difficult to index. To address this problem, the team performed Discrete Fourier Transform and Singular Value Decomposition upon feature vectors generated from an album of texture patterns called the Brodatz album. After performing each transformation, the vectors were truncated to serial lengths to measure standard retrieval performance: recall and precision. The team has found that though SVD processing provides better approximation for feature vectors and thus better retrieval performance overall, DFT processing yields better results for queries of 20 nearest neighbors. This is a range that is suitable for browsing the Alexandria catalog. It is hoped that this result can be incorporated into the Alexandria architecture for browsing the catalog.

IS4
RESEARCH TASK: Data placement / storage hierarchy

PLANNED ACTIVITY

The Team is currently evaluating different methods of placing the wavelet decomposed components of images on magnetic disks. The different strategies explore declustering the data in different ways on and across the disks. The current setup consists of a client-server application whereby the clients make requests for image thumbnails or the complete images, which are handled by the server. The server makes use of a disk simulator (which simulates the HP97560 disks) to determine the response times for various placement strategies.

The current objective is to expand on the work already done in this direction last year (95). The earlier work viewed the whole population of clients in an aggregated manner by simulating them as a single client process and the images were decomposed by only one level. In order to better simulate the expected scenario, multiple clients must be introduced. The Team plans to extend this by working with multiple clients and multiple levels of resolution.

At present the only simulated part of the whole setup is the disks. Once placement strategies have been tested using the disk simulator, the next step is to move to real disks in the Unix domain. The performance of these strategies in a real setting would be a reality check for the results obtained earlier. Furthermore, and although the current model does incorporate a real network, the effects of the network are not being aggressively studied. It is therefore planned to study these effects in more detail once real disks are in use.

Due to the size and number of images that will be handled by Alexandria, it will be necessary to store a major part of the data on tertiary storage; the data on magnetic disks will act as a cache for the tertiary storage. Therefore, the longer term plans include designing models for tertiary storage and incorporating these into the placement model. New placement strategies for data that will reside on tertiary storage will be devised and tested. It is expected that actual tertiary systems will be eventually incorporated into the study.

ACTUAL ACTIVITY

Simulations were done for both single and multiple client settings. The results have been reported in a technical report (TRCS96-22). The Team ported and extended placement policies to Unix disks. Network effects were naturally included in the system simulations. A technical report describing the results of the placement experiments has been prepared and submitted for publication. Studying the inclusion of tertiary storage in this system was not done. Instead, the following problems of tertiary storage systems were addressed:

  1. general investigation of the problem of tertiary storage. A report summarizing the current technologies and research in the field was prepared and will be presented at ACM SAC 97.
  2. The problem of I/O scheduling for automatic tertiary storage libraries was investigated. Optimal and near-optimal solutions were found. A report summarizing the results has been prepared (TRCS96-26) and submitted for publication. Also, the Team started investigating the problems of 2-dimensional data prefetching and placement of multidimensional index structures.
At present, interactions are ongoing with the San Diego Supercomputing Center to obtain trace data for their tertiary storage system.

IS5
RESEARCH TASK: Performance of Catalog

PLANNED ACTIVITY

The development of the testbed catalog system has already revealed a serious issue concerning the degradation of its performance as the database grows to a realistically-sized collection of information for the catalog and gazetteer. The overall objective of the research is to improve the performance of the catalog system. The Team will achieve this goal through a careful re-examination of the current database schema, the analysis of queries (classifications and frequencies), and possibly an improvement of the process of query translation (from web interface to SQL). The work will involve two stages, corresponding approximately to the short and medium term aspects of the schedule shown below. In the first stage, the Team will focus on studying and analyzing the existing schema/implementation and develop initial approaches for further investigation. In the second stage, the Team plans to verify and improve the initial optimization methods through experiments, possibly involving a small suite of DBMS that are candidates for full operational use. Optimization alternatives will be produced for implementation by the end of the year.

ACTUAL ACTIVITY

This task has yet to be completed.

ADDITIONAL ACTIVITIES

We note that new tasks that were not listed in the previous plan were undertaken. In particular, the research initiative concerning Pharos and Resource Discovery originated during the past year.



next up previous
Next: IMAGE PROCESSING TEAM Up: COMPARISON OF ACTUAL Previous: INTERFACE DESIGN AND



Terence R. Smith
Thu Feb 20 13:50:53 PST 1997