INFORMATION SYSTEMS TEAM
[Abstracts of published articles]

[1]
K V Ravi Kanth, Divyakant Agrawal, Amr El Abbadi, Ambuj K Singh, and Terence R. Smith, "Parallelizing multidimensional index structures", IEEE Symposium on Parallel and Distributed Processing, New Orleans, pp. 376-383, October 1996. Indexing multidimensional data is inherently complex leading to slow query processing. This behavior becomes more pronounced with the increase in database size and/or number of dimensions. In this paper, we address this issue by processing an index structure in parallel. First, we study different ways of partitioning an index structure. We then propose efficient algorithms for processing each query in parallel on the index structure. Using these strategies, we parallelized two multidimensional index structures -- R* and LIB and evaluated the performance gains for the Gazetteer and the Catalog data of the Alexandria Digital Library on the Meiko CS-2.

[2]
Efficient Retrieval for Browsing Large Image Databases, D. Wu, D. Agrawal, A. El Abbadi, A. Singh, T. R. Smith, Proc of the 5th International Conf on Information and Knowledge Management 96, pp 11-18, Rockville, Maryland, Nov 12-16, 1996.

The Alexandria project has been initiated to build a digital library for map and satellite images. Designed for content-based retrieval, the relevant information in each image is encoded in the form of a multi-dimensional feature vector. Though representing images by feature vectors greatly facilitates user queries, indexing these vectors degrades performance when the number of dimensions is large. We consider 2 popular techniques (DFT and SVD) to reduce the dimension of feature vectors, and study their retrieval performance with respect to recall and precision. We find that though SVD generally out-performs DFT, DFT compares favorably in a limited range suitable for browsing large image databases.

[3]
A Java-Based Architecture for Digital Library Storage, D. Wu, D. Agrawal, A. El Abbadi, A. Singh, Proc of SPIE, Photonics East, Voice, Video, and Data, Boston Mass, Nov 23-24, 1996

The Alexandria Digital Library Project has been tasked with the goal of building a digital geographical library. To meet these requirements we have designed and prototyped an intelligent Data Store to store its holdings; the library's map, image, and geographical data are viewed as a collection of distributed objects. Developed using the Java language, the Data Store was designed to address the problems associated with digital library storage: interoperability, extensibility, distribution, and elimination of server bottlenecks.

[4]
"A Brief Survey of Tertiary Storage Systems and Research" S. Prabhakar, D. Agrawal, A. El Abbadi and A. Singh to appear in ACM Symposium on Applied Computing (ACM SAC '97).

This report summarizes current state of the art in tertiary storage systems. We begin with a comprehensive discussion of magnetic tape and optical storage technologies. This is followed by a classification of commercial products based on their performance characteristics. Our analysis of product data indicates that in contrast to disk technology, tertiary storage products have significant variability in terms of data transfer rates as well as other performance figures. We then summarize efforts in the areas of operating systems, databases and advanced applications to integrate tertiary storage. We point out that different assumptions about the underlying technology result in entirely different algorithms and system design. We conclude the report with a speculation of future trends.

[5]
"Scalable Access Within the Context of Digital Libraries" X. Cheng, R. Dolin, R. Kothuri, M. Neary, S. Prabhakar, D. Wu, D. Agrawal, A. El Abbadi, A. Singh, T. Smith and J. Su to appear in Advanced Digital Libraries Forum, 97.

This paper presents a summary of some of the work-in-progress within the Alexandria Digital Library Project. In particular, we present scalable methods of locating information at different levels within a distributed digital library environment. Starting at the high level, we show how queries can be routed to appropriate information sources. At a given source, efficient query processing is supported by using materialized views and multidimensional index structures. Finally, we propose solutions to the problem of storage and retrieval of large objects on both secondary and tertiary storage devices.

[6]
"Browsing and Placement of Multiresolution Images on Secondary Storage" S. Prabhakar, D. Agrawal, A. El Abbadi, A. Singh and T. Smith, submitted for publication.

With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as wavelets can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as thumbnails or icons of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated with results from experiments with real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.

[7]
"Classifying Network Architectures for Locating Information Sources", R. Dolin, D. Agrawal, and A. El Abbadi, Proceedings of the Fifth DASFAA Conference", April 1997.

This paper presents three broad classes of network architecture that support discovery of information sources. Relevant concepts such as query routing and the extraction, propagation, and retrieval of metadata are defined. Based on these concepts, different models of locating and querying relevant information sources are presented. Finally, we estimate several important characteristics of these models and classes as well as their expected scalability.

[8]
Optimal Allocation of Two-dimensional Data, K. Abdel-Ghaffar and A. El Abbadi. Proceedings of the 6th International Conference on Database Theory (ICDT'97), Delphi, Greece, January 1997, pp. 409--418.

Efficient browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing a two dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieving all information related to tiles overlapping with the query. Necessary and sufficient conditions were derived for strictly optimal allocations of two-dimensional data. These methods, when they exist, guarantee that for any query, the minimum number of tiles are assigned the same storage device, and hence ensures maximal retrieval concurrency.

[9]
"Scheduling Tertiary I/O in Digital Libraries", S. Prabhakar, D. Agrawal, A. El Abbadi and A. Singh submitted for publication.

With the recent improvements in network and processor speeds, digital libraries and other data intensive applications have become much more feasible than ever before. The only practical solution for their enormous storage needs is tertiary storage. Automated access to tertiary storage is made possible through tape and optical disk robotic libraries. The slow speeds of operation of the drives and library robotics imply large access times. This high access latency and the bursty nature of I/O traffic result in the accumulation of I/O requests at the storage library. We study the problem of scheduling these requests to improve performance. We focus on scheduling policies that process all requests on a loaded medium before unloading it. For single drive settings an efficient algorithm that produces optimal schedules is developed. For multiple drives the problem is shown to be NP-Complete. Efficient and effective heuristics are presented for the multiple drive case. The scheduling policies developed achieve significant performance gains over more naive policies. The algorithms developed are simple to implement and are not restrictive. The study is general enough to be applicable to any storage library handling removable media, such as tapes and optical disks.

[10]
Pharos: A Scalable Distributed Architecture for Locating Heterogeneous Information Sources, R. Dolin, D. Agrawal, L. Dillon, A. El Abbadi, Computer Science Department, University of California, Santa Barbara, Technical Report Number TRCS96-05, 1996.

Pharos is a scalable distributed architecture for locating heterogeneous information sources. The system incorporates a hierarchical metadata structure into a multi-level retrieval system. Queries are resolved through an iterative decision-making process. The first step retrieves coarse-grain metadata, about all sources, stored on local, massively replicated, high-level servers. Further steps retrieve more detailed metadata, about a greatly reduced set of sources, stored on remote, sparsely replicated, topic-based mid-level servers. We present results of a simulation which indicate the feasibility of the architecture. We describe the structure, distribution, and retrieval of the metadata in Pharos to enable users to locate desirable information sources over the Internet.