The Alexandria project has been initiated to build a digital library for map and satellite images. Designed for content-based retrieval, the relevant information in each image is encoded in the form of a multi-dimensional feature vector. Though representing images by feature vectors greatly facilitates user queries, indexing these vectors degrades performance when the number of dimensions is large. We consider 2 popular techniques (DFT and SVD) to reduce the dimension of feature vectors, and study their retrieval performance with respect to recall and precision. We find that though SVD generally out-performs DFT, DFT compares favorably in a limited range suitable for browsing large image databases.
The Alexandria Digital Library Project has been tasked with the goal of building a digital geographical library. To meet these requirements we have designed and prototyped an intelligent Data Store to store its holdings; the library's map, image, and geographical data are viewed as a collection of distributed objects. Developed using the Java language, the Data Store was designed to address the problems associated with digital library storage: interoperability, extensibility, distribution, and elimination of server bottlenecks.
This report summarizes current state of the art in tertiary storage systems. We begin with a comprehensive discussion of magnetic tape and optical storage technologies. This is followed by a classification of commercial products based on their performance characteristics. Our analysis of product data indicates that in contrast to disk technology, tertiary storage products have significant variability in terms of data transfer rates as well as other performance figures. We then summarize efforts in the areas of operating systems, databases and advanced applications to integrate tertiary storage. We point out that different assumptions about the underlying technology result in entirely different algorithms and system design. We conclude the report with a speculation of future trends.
This paper presents a summary of some of the work-in-progress within the Alexandria Digital Library Project. In particular, we present scalable methods of locating information at different levels within a distributed digital library environment. Starting at the high level, we show how queries can be routed to appropriate information sources. At a given source, efficient query processing is supported by using materialized views and multidimensional index structures. Finally, we propose solutions to the problem of storage and retrieval of large objects on both secondary and tertiary storage devices.
With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as wavelets can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as thumbnails or icons of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated with results from experiments with real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.
This paper presents three broad classes of network architecture that support discovery of information sources. Relevant concepts such as query routing and the extraction, propagation, and retrieval of metadata are defined. Based on these concepts, different models of locating and querying relevant information sources are presented. Finally, we estimate several important characteristics of these models and classes as well as their expected scalability.
Efficient browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing a two dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieving all information related to tiles overlapping with the query. Necessary and sufficient conditions were derived for strictly optimal allocations of two-dimensional data. These methods, when they exist, guarantee that for any query, the minimum number of tiles are assigned the same storage device, and hence ensures maximal retrieval concurrency.
With the recent improvements in network and processor speeds, digital libraries and other data intensive applications have become much more feasible than ever before. The only practical solution for their enormous storage needs is tertiary storage. Automated access to tertiary storage is made possible through tape and optical disk robotic libraries. The slow speeds of operation of the drives and library robotics imply large access times. This high access latency and the bursty nature of I/O traffic result in the accumulation of I/O requests at the storage library. We study the problem of scheduling these requests to improve performance. We focus on scheduling policies that process all requests on a loaded medium before unloading it. For single drive settings an efficient algorithm that produces optimal schedules is developed. For multiple drives the problem is shown to be NP-Complete. Efficient and effective heuristics are presented for the multiple drive case. The scheduling policies developed achieve significant performance gains over more naive policies. The algorithms developed are simple to implement and are not restrictive. The study is general enough to be applicable to any storage library handling removable media, such as tapes and optical disks.
Pharos is a scalable distributed architecture for locating heterogeneous information sources. The system incorporates a hierarchical metadata structure into a multi-level retrieval system. Queries are resolved through an iterative decision-making process. The first step retrieves coarse-grain metadata, about all sources, stored on local, massively replicated, high-level servers. Further steps retrieve more detailed metadata, about a greatly reduced set of sources, stored on remote, sparsely replicated, topic-based mid-level servers. We present results of a simulation which indicate the feasibility of the architecture. We describe the structure, distribution, and retrieval of the metadata in Pharos to enable users to locate desirable information sources over the Internet.