The Information Systems team developed Pharos, a scalable, distributed architecture for locating heterogeneous information (document) sources [10]. The system incorporates a hierarchical metadata structure into a multi-level retrieval system. Queries are resolved through an iterative decision-making process. The first step retrieves coarse-grain metadata about all sources, stored on local, massively replicated, high-level servers. Further steps retrieve more detailed metadata about a greatly reduced set of sources, stored on remote, sparsely replicated, topic-based mid-level servers. Preliminary simulations indicating the feasibility of the architecture have been carried out; these tests are currently being run on our new high-end DEC Alpha. The team has described the structure, distribution, and retrieval of the metadata in Pharos, which together enable users to locate desirable information sources over the Internet.
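The two-step resolution described above can be illustrated with a minimal sketch. It is not the Pharos implementation: the `Source` record, the per-topic document counts used as metadata, and the function names `high_level_step` and `mid_level_step` are all assumptions made for illustration. The first function stands in for a query against a local high-level server that holds coarse metadata on every source; the second stands in for a query against a topic-based mid-level server that holds finer metadata on the reduced candidate set.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    # Hypothetical coarse-grain metadata: document counts per top-level topic.
    coarse: dict
    # Hypothetical detailed metadata: document counts per subtopic.
    detailed: dict = field(default_factory=dict)


def high_level_step(sources, topic, keep):
    """Rank all sources on coarse metadata and keep only the best few."""
    candidates = [s for s in sources if s.coarse.get(topic, 0) > 0]
    candidates.sort(key=lambda s: s.coarse[topic], reverse=True)
    return candidates[:keep]


def mid_level_step(candidates, subtopic):
    """Re-rank the reduced candidate set on more detailed metadata."""
    matches = [s for s in candidates if s.detailed.get(subtopic, 0) > 0]
    return sorted(matches, key=lambda s: s.detailed[subtopic], reverse=True)


# Invented example data, for illustration only.
sources = [
    Source("A", {"physics": 900, "biology": 10}, {"optics": 40}),
    Source("B", {"physics": 300}, {"optics": 250}),
    Source("C", {"biology": 700}, {"genetics": 500}),
    Source("D", {"physics": 50}, {"optics": 45}),
]

shortlist = high_level_step(sources, "physics", keep=2)  # coarse pass over all sources
ranked = mid_level_step(shortlist, "optics")             # detailed pass over the shortlist
print([s.name for s in ranked])  # prints ['B', 'A']
```

In this toy data, source A leads on coarse metadata but B leads once detailed metadata is consulted, which is the kind of refinement the iterative process is meant to allow. Only the shortlisted sources are examined in the second step.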
Such architectures must scale well in three respects: information gathering as the diversity of data increases, the dispersal of information among a growing user base, and the visualization of results as the data volume grows. Pharos scales in all of these aspects. The use of a hierarchical metadata structure greatly enhances scalability because it provides for a hierarchical network organization. Our particular hierarchical structure further increases scalability because it does not grow as a function of the number of documents in the collection [7]. The Pharos distribution scheme also enhances scalability: each user begins with the metadata available at a local high-level server. Moreover, since the mid-level metadata queries are distributed over a number of topic-based servers, there is little or no network bottleneck. Each source determines which taxonomies most appropriately fit its data domain. Diverse information, including, for example, image and sound domains, is more easily classified as the number of taxonomies grows. Furthermore, each source determines when to update its own metadata.
By collecting several taxonomy-independent factors, such as network parameters and the size of a collection, Pharos allows users to broaden their search criteria beyond those allowed by term-matching techniques. The linear ranking of sources provided by standard vector-based text retrieval techniques is generalized to give users greater freedom in visualizing and selecting the criteria for their preferred sources. Because queries are resolved iteratively, users can also apply relevance feedback techniques. Pharos is also extensible in that new taxonomies can be introduced as needed, either as part of an accepted Pharos standard such as the LC Classification, or by an independent group that defines its own taxonomy within which to classify documents. An initial evaluation of Pharos, both in terms of simulation and in comparison to other models [7], has shown the architecture to be sufficiently successful to warrant further development.
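As a rough illustration of selecting sources on taxonomy-independent factors as well as term matching, the sketch below filters on user-set limits and then ranks by a user-weighted score. The `SourceInfo` fields, the `select` function, the weighting formula, and the sample values are all hypothetical; the text above does not specify how Pharos combines these factors.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    name: str
    relevance: float      # term-matching score in [0, 1] (assumed scale)
    latency_ms: float     # a network parameter (assumed unit)
    collection_size: int  # number of documents held by the source


def select(sources, max_latency_ms, min_size, weights):
    """Filter on hard limits, then rank by a user-weighted combined score."""
    eligible = [s for s in sources
                if s.latency_ms <= max_latency_ms and s.collection_size >= min_size]
    # Normalise size and latency against the eligible set; guard against zero.
    largest = max((s.collection_size for s in eligible), default=1) or 1
    slowest = max((s.latency_ms for s in eligible), default=1.0) or 1.0

    def score(s):
        return (weights["relevance"] * s.relevance
                + weights["size"] * s.collection_size / largest
                - weights["latency"] * s.latency_ms / slowest)

    return sorted(eligible, key=score, reverse=True)


# Invented example data, for illustration only.
sources = [
    SourceInfo("X", relevance=0.9, latency_ms=800, collection_size=5000),
    SourceInfo("Y", relevance=0.7, latency_ms=100, collection_size=20000),
    SourceInfo("Z", relevance=0.8, latency_ms=150, collection_size=200),
]

by_relevance = sorted(sources, key=lambda s: s.relevance, reverse=True)
chosen = select(sources, max_latency_ms=500, min_size=100,
                weights={"relevance": 1.0, "size": 0.5, "latency": 0.5})
print([s.name for s in by_relevance])  # prints ['X', 'Z', 'Y']
print([s.name for s in chosen])        # prints ['Y', 'Z']
```

A purely linear relevance ranking places X first, while the user-chosen limits and weights exclude X for its latency and prefer Y for its size and responsiveness.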