Terence R. Smith, Steven Geffner, and Jonathan Gottsegen
Department of Computer Science and Department of Geography
University of California at Santa Barbara
Santa Barbara, CA 93106, USA
e-mail: smithtr@cs.ucsb.edu
We present a general framework to support the modeling of digital documents and user queries in the context of digital libraries (DL's). The basis of the framework is a four-component model of a DL catalog involving a document modeling component, a query modeling component, a match component, and a catalog interoperability component. Meta-information in such a catalog provides models of library documents and facilitates efficient access to information represented in the documents. In particular, meta-information is conceptualized in terms of sets of relations between nominal representations of library documents and their properties, and sets of relations between document properties. The properties of the documents are modeled in meta-information in terms of a multiplicity of languages which vary between the catalog components and between catalogs. Each of the catalog components is modeled in terms of a set of formal systems related to the languages employed in the component. Using this framework, we discuss the two critical issues of catalog intraoperability and catalog interoperability. The framework provides a basis both for the rational design of meta-information and catalogs in DL contexts, and for an analysis and resolution of the intraoperability and interoperability issues. We provide examples of the issues discussed in terms of the Alexandria Digital Library.
The most important issue facing researchers
in the area of digital libraries (DLs)
is to discover mechanisms that support efficient access to
appropriate information.
The use of increased quantities of meta-information
is clearly an important component of such mechanisms.
More important than the ability to generate and handle increasing quantities of meta-information, however, are issues concerning the nature and organization of meta-information, both within the catalog of an individual library (catalog intraoperability) and between the catalogs of different libraries (catalog interoperability).
The design and implementation of schemes for meta-information to support efficient access is difficult for many reasons including (1) the diversity and complexity of the aspects of documents modeled by meta-information; (2) the increasing variety in the types of digital documents modeled by meta-information; and (3) the increase in the number of organizations with collections of digital information and associated ``catalogs''. These factors are important in explaining both the current lack of agreement among producers, providers, and users of meta-information concerning its content, representation, or exchange and the diversity of schemes and representations for meta-information.
It currently appears unlikely that a single ``standard'' schema or language for modeling all aspects of digital documents will be designed or accepted by any broad community in the near future. Because of the factors promoting diversity, even the process of developing a variety of standards for relatively restricted communities is very difficult. A major problem confronting researchers attempting to understand the issues of meta-information and efficient access to information is the lack of a general framework in which meta-information and catalogs may be usefully modeled.
It is important, therefore, to develop a general framework that supports the modeling of digital documents and user queries in terms of meta-information, without imposing undue restrictions on the content, representation, or exchange of such models. A valuable property of such a framework would be its ability to facilitate (1) the design of general and extensible models of library documents, user queries, and DL catalogs and a resolution of the issue of catalog intraoperability; and (2) the design of extensible and interoperable DL catalogs and a resolution of the issue of catalog interoperability. The purpose of the current paper is to suggest such a framework.
The research literature that provides a foundation for our framework for meta-information modeling and catalog design is relatively large and diffuse. Two important themes that are of particular significance include conceptualizations of meta-information and research on languages that may be used to represent meta-information.
Useful conceptualizations of meta-information range from relatively early and simple ideas, typified for example by research on meta-information for geographic information (Medyckyj-Scott, [20]), to more recent efforts to develop expressive meta-information structures that are capable of handling a wide variety of services such as analysis of the data described by the metadata (Kapetanios and Kramer, [18]) and content-based searching over a variety of media (Jain and Hampapur, [15]; Chu, et al., [9]). This has resulted in partitions of the types of information that metadata should represent. One fairly basic example of such a partition is given by Berard and Keller [5], while Bohm and Rakow [6] provide a partition more consistent with the one given here. On the other hand, some researchers are promoting meta-information structures limited to a relatively small number of core elements (Weibel, [25]). Concomitant with the increase in the functions and applications of metadata is an increasing sophistication in views or models of metadata and the types of information that metadata should represent. For example, Lopez and Saacks [19] view metadata as a theory about the underlying data sets, a theory that would be useful in ``predicting'' the need for a particular data set. Lopez and Saacks also describe metadata as providing a knowledge model capturing the semantics of a particular domain. Similarly, Hsu, et al. [14] suggest that metadata should provide a representation of various data and knowledge models in a distributed environment, and Kapetanios and Kramer [18] develop the concept of metadata as representations of scientific knowledge structures for data analysis.
These extended conceptions of meta-information require more powerful languages for representing the meta-information. For example, Jarke, et al. [16] discuss languages explicitly in terms of database programming and knowledge representation languages applied to managing meta-information. The InfoHarness system (Shklar, et al., [22]) specifies a meta-information language that describes catalogs. Other research on algorithmic generation of language translations (Chen, et al., [8]), languages for interoperability (Haines, et al., [13]), and defining semantics of expressions in other languages (Sciore, et al., [21]) is also salient for the language view of meta-information that we take. Of particular interest to this paper is Tamir and Kandel's specification of languages as formal systems [24].
An important issue that does not appear to be well-addressed in the literature, however, concerns a general framework that may be used for modeling meta-information in such contexts as DL's. The current paper is an attempt to redress this apparent lack.
The paper is structured as follows. We first provide a general definition of the concept of meta-information in the context of DL's. We then describe a four-component model of the catalog of a DL that provides useful insights into the nature of meta-information and its representation in DL's. In particular we discuss meta-information in terms of relations between documents and their properties; the use of multiple languages in DL catalogs; the distribution of meta-information over the four components of our catalog model; models of each of these components in terms of formal systems; and the variation among catalogs in terms of the languages they employ in representing meta-information. Finally, we discuss two major issues relating to the design and implementation of DL catalogs, namely the issue of catalog intraoperability and the issue of catalog interoperability.
We illustrate the main ideas
in our framework with examples from
traditional libraries and from
the Alexandria Digital Library (ADL)
for spatially-referenced information (Andresen et al. [2]).
A body of meta-information may be viewed as a model of some collection of information objects. In the context of scientific databases, for example, Bretherton and Singley [7] define metadata (our meta-information) to be ``information that makes data useful'' and view metadata as information describing the nature of the data and the set of interpretations that may be placed on the data.
In these terms, we view meta-information as information that (1) is stored in the electronic catalogs of digital libraries (DL's); (2) models both documents and user queries; and (3) has as its main purpose the support of efficient user access to information in the documents of a DL. In particular, the models of documents and user queries should support access to information that is represented either explicitly or implicitly in the documents of the library's collections. This purpose leads to a key requirement of meta-information: that it provide a useful model of all aspects of library documents that are relevant in supporting access to appropriate information.
It is useful to specify this requirement in terms of categories that partition these aspects in a natural manner, including (1) the representation of the document, including its physical aspects and its logical aspects; (2) the context of the document, such as author, publisher, lineage; (3) the content of the document; (4) the terms and conditions of access to, and use of, the document; (5) the evaluation of the document, particularly with respect to its value in various applications; and (6) the relations of the document to other documents. The catalogs of traditional libraries, for example, contain document models with minimal representations of (1), (2), and (3). The catalogs of DL's are able to provide more complete models of documents in terms of all six categories.
In the remainder of this section, we present our framework for constructing models of documents and user queries. We begin by presenting a general model of a DL catalog.
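In Figure 1 we present a four-component model of a DL catalog that includes a document modeling component, a query modeling component, a match component, and a catalog interoperability component.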
First, in modeling meta-information for a general case
of our model of a DL catalog,
we view the meta-information
in terms of a set of relations of two basic types:
(1) relations between nominal representations of documents
and representations of properties of the documents;
and (2) relations between representations of
different properties of the documents.
Second, the properties of the documents
are represented in terms of (multiple) languages.
Each component of the catalog
is permitted to employ a multiplicity of languages,
and the sets of languages used in each component
are permitted to differ between components.
Third, the meta-information may be
distributed amongst the four
components of the catalog (including the
query modeling component),
and does not necessarily reside
in the document modeling component alone.
Fourth, we view each of the
four components of a catalog
in terms of a set of formal systems,
with at least one formal system corresponding to each language
in each component.
Fifth, different catalogs generally
employ different collections of formal systems for modeling
queries and documents,
different schemas and representations for meta-information,
and/or different implementations of such schemas
and representations.
The second and fifth aspects are, in part, a consequence of the forces leading to diversity in schemes for, and representations of, meta-information. We now consider each of these five aspects of our model in greater detail, illustrating each with examples from traditional libraries and ADL.
We employ two classes of relations in modeling DL meta-information. First, we employ relations between nominal representations of documents and more detailed representations of various properties of documents (such as their context and content) having the form

$$ \mathit{document} \longrightarrow \{ \mathit{property} \} \qquad (1) $$

where $\{\cdot\}$ indicates a set of elements. Second, we employ relations between representations of the various properties of the documents

$$ \{ \mathit{property} \} \longrightarrow \{ \mathit{property}' \} \qquad (2) $$

Relations of type (1) are critical for supporting user access, since the inverse relations

$$ \mathit{property} \longrightarrow \{ \mathit{document} \} \qquad (3) $$

may be used to find appropriate documents given user-specified properties and values. Typically, an initial set of relations of type (1) is constructed in the document modeling component of the catalog. Relations of types (1) and (2) may occur in the remaining three components of the catalog.
It is important to note that, in general, the relations between documents and their properties need not be represented in explicit form. In particular, meta-information may be represented in terms of (1) implicit values that are made explicit by the application of some procedure; and (2) the appropriate composition of relations of type (1) and type (2). One may produce additional relations of type (1) by appropriate concatenations of relations of type (1) and (2). There are significant advantages in terms of both storage space requirements and precomputation that result from this fact. The computation of the explicit forms of such relations may often be delayed until query time.
Advantages to viewing meta-information in terms of the two sets of relations are that it (1) emphasizes in a natural manner (inverse relations) that the primary purpose of meta-information is to access documents on the basis of their properties; (2) emphasizes that meta-information does not have to be represented in explicit form (e.g. as a set of tables in an RDBMS), but may be represented implicitly until the time of query evaluation; and (3) provides a natural means of expressing the extensibility of meta-information schemes, since one may specify an extension to a meta-information scheme by simply adding new relations. Implicit representations of meta-information are not possible in traditional library catalogs.
The model of catalog meta-information as a set of relations is clearly applicable to the case of traditional libraries. The catalogs of such libraries may be viewed in terms of relations between nominal representations of documents (such as call numbers) and simple representations of the values of a relatively small number of properties such as author, title, and subject matter. As we note below, however, the current catalog of ADL supports the computation of relations between documents and their properties at query time, rather than having all the meta-information precomputed and stored in explicit form.
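The two classes of relations, and the way further document-property pairs can be derived at query time rather than stored, can be sketched as follows. This is a minimal illustration only: the document identifiers, the property name `theme`, and the function names are hypothetical and are not taken from the ADL catalog.

```python
# Relation between nominal document identifiers and (property, value) pairs.
# All identifiers and values below are hypothetical.
doc_props = {
    "map-001": {("theme", "gabbro")},
    "map-002": {("theme", "basalt")},
    "map-003": {("theme", "sandstone")},
}

# Relation between property values and other property values.
prop_props = {
    ("theme", "gabbro"): {("theme", "igneous rock")},
    ("theme", "basalt"): {("theme", "igneous rock")},
}


def derived_props(doc_id):
    """Compose the two relations on demand to obtain further document-property pairs."""
    explicit = doc_props.get(doc_id, set())
    implied = set()
    for prop in explicit:
        implied |= prop_props.get(prop, set())
    return explicit | implied


def find_documents(prop):
    """Inverse relation: from a property to the set of documents that have it."""
    return {doc_id for doc_id in doc_props if prop in derived_props(doc_id)}


print(sorted(find_documents(("theme", "igneous rock"))))
```

Nothing about "igneous rock" is stored against any document here; the pairing only comes into existence when `find_documents` is evaluated, which is the sense in which the explicit form of a relation can be delayed until query time.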
In this section, we argue that many of the tools that librarians typically employ in constructing catalogs and in supporting catalog operations may be viewed as definitions of languages whose expressions denote the properties of documents. In traditional libraries such tools include subject headings lists, thesauri, and gazetteers. These tools support the construction of meta-information in which document properties are typically expressed in either natural language (as in the representation of titles, author names, and abstracts) or in highly constrained subsets of natural language (as in the representation of subject headings). This view extends easily to the counterparts and generalizations of such tools in DL's, with the difference that formal languages may also be used.
We argue that catalogs involve a multiplicity of languages with distinct semantics. It follows that a key set of issues for DL catalogs is related to the use of multiple languages for representing the properties of documents and to questions of semantic intraoperability and interoperability.
It is possible to specify the languages employed in DL catalogs at a variety of levels of formality. The conceptual level involves relatively informal specifications of the language, typically in terms of natural language, with semantics given informally in terms of the cognitive processing of the reader. In particular, expressions in the language are interpreted in terms of the cognitive representation of ``concepts''. The logical level involves a more formal approach in which the semantics of expressions may be defined in terms of the expressions of another (declarative) language. Many of the languages employed in library catalogs, for example, may be specified in terms of a language of the first order predicate calculus (FOPC). In this case, a simple truth value semantics may be appropriate in providing meaning to the expressions of the language. Finally, the physical level involves digital representations of the languages in some computational device, with semantics specified in terms of machine operations. In the following discussion, we focus on the logical level.
We indicate the important roles of languages in the construction and use of catalogs in terms of two examples, the first being traditional thesauri and their generalizations for DL's and the second being the Federal Geographic Data Committee (FGDC) metadata content standard for digitized documents of spatially-indexed information (FGDC [10]). We indicate further examples below.
Traditional thesauri (see ANSI/NISO [3]) specify simple languages whose terms have denotations in limited domains of application. As we note below, however, a thesaurus may also define a metalanguage. This fact has implications for the role of a thesaurus in our four-component catalog model. The specification document for a thesaurus may be viewed as specification at the conceptual level. It is typically impossible to eradicate all ambiguity in relation to the denotations of the terms, which are ultimately provided by the informal interpretations of the cognitive processing of the reader.
Recasting the specification of a thesaurus at the logical level reveals more clearly the nature of the language being defined at the conceptual level of specification. For a typical thesaurus, such a recasting leads to a specification in terms of (1) a set of constant symbols that denote either classes of entities (such as ``igneous rock'') or instances of entity classes (such as ``gabbro''); and (2) a set of relation symbols that denote relations between the entities denoted by the constant symbols. Examples of the standard relation symbols include ``narrow-term'' and ``broad-term''.
Furthermore, one may recast the definitions of narrow and broad
terms as a set of axioms associated
with the language. For example,
the fact that ``gabbro'' is a narrow term
with respect to ``igneous rock'' may be
expressed in terms of the syntax
of FOPC as:

$$ \mathrm{narrow\text{-}term}(\mathrm{gabbro},\ \mathrm{igneous\ rock}) \qquad (4) $$
The linguistic complexity of thesauri is seen in the synonymy relationship that is typically an important aspect of thesauri. One may view it as part of a metalanguage that defines aspects of the semantics of the thesaurus. In this sense, the synonymy relation between constant symbols $t_1$ and $t_2$ for two synonymous terms is ``$t_1 \equiv t_2$''. An alternative, and perhaps more useful, view is that synonymy represents an inference rule of the form:

$$ \mathrm{EXP}(t_1) \equiv \mathrm{EXP}(t_2) \qquad (5) $$

in which EXP is some expression and $\equiv$ is semantic equivalence. This inference rule indicates that one may substitute $t_2$ for $t_1$ in the same expression without changing its meaning.
A thesaurus may be used in different ways in different components of a catalog. The basic language for a thesaurus, for example, may be used both in the document modeling component (e.g. indexing of documents using ``canonical'' terms of the thesaurus) and the query modeling component (e.g. query formulation using ``canonical'' terms of the thesaurus and reformulation of queries using broad-term/narrow-term relations). The metalanguage component of a thesaurus is used mainly in the query modeling component, where expressions employed by users may be replaced using the inference rule (5).
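The two roles of a thesaurus described above can be sketched in code: narrow-term facts act as axioms of the basic language, while synonymy acts as a substitution rule applied to query terms. This is an illustrative sketch, not ADL's implementation. Only ``gabbro'' and ``igneous rock'' come from the text; the other terms and all function names are assumptions.

```python
# Axioms of the basic thesaurus language, as (narrow term, broad term) pairs.
# "basalt" is an illustrative addition.
NARROW_TERM = {
    ("gabbro", "igneous rock"),
    ("basalt", "igneous rock"),
}

# Metalanguage: maps a non-canonical user term to its canonical synonym.
# The entry below is a hypothetical example.
SYNONYM = {
    "magmatic rock": "igneous rock",
}


def canonical(term):
    """Apply the synonymy rule: substitute the canonical term for a synonym."""
    return SYNONYM.get(term, term)


def narrower(term):
    """Terms that are one step narrower than `term` under the axioms."""
    return {narrow for (narrow, broad) in NARROW_TERM if broad == term}


def expand_query(terms):
    """Canonicalize each user term, then add its narrower terms."""
    expanded = set()
    for term in terms:
        canon = canonical(term)
        expanded.add(canon)
        expanded |= narrower(canon)
    return expanded


print(sorted(expand_query(["magmatic rock"])))
```

In this sketch the document modeling component would only ever see canonical terms, while `canonical` and `expand_query` belong to the query modeling component, mirroring the division of labor described in the text.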
One may perform a similar analysis on the FGDC metadata content standard for digital spatially-referenced information. This standard has been adopted and implemented as a major language in both the document modeling and query modeling components of ADL. One may again represent this language as a special case of a FOPC language. Because the content standard is modeled in ADL in terms of the relational data model, which is in turn implemented in an RDBMS, one may view the relational schema as specifying a FOPC language in which the properties of documents are represented. The tuples of the relations represented in the language may be viewed as a set of axioms associated with the language, as may constraints that are part of the language specification, while the relational algebra used to manipulate the relations may be viewed as an inference mechanism.
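The reading of a relational schema as a language, its tuples as axioms, and relational algebra as inference can be illustrated with a small in-memory database. The table and column names below are a hypothetical simplification chosen for this sketch; they are not the FGDC element names or the ADL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema plays the role of the language: it fixes which relation
# symbols exist and what arguments they take.
conn.execute(
    "CREATE TABLE document ("
    "doc_id TEXT PRIMARY KEY, title TEXT, "
    "west REAL, east REAL, south REAL, north REAL)"
)

# The stored tuples play the role of axioms (ground facts).
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("doc-1", "Coastal sheet", 0.0, 2.0, 0.0, 2.0),
        ("doc-2", "Inland sheet", 5.0, 7.0, 0.0, 2.0),
    ],
)


def titles_covering(x, y):
    """Selection and projection act as the inference step: derive the
    titles of documents whose extent contains the point (x, y)."""
    rows = conn.execute(
        "SELECT title FROM document "
        "WHERE west <= ? AND east >= ? AND south <= ? AND north >= ? "
        "ORDER BY title",
        (x, x, y, y),
    )
    return [row[0] for row in rows]


print(titles_covering(1.0, 1.0))
```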
In accordance with our requirement for the generality of our framework we place no constraints on the languages that may be used in constructing the representations of document properties. Various possibilities for the languages include natural languages, subsets of natural languages, and formal languages.
An important reason for requiring a framework in which catalogs may support a multiplicity of languages is the large range of document properties that must be represented in the meta-information of a catalog. This range includes the six general classes of characteristic information mentioned in section 2 above. Since the content of a document, for example, may relate to any phenomenon representable in symbolic or iconic form, it is clear that linguistic expressiveness of great generality is required. This is particularly the case for DL's whose collections involve many classes of documents such as multimedia DL's.
Different languages have different degrees of ``natural'' expressiveness with respect to the phenomena that they are able to represent, and corresponding different costs of usage, particularly in computational environments. While natural language is highly expressive, and is useful for human to human communication and machine to human communication, it is too complex (i.e. costly to use) and too ambiguous in its full generality for human to machine and machine to machine communication.
A common strategy, therefore, is to use (1) constrained subsets of natural language (sometimes termed controlled languages) or (2) formal languages that have a relatively simple syntax and whose expressions have relatively unambiguous interpretations in terms of limited sets of phenomena. The first approach typically involves the use of the specialized thesauri developed for limited domains of application.
The use of a variety of relatively simple languages represents a ``divide-and-conquer'' approach to the problem of modeling a diverse array of document properties in the catalog in terms of relatively simple languages. Meta-information may thus be constructed using concatenations of expressions from a variety of languages.
The catalog component of ADL involves multiple languages. First, the document modeling component of ADL currently employs two languages for specifying the relations between documents and their properties. These are languages based on the USMARC specification of meta-information and on the FGDC content standard for digital spatial information. Development plans for ADL involve the use of additional languages, including subject heading languages and various thesauri.
Second, the current query modeling component of ADL not only employs the languages of the document modeling component (i.e. FGDC and USMARC), but also a language based on gazetteers. A gazetteer is essentially a mapping between named features on and near the surface of the Earth and representations of the ``spatial projection'' or footprint of the feature on the surface of the Earth. The language of a gazetteer is used to represent queries in which the user specifies a named feature to be contained in any retrieved map. The footprint of the feature is then ``matched'' with footprints of maps modeled in the meta-information of the document modeling component. Development plans call for the use of languages based on subject headings and thesauri as well as on ``extended thesauri'' containing information on the co-occurrence of expressions in various languages.
Third, the catalog interoperability component of ADL involves languages based on components of the Z39.50 protocol, such as BIB-1 and the proposed GEO-1. Finally, the match component currently involves no languages of its own. This example illustrates the manner in which languages employed in a catalog may vary between the components of a catalog.
An important property of our general framework that may be exploited in the construction of efficient catalogs is that meta-information is not restricted to the document modeling component. It may occur in any of the four components of a catalog.
An example of meta-information that is distributed over catalog components for reasons of efficiency may be found in ADL. While one could construct meta-information models of digitized maps that contained all named examples of earth features occurring in gazetteers, the costs associated with the construction, storage, and searching of these ``records'' would be relatively high. By placing the gazetteer feature-footprint relation in the query modeling component, however, the user can choose the named features to be represented in the maps of interest. Retrieval of the appropriate maps can then be easily computed ``on the fly''.
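A minimal sketch of this arrangement is given below, assuming for simplicity that footprints are axis-aligned bounding boxes (the gazetteer discussed in the text also allows points, lines, and polygons). The feature name, map identifiers, and coordinates are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned footprint: a simplifying assumption for this sketch."""
    west: float
    south: float
    east: float
    north: float


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (outer.west <= inner.west and outer.south <= inner.south
            and outer.east >= inner.east and outer.north >= inner.north)


# Query modeling component: named feature -> footprint (hypothetical values).
GAZETTEER = {"Example Lake": Box(-119.9, 34.3, -119.6, 34.5)}

# Document modeling component: map -> footprint (hypothetical values).
MAP_FOOTPRINTS = {
    "map-A": Box(-120.5, 34.0, -119.0, 35.0),
    "map-B": Box(-118.7, 33.7, -118.0, 34.3),
}


def maps_containing(feature_name):
    """Resolve the feature name at query time, then match footprints."""
    footprint = GAZETTEER.get(feature_name)
    if footprint is None:
        return []
    return sorted(m for m, box in MAP_FOOTPRINTS.items()
                  if contains(box, footprint))


print(maps_containing("Example Lake"))
```

The map records never mention any feature name; the feature-to-map relation is only computed when a user supplies a name, which is the saving in construction and storage that the text describes.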
Another reason for distributing meta-information over components relates to differences in their functionality. The models of documents in the document modeling component may be represented in languages chosen for their efficient representation of document characteristics that are easily extracted or because they are part of some standard. From a user's perspective, however, other characteristics may be more expressive for accessing information. If such characteristics can be expressed in a language that is employed in the query modeling component of the catalog and if its expressions can be related semantically to those of the meta-information in the document modeling component, it may be of value to employ different languages.
Examples of this approach involve the use of a language in the query modeling component that is ``richer'' from a user's viewpoint than the language in the document modeling component. As a simple example, a thesaurus may be used to translate a relatively broad array of user terms into synonymous, ``canonical'' terms used in the meta-information. A less trivial example involves generalizations of thesauri in which co-occurrences of terms are derived from empirical analyses of documents. The richer set of terms in the model of the user's query may then be interpreted statistically using the more limited terms of the document models.
Our preceding discussion of catalog tools, such as thesauri and the FGDC metadata content standard, indicated how such tools could be viewed in terms of languages. It also showed, by example, how they could be viewed as possessing associated sets of axioms and associated sets of inference mechanisms. We also noted that, in relation to the multiplicity of languages, it is frequently important to be able to interpret the expressions of one language in the terms of some other language.
If we regard the linguistic expressions of meta-information relations as a subset of the associated set of axioms, we may view the components of the catalog in terms of various formal systems. The components of any of these formal systems include a language L, a set of axioms A, a set of inference rules F, and a set of interpretations I. These interpretations map expressions in one language into the expressions of other languages, which may include natural languages, declarative formal languages, and programming languages.
The interpretations of the languages play an important role in multilanguage catalogs. In the matching component, for example, queries expressed partly in one language must be interpreted in terms of another language in order to carry out a semantically meaningful match. In the catalog interoperability component queries from alien catalogs that employ different languages must be interpreted in terms of the languages employed in constructing the document models.
Reasons for analyzing the components of a catalog in terms of formal systems are that, first, it provides a rigorous basis for the analysis of the expressive power of the meta-information and of the computational complexities associated with processing queries in such systems. Second, it provides a framework for designing and extending catalog systems in a modular and systematic manner.
In order to make this view of catalog components
as formal systems clear,
we present an example
showing how the representation of a gazetteer
in the query modeling component of ADL
may be viewed as a formal system.
First, in relation to the language L, as viewed at the logical level,
(1) the constant symbols of the language include
symbols that are employed to denote feature
classes, feature instances, and aspects of the
footprints of the feature instances, such as points,
lines, and polygons.
(2) The function symbols of the language include
symbols that denote a spatial projection operator
and the usual binary set operators (intersection,
union,...).
(3) The relation symbols include the same set
of relation symbols as thesauri, including
``ISA'' and ``PART_OF''.
Second, the axioms A include not only the axioms
concerning ``ISA'' and ``PART_OF''
relations of a thesaurus, but also the
set of axioms that define the relations
between named feature instances and the footprints,
such as:

$$ \mathrm{FOOTPRINT}(\mathit{feature},\ \mathrm{polygon}(p_1, p_2, \ldots, p_n)) $$

in which the second term in the relation represents a polygon defined in terms of n points. Third, the inference rules F may be viewed as the standard inference rules of FOPC. Both the construction and manipulation of meta-information may be viewed as involving the use of these inference rules. Finally, the set of interpretations I may include, for example, mappings from the expressions of L into the expressions of the language associated with the FGDC standard.
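The four parts of such a formal system can be laid out as a small data structure. The sketch below is only illustrative: the feature names are hypothetical, the single rule is a hard-coded stand-in for FOPC inference over an assumed transitivity axiom for ``PART_OF'', and the interpretation targets a generic bounding-box record rather than the actual FGDC language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Fact = Tuple  # a ground atom, e.g. ("PART_OF", "Example Cove", "Example Bay")
Rule = Callable[[Set[Fact]], Set[Fact]]


def part_of_transitivity(facts: Set[Fact]) -> Set[Fact]:
    """Stand-in for FOPC inference: assumes PART_OF is transitive."""
    pairs = {(f[1], f[2]) for f in facts if f[0] == "PART_OF"}
    return {("PART_OF", a, d) for (a, b) in pairs for (c, d) in pairs if b == c}


def footprint_to_bounds(fact: Fact) -> dict:
    """Interpretation: map a FOOTPRINT axiom to a bounding-box record."""
    _, name, polygon = fact
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return {"name": name, "west": min(xs), "east": max(xs),
            "south": min(ys), "north": max(ys)}


@dataclass
class FormalSystem:
    language: Set[str]    # L: symbols of the language
    axioms: Set[Fact]     # A
    rules: List[Rule]     # F
    interpretations: Dict[str, Callable[[Fact], dict]] = field(default_factory=dict)  # I

    def theorems(self) -> Set[Fact]:
        """Apply the rules repeatedly until no new facts are derived."""
        facts = set(self.axioms)
        while True:
            derived = set().union(*(rule(facts) for rule in self.rules)) - facts
            if not derived:
                return facts
            facts |= derived


cove = ("FOOTPRINT", "Example Cove",
        ((0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)))
gazetteer = FormalSystem(
    language={"ISA", "PART_OF", "FOOTPRINT"},
    axioms={
        ("PART_OF", "Example Cove", "Example Bay"),
        ("PART_OF", "Example Bay", "Example Coast"),
        cove,
    },
    rules=[part_of_transitivity],
    interpretations={"bounds": footprint_to_bounds},
)

print(("PART_OF", "Example Cove", "Example Coast") in gazetteer.theorems())
print(gazetteer.interpretations["bounds"](cove))
```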
For a variety of reasons, including historical reasons, different catalogs can employ a variety of different languages in representing meta-information. Additionally, while the same basic languages may be chosen for different catalogs, their implementations and schemas may be different. Both effects can lead to semantic incompatibility. We discuss this issue in further detail below.
The proposed framework is of value in providing a basis for rational catalog design. In particular, it provides a natural means of analyzing and resolving two basic issues that arise in the design of DL catalogs because of the multiplicity of languages employed.
The first issue, which we term the catalog intraoperability issue, concerns the goal of designing and implementing efficient yet expressive catalog systems. A fundamental trade-off that is analyzable within the framework involves (1) the advantages to be gained by a divide-and-conquer approach in which documents and queries are modeled in terms of a set of relatively simple and specialized languages; and (2) the disadvantages that arise in multi-language environments because of the necessity of constructing and applying interpretations to ensure semantic compatibility between expressions represented in the different languages. If the choice of languages in the components is determined, then the problem becomes one of making the interactions between the different languages semantically compatible in the most efficient manner. We term this the ``weak version'' of the catalog intraoperability issue.
The second issue, which we term the catalog interoperability issue, is closely related to the first issue. It concerns the goal of making different catalog systems semantically interoperable in as efficient a manner as possible. Since the reason for using different languages in this case is not to construct an ``optimal'' catalog, but rather to provide an efficient solution in the face of uncontrollable language heterogeneity, the problem is to apply multiple interpretations as efficiently as possible.
We note that the weak version of the catalog intraoperability problem is clearly almost identical to the catalog interoperability problem as stated here. It follows that the strategy for solving these problems should be similar. Two distinct strategies include (1) making direct translations between any pair of languages that interact (e.g. through the match component); (2) translating from each language to some intermediary language. Clearly we may combine these strategies in different ways.
We now examine further the issues of
intraoperability and interoperability,
and discuss some of the approaches that have
been taken in ADL with respect to these two sets
of issues.
If a problem involves N languages,
we term the first strategy the ``N^2 strategy'',
since in general one must
construct of the order of N^2 translations
between the languages.
We term the second strategy
the ``N strategy'', since one
must construct of the order of N translations.
In particular, we
discuss examples of both the N^2
and the N approaches
in section 3.2 concerning interoperability.
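The difference in growth between direct pairwise translation and translation through an intermediary language can be made concrete with a short sketch. It assumes that translations are directed (one translator per ordered pair of languages) and that the intermediary language is distinct from the N catalog languages. Both assumptions, and the function names, are ours and are used for illustration only.

```python
def pairwise_translators(n: int) -> int:
    """Directed translators needed when every language maps to every other."""
    return n * (n - 1)


def intermediary_translators(n: int) -> int:
    """Directed translators needed when each language maps to and from one hub."""
    return 2 * n


# Tabulate both counts for a few catalog sizes.
for n in (2, 5, 10, 50):
    print(n, pairwise_translators(n), intermediary_translators(n))
```

Under these assumptions the pairwise count grows quadratically while the intermediary count grows linearly, so the two counts only become comparable when very few languages are involved.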
Examples of specific issues that arise in the strong
version of the catalog intraoperability problem include
(1) deciding which languages to use
in each of the components of the catalog,
and particularly in the document modeling
and the query modeling components;
(2) deciding how to solve the language translation problem
in cases of multiple languages.
Choices in the latter problem
include (a) whether to adopt the N strategy or the N^2
strategy;
and (b) whether to perform the appropriate translations
prior to performing the match operations
or to embed the translation process in
the match procedure itself.
In relation to the choice of languages for catalog components, a particularly simple solution involves choosing a single language for modeling both documents and queries. A prototype version of ADL employed this approach by using FGDC as the main language in both components. One loses expressiveness, however, if the language is too restrictive, while one pays a heavy computational price if the language is too general (e.g. natural language). One also loses the power of using multiple languages, as previously discussed.
In relation to the choice of language translation strategy,
the advantages of embedding the translation process
in the match procedure include computational
efficiencies if the translation is procedurally
part of the match. Disadvantages include
the lack of modularity and hence of flexibility
if it is necessary to modify
either the match procedure or the interpretations.
The trade-offs in choosing between the N
and N^2
strategies are relatively self-evident,
and involve the issue of finding a single appropriate
language and the cost of making the translations.
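The choice between translating before the match and embedding the translation in the match procedure can be illustrated with a minimal sketch. The field names (`place`, `location`), the record representation, and all function names are hypothetical and are not taken from the ADL implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Translator = Callable[[Record], Record]


def rename_fields(mapping: Dict[str, str]) -> Translator:
    """Build a translator that renames mapped fields and drops unmapped ones."""
    def translate(record: Record) -> Record:
        return {mapping[k]: v for k, v in record.items() if k in mapping}
    return translate


def match(query: Record, documents: List[Record]) -> List[Record]:
    """Language-neutral match: every query field must equal the document field."""
    return [d for d in documents if all(d.get(k) == v for k, v in query.items())]


def match_embedded(query: Record, documents: List[Record]) -> List[Record]:
    """Match with the query-to-document interpretation hard-wired inside it."""
    return [d for d in documents if d.get("location") == query.get("place")]


docs = [{"location": "Santa Barbara"}, {"location": "Goleta"}]
query = {"place": "Santa Barbara"}

# Modular variant: translate first, then apply the unchanged match procedure.
to_doc_language = rename_fields({"place": "location"})
modular = match(to_doc_language(query), docs)

# Embedded variant: a single pass, but the interpretation cannot be swapped out.
embedded = match_embedded(query, docs)
```

In this sketch both variants return the same documents. The modular variant lets the interpretation change without touching `match`, while the embedded variant avoids building an intermediate translated query.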
Since such problems are difficult to analyze in the general case, we now provide examples of more restricted problem-solving in relation to the intraoperability problem. In particular, we briefly discuss two problems that arose during the construction of the ADL catalog.
The first problem, which was alluded to above, involved the decision of whether to construct document models for maps containing meta-information about the features represented in them or whether to embed meta-information relations from the gazetteer in the query modeling component. The latter solution was adopted on the grounds of the amount of processing that is required to construct the meta-information models of the maps and the size of the models that would result.
The second problem concerned the construction of the gazetteer in the query modeling component. The basic issue was that two distinct gazetteers were available, namely the Geographic Names Information System (GNIS) from the USGS and the Board on Geographic Names (BGN) from the DMA. Both of these gazetteers contain a list of features, a classification of the features and the spatial projections of the features. The basic problem was that the underlying languages of the two gazetteers, while similar, were sufficiently different to require non-trivial interpretations between their corresponding terms. Hence the issue of semantic compatibility arose.
For example, the GNIS classifies its features into 65 feature types. It also has an implicit concept of ``generics'', which is a subclassification of features. The generics, however, are not proper subsets of the feature types, so a pure hierarchy is difficult to construct. The BGN contains nine feature classes as the highest level of a hierarchy. It also stores an explicit sublevel of the hierarchy in the form of 638 feature types. Unlike the GNIS hierarchy, which is informal, the BGN feature types are proper subsets of the feature classes forming a formal hierarchy. Because the classification schemes for features in the two gazetteers depend on the cognitive categories applied to real world objects (i.e. they are specified at the conceptual level), it is difficult to combine them and maintain this hierarchy. To the extent that the ADL gazetteer uses a hierarchy to facilitate queries, this mismatch between GNIS and BGN was problematic.
One possible solution to the gazetteer language problem was to use both gazetteers as distinct languages. The difficulty with this approach is that their domains of application overlap, so that significant inefficiencies would result. An alternative approach was to translate both of them into the terms of a common language (i.e. to construct a merged gazetteer). The latter solution was adopted; the most difficult issue was the integration of the feature types/classes. The resulting gazetteer is a superset of the two gazetteers and involves a more general classification scheme.
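The merged-gazetteer approach amounts to interpreting each source's feature types in a common classification. The sketch below illustrates the idea only: the feature-type codes, the merged class names, and the entries are illustrative placeholders and do not reproduce the actual GNIS, BGN, or ADL vocabularies.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    name: str
    source: str        # "GNIS" or "BGN"
    feature_type: str  # type in the source gazetteer's own vocabulary


# Hypothetical interpretations from (source, source type) to a merged class.
TO_MERGED: Dict[Tuple[str, str], str] = {
    ("GNIS", "stream"): "watercourse",
    ("BGN", "STM"): "watercourse",
    ("GNIS", "summit"): "elevation",
    ("BGN", "MT"): "elevation",
}


def merged_class(entry: Entry) -> str:
    """Translate a source feature type to the merged scheme, or flag it."""
    return TO_MERGED.get((entry.source, entry.feature_type), "unclassified")


entries = [
    Entry("Mission Creek", "GNIS", "stream"),
    Entry("Rio Example", "BGN", "STM"),
    Entry("Example Peak", "BGN", "MT"),
]

# Group entries from both sources under the merged classification.
by_class: Dict[str, List[str]] = {}
for e in entries:
    by_class.setdefault(merged_class(e), []).append(e.name)
```

Types with no interpretation fall into an explicit `unclassified` bucket here, which is one simple way to surface the mismatches between the two schemes instead of hiding them.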
The key problem in the catalog
interoperability issue is
to find an efficient solution
to the problem of interpreting the languages
of one catalog into the expressions
of the languages in another catalog.
While the N^2
and N strategies
provide basic approaches to this problem,
we note that there is a large space of
possibilities to explore.
In adopting the N strategy, for example,
there are many candidates for the intermediary language,
including the language components of Z39.50,
various knowledge representation languages
such as KIF (Genesereth [11]), ONTOLINGUA (Gruber [12]),
and Representational Structures (Smith et al. [23]).
Furthermore, within Z39.50, it is possible to exchange expressions in other languages, with an indication of the language that is being used for any specific set of expressions.
Since the resolution of such issues requires
a great deal of further investigation,
we limit our discussion to describing
three limited investigations involving
the N^2
and N strategies.
These experiments were intended to determine
the relative costs of the two approaches.
An interoperability experiment was performed by ADL and the digital library being developed at the University of California at Berkeley (UCB) under the NSF/NASA/ARPA Digital Library Initiative (DLI). The goal of the experiment was to match models of queries from the UCB catalog with models of documents from the ADL catalog. For the experiment, both DL's limited their query models and their document models to those based on the FGDC language. Furthermore, both DL's represented the FGDC language in terms of the relational data model and implemented it in terms of the Illustra extended relational DBMS. The main difference was that the two DL's independently constructed different schema for their data models.
As part of the experiment, a direct mapping
was established between the two schemas,
following the N^2
approach.
Interconnection was provided
via Illustra's client-server architecture.
While the experiment led to good performance in terms of interoperability, the costs of employing this strategy were significant. It took an expert database programmer over 6 person-weeks to create the schema mapping. Given a realistic DL environment with a large number of catalogs using multiple languages, one must conclude that if these costs are typical, then the approach would scale poorly.
An interoperability experiment between ADL and the Stanford DLI project was established to investigate the N (or intermediary-language) approach. The two DL's implemented their catalogs in different languages, and used different DBMSs in the document modeling components.
Z39.50-BIB-1 was chosen as the intermediary language,
for translating between the
internal languages of each library.
As in the case of the experiment with UCB,
mapping decisions were still made by hand.
More significantly, the intermediary language
had a limited set of attributes
(over 50 in Z39.50-BIB-1 compared with over 300 in ADL's
internal language).
This restricted both the expressiveness of the intermediary language
and the information that could be shared.
In contrast, the N^2
approach allows libraries
to share any information that both
of their internal languages can represent.
This shortcoming, however, is inherent in the specific
intermediary language chosen,
and not in the N approach itself.
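The loss of information caused by a small intermediary attribute set can be sketched as follows. The attribute names and the mapping are hypothetical and do not reproduce the real Z39.50-BIB-1 or ADL attribute sets; the point is only that any field with no counterpart in the intermediary language cannot be shared.

```python
from typing import Dict, Set, Tuple


def via_intermediary(
    record: Dict[str, str], to_hub: Dict[str, str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Translate to the hub language; return the result and the dropped fields."""
    translated = {to_hub[k]: v for k, v in record.items() if k in to_hub}
    dropped = set(record) - set(to_hub)
    return translated, dropped


# Hypothetical attribute names; not the real BIB-1 or ADL attribute sets.
ADL_TO_HUB = {"title": "Title", "originator": "Author"}

record = {
    "title": "Goleta quadrangle",
    "originator": "USGS",
    "map_projection": "UTM",  # no counterpart in the hypothetical hub language
}
translated, dropped = via_intermediary(record, ADL_TO_HUB)
```

Reporting the dropped fields alongside the translated record makes the cost of a given intermediary language measurable, which would help when comparing candidate languages.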
The implementation in this experiment took 2 person-days, and good performance with respect to interoperability was achieved. An advantage of the N strategy is that no further time investment is required, on the part of either ADL or Stanford, should other libraries decide to participate.
Finally, we note that ADL is currently investigating the use of a common language for gazetteer-type information. This has significant, although limited, implications for catalog interoperability, since significant efficiencies result if a standard language and a standard implementation are used to represent specific classes of meta-information.
The goal of this investigation is to establish a standard approach for representing gazetteers in terms of the language of FGDC. In this context, the gazetteer entries are analogous to the information object (or geographic dataset) for which the FGDC standard was originally developed. This specification will not result in incorporating gazetteer feature information into meta-information for information objects, but is simply a way of standardizing the representation of the information in the gazetteer. Many of the suggested fields are based on fields already included in the FGDC standard. For example, contributor information is required meta-information for geographic datasets in FGDC.
Given that this specification does not further integrate the gazetteer with other meta-information, it does not change the function of the gazetteer with respect to the query modeling, matching, or document modeling components of the ADL model. The decision of whether a gazetteer is maintained separately from the other meta-information is independent of the logical model used to represent the gazetteer.
The FGDC specification for the gazetteer provides a formal system for representing the gazetteer that can be the basis of transferring and translating query requests across libraries in the same way that the FGDC standard can be used as the basis for query interoperability for other information objects.