Report on a Workshop on Metadata
held in Santa Barbara, California
November 8, 1995
Michael F. Goodchild
University of California
Santa Barbara, CA 93106-4060
BACKGROUND
The Alexandria Digital Library (ADL) project was established in 1994, with funding for four years as one of the six projects of the NSF/NASA/ARPA Digital Library Initiative (DLI). Participants in ADL include the Map and Imagery Laboratory, Departments of Computer Science, Computer and Electrical Engineering, and Geography, and the National Center for Geographic Information and Analysis at the University of California, Santa Barbara; NCGIA sites at the State University of New York at Buffalo and the University of Maine; and several corporations, libraries, and agencies. The primary goal of ADL is to design, implement, and deploy a digital library for spatially-indexed information. A digital library supporting such materials is needed because spatially-indexed information is an extremely valuable resource in many applications but is currently costly to access. Many important collections of such information, such as maps, photographs, atlases, and gazetteers, are currently stored only in non-digital form, and collections of considerable size and diversity are found only in the largest research libraries. Although a growing amount of such information is available in digital form, it is still inaccessible to most individuals. ADL will provide a framework for putting these collections online, providing search and access to these collections to broad classes of users, and allowing both collections and users to be distributed throughout the Internet.
Further details of ADL can be found under the project home page, http://alexandria.sdc.ucsb.edu. A Rapid Prototype was completed in March 1995 as a stand-alone example of a digital library for spatially-referenced objects; it is available on CD-ROM. A prototype World Wide Web implementation was demonstrated in November 1995 and is currently undergoing further development prior to public access.
In the terminology of ADL, an information object is an item in a digital collection, the digital library analog of a book or a journal article. There are three types of objects of importance to ADL: spatial objects, such as maps, images, or photographs, which contain representations of the variation of phenomena across the area of space covered by the object; spatially referenced objects, such as books, pieces of music, or video, that can be positioned in space, and for which spatial reference is a useful basis for retrieval; and general objects that lack spatial reference. The primary spatial frame of reference of ADL is the surface of the Earth (the geographic frame), but the same principles can be extended to other spaces, such as the surfaces of planets, the layout of a building, or the human body.
In the traditional library, the card catalog has provided the primary mechanism for search and information discovery, although browsing of shelves and stacks is also sometimes an important process. Development of digital catalogs enabled processes of automated search, including search by author name, and search of titles for key words. In the digital library this process can be generalized further. If all information objects are digital, search need no longer be confined to the contents of the digital catalog, but can be extended to search of the contents of the objects themselves. It is possible to envision processes of automated cataloging, based on computer analysis of each object's contents. At the same time, information about objects in the digital library must be extended to include items needed to ensure successful retrieval and handling of the object for which there may be no analog in the traditional library.
The term commonly used to refer to such information about digital objects is metadata. Numerous questions about the usefulness of such a term, the purpose of metadata and the processes supported by it led ADL to propose a one-day workshop on metadata to be held immediately prior to the November 1995 meeting of the six DLI projects, their partners and sponsors. The workshop would focus discussion on a topic of interest to all of the DLI projects, while providing useful input to ADL. We invited roughly 30 participants, drawn from the library, digital library, and geographic data communities. A report of the workshop was made to the full DLI meeting the following day, and an article on the workshop's major themes and conclusions appeared in the Chronicle of Higher Education (*).
The workshop had three broad objectives:
* Provide an opportunity for a focussed discussion of metadata among the DLI projects, with particular emphasis on the issues raised by digital libraries that have no analogs in traditional libraries.
* Elicit clear definitions of metadata and its component parts, and its relationships to information granularity, that are applicable over the full range of the DLI projects.
* Examine whether a uniform, comprehensive approach to metadata is possible that spans all information types; and if not, identify domains of information and associated approaches to metadata definition.
In addition, efforts were made to embed the workshop discussion within the general theme of the larger DLI meeting, namely interoperability between the six DLI projects.
The workshop was chaired by Michael Goodchild, Associate Director of ADL. It opened with a presentation by Terence Smith, Director of ADL, which included a discussion of the use of R-Structures, the fundamental primitive of the Computational Modeling System (CMS; *) as a top-down approach to metadata. The remainder of the morning was taken up by general discussion. The afternoon included a presentation by Michael Goodchild on issues of spatial metadata; an informal presentation by Stuart Weibel of the Online Computer Library Center Inc. (OCLC) on the Dublin Core approach; open discussion; and smaller breakout discussions.
This report provides a summary of the workshop discussion. The following section pulls together the various points made during the workshop on the nature of metadata, associated requirements, and the processes and services it must be designed to support. The third section examines underlying trends, particularly in digital catalogs and digital libraries, that are affecting the nature of metadata. The final section reviews the various solutions proposed during the workshop, and the tensions identified during the day's discussions, and revisits the workshop's objectives. The appendices include a list of participants, and the short papers written prior to the workshop by several of the participants.
WHAT IS METADATA?
Requirements
The analogy to the traditional card catalog provides an obvious basis for defining metadata, but one that fails to take into account the opportunities and problems associated with a digital library. First, the traditional card catalog is built on a rigid structure of information granularity, since all information objects in the traditional library are individually bound books. In the case of serials, the article provides an information granularity below the level of the bound volume, but it is nevertheless equally rigid. In the digital world, granularity of information may take on entirely new and unfamiliar forms that are no longer linked to the granularity established by the author or publisher. This is especially true for spatial objects: one might want to merge individual maps into a seamless view of the Earth's surface, or to separate the several layers of information shown on a single map into several separate objects.
Second, as noted earlier, the digital world provides the opportunity to blur the distinction between data and metadata. Instead of relying entirely on the prescience of the cataloger, a digital library might allow direct search of the contents of an information object, particularly for domain-specific information or minor detail that a cataloger might normally ignore.
In his opening presentation, Terence Smith drew attention to a definition of metadata devised by Francis Bretherton: metadata is "information that makes data useful". Such metadata might include three subsets: information necessary for the successful handling and use of the data, such as its format or location; information that generalizes or abstracts the data to identify salient characteristics of its content, and thus supports the functions of search and browse; and information typically not available in the content that affects usefulness, such as information on quality. In the digital library, the first and third are in principle content-independent, not obtainable from the content; the second is content-dependent, but often the generalization and abstraction function is sufficiently complex that it cannot be automated.
Bretherton's definition emphasizes the view that metadata is not solely an invariant characteristic of the object, but also must adapt to the characteristics of the user. The assessment of an object's "fitness for use" must reflect both the object and the user, and may be unique to the combination.
Smith proposed eleven requirements of metadata for digital libraries:
* Supports content-based search.
* Supports user-centered views.
* Provides near-transparent access to traditional as well as digital libraries.
* Communicates its semantics to other catalogs in support of catalog interoperability.
* Models information in a common conceptual framework.
* Supports different levels of generalization and abstraction.
* Possesses the expressive power to permit representation of the structure, content, and context of objects.
* Is easily extensible.
* Its conceptual framework supports the creation of new concepts for objects and their contents.
* Supports concept development by users.
* Possesses sufficient computational efficiency to support interoperability.
A disciplined, requirements-based approach would make interoperability of digital library catalogs much easier to achieve.
Several other requirements were suggested. Maria Zemankova (NSF) proposed that automated generation of metadata be a requirement, given its importance to digital libraries. While some elements of metadata might be derived automatically through analysis of content, or inserted by the software that creates the object, it is clear that this cannot apply to all elements, such as those related to successful handling of the object.
A simple model of metadata
Jim Frew (Alexandria Digital Library, UC Santa Barbara) proposed a three-part model of metadata as a basis for discussion. The contents of digital metadata should address three dimensions: structure, defined as the information needed to understand the arrangement of bits and bytes in the object; context, defined as the meaning of the bits and bytes in more abstract, conceptual terms, and the general properties of the object that describe its relationship to the broad domain of other objects; and content, defined as the specific detail contained in the object. For example, a remote sensing scene might be classified as follows:
* structure: the file format;
* context: AVHRR image (ID) collected on (date, time);
* content: has 80% cloud cover, shows Santa Barbara County and the Marre fire.
To find the image, we normally rely on structure and context; to evaluate the image both content and context are needed; and to use the image we require structure and content.
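The three-part model and the task-to-dimension pairing just described can be sketched as a small data structure. This is only an illustration, not something presented at the workshop: the class name `Metadata`, the helper `view_for`, and the sample field values (including the file format string) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    structure: dict  # arrangement of bits and bytes, e.g. the file format
    context: dict    # what the object is and how it relates to other objects
    content: dict    # specific detail contained in the object


# Which dimensions each task draws on, as stated in the text.
TASK_DIMENSIONS = {
    "find": ("structure", "context"),
    "evaluate": ("content", "context"),
    "use": ("structure", "content"),
}


def view_for(task: str, md: Metadata) -> dict:
    """Return only the metadata dimensions that the given task relies on."""
    return {dim: getattr(md, dim) for dim in TASK_DIMENSIONS[task]}


# The remote sensing scene from the example; placeholder values are assumptions.
scene = Metadata(
    structure={"file_format": "raw raster"},
    context={"sensor": "AVHRR", "image_id": "ID", "acquired": "date, time"},
    content={"cloud_cover": 0.80,
             "shows": ["Santa Barbara County", "Marre fire"]},
)
```

Calling `view_for("evaluate", scene)` would return the content and context parts only, which is one simple way a system might tailor what it presents at each stage.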
An important issue arises over content in cataloging imagery such as this. Should the content include information on cloud cover only if that information has already been computed and stored explicitly, or should it be included because it is possible for any user to compute cloud cover in a reasonably unambiguous way? This is a very important issue for imagery, because the possible interpretations that can be placed on images, either by computation or by human interpretation, are often very complex and sophisticated, and yet essential to the data's effective use.
Metadata for whom?
One of the most persuasive arguments for digital libraries is their potential to increase access to library contents. Today, research libraries are largely confined to university campuses, with a few notable exceptions, and designed primarily to meet the needs of students and faculty. A high level of user expertise is assumed, in understanding the basis of cataloging, navigating the various components of the library system, and conducting productive search.
Digital libraries will be available through wide area networks, which will offer the potential for extended access both geographically and to new groups of users. Unlike traditional catalogs, digital technology can be exploited to provide customized user interfaces that react to the skills and expertise of different classes of users. Thus the interface offered to a politician might be very different from that presented to a 6-year-old student.
In practice, libraries accommodate special users with different levels of expertise by providing human assistants. The process of finding a book becomes a conversation between user and assistant, in which both search for a common language. From this perspective, metadata in the digital library might also be seen as a dialog, as the system and user search for a level of description that is meaningful to both.
Metadata domains
Several distinct communities were represented at the workshop, including librarians, GIS specialists, and specialists in digital libraries. Each community's views of metadata issues are to some degree colored by the community's traditions, and legacies of earlier technologies. Thus GIS specialists tend to see metadata as an extension of data set documentation, and librarians may view metadata as an extension of conventional catalogs. Each community has a tendency to see metadata as a problem the community must solve, and recognizes a tension between seeking a solution for the community, and linking its discussions to those of other communities. Unfortunately, the dividing lines between communities may not reflect the optimum divisions for development of metadata.
The GIS community, for example, focuses on spatial objects, and places great importance on geographic elements of metadata such as map projection, spatial resolution, and the geographic extent of the data. To the library community, with its emphasis on bound volumes of text, geographic extent is a comparatively minor element and defined only in limited cases. Are the problems of cataloging musical scores sufficiently unique to justify a self-contained domain of metadata development, or should they be integrated with other types of information objects in the digital library?
The workshop participants agreed that traditional cataloging domains have emphasized the physical nature of information objects, often their physical media (video, music, maps, photographs) but that media are changing rapidly as we transition to a digital world. Various niches will emerge in which metadata efforts are comparatively successful, and issues of compatibility will then have to be addressed. Future domains of metadata may be defined by access conditions--user-pay access may support more extensive metadata than open access, for example. For now, discussion of metadata within the GIS community makes sense, but it is important to link it as much as possible to broader discussions.
Abstract models of metadata
Content-dependent metadata can be viewed as a set of interpretations supporting search--as mappings from elements of the structure, content, and context of objects that are analogous to the traditional subject, author, and title. Some interpretations may be precomputed, but other elements of metadata might correspond to procedures applied on-the-fly, such as methods of pattern recognition applied to a collection of images by a user searching for objects containing particular features. Content-dependent metadata is often domain-independent, but domain-dependent metadata may also be important in particular applications.
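The distinction between precomputed interpretations and procedures applied on the fly can be illustrated with a toy catalog entry. This is a sketch under stated assumptions: the `CatalogEntry` class, the brightness-threshold rule in `cloud_fraction`, and the representation of an image as a non-empty grid of integers are all hypothetical simplifications, not a real cloud-detection method.

```python
from typing import Callable, Dict, List, Optional

Image = List[List[int]]  # hypothetical: 2-D grid of brightness values, 0-255


def cloud_fraction(pixels: Image, threshold: int = 200) -> float:
    """Toy interpretation: fraction of pixels brighter than a threshold."""
    flat = [value for row in pixels for value in row]
    return sum(value > threshold for value in flat) / len(flat)


class CatalogEntry:
    """An object plus its stored and computable interpretations."""

    def __init__(self, pixels: Image,
                 precomputed: Optional[Dict[str, float]] = None) -> None:
        self.pixels = pixels
        self.precomputed = dict(precomputed or {})
        self.procedures: Dict[str, Callable[[Image], float]] = {
            "cloud_fraction": cloud_fraction,
        }

    def interpret(self, name: str) -> float:
        # Prefer a stored value; otherwise compute it from the content.
        if name in self.precomputed:
            return self.precomputed[name]
        return self.procedures[name](self.pixels)
```

An entry built with `precomputed={"cloud_fraction": 0.8}` answers from the stored value, while an entry without it falls back to running the procedure on the pixels.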
Smith emphasized the significance of concatenated mappings. An aerial photograph might be abstracted to a metadata footprint, indicating the area of coverage; a footprint might map to a universal object ID. By concatenating the two mappings, we can identify an object ID with the photograph's contents. Such mappings work in both directions. In this context metadata might be defined as "admissible concatenations of interpretations starting or ending with an object ID". If mappings are modular, it is easy to add new mappings, and thus the approach is extensible.
For example, the traditional card catalog can be seen as a set of concatenated mappings from the contents of a shelved volume, through the author, title, subject, and assorted annotations of the catalog card, to the unique object ID provided by the book's ISBN or its Library of Congress call number.
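Concatenated mappings of this kind can be sketched as ordinary function composition. The lookup tables, the photograph name, the bounding box, and the object ID below are hypothetical, and inverting a table this way assumes each mapping is one-to-one, which need not hold in a real catalog.

```python
from typing import Callable, Dict, Hashable, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)

# Hypothetical lookup tables standing in for two interpretations.
PHOTO_TO_FOOTPRINT: Dict[str, BBox] = {
    "aerial_photo_A": (-120.0, 34.3, -119.5, 34.6),
}
FOOTPRINT_TO_ID: Dict[BBox, str] = {
    (-120.0, 34.3, -119.5, 34.6): "OBJ-0001",
}


def compose(*mappings: Callable) -> Callable:
    """Concatenate mappings: apply each one in turn, left to right."""
    def composed(value):
        for mapping in mappings:
            value = mapping(value)
        return value
    return composed


def invert(table: Dict[Hashable, Hashable]) -> Dict[Hashable, Hashable]:
    """Reverse a mapping (assumes it is one-to-one)."""
    return {value: key for key, value in table.items()}


# Photograph -> footprint -> object ID, and the reverse direction.
photo_to_id = compose(PHOTO_TO_FOOTPRINT.__getitem__,
                      FOOTPRINT_TO_ID.__getitem__)
id_to_photo = compose(invert(FOOTPRINT_TO_ID).__getitem__,
                      invert(PHOTO_TO_FOOTPRINT).__getitem__)
```

Because `compose` accepts any number of mappings, adding a new interpretation only requires supplying one more function, which reflects the modularity argument made in the text.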
This conceptual section of Smith's presentation ended with a proposed framework for metadata. The librarian constructs the mappings that constitute metadata, while the user is able to examine a particular view of metadata which may depend on the user's domain of interest, or level of expertise. The framework has similarities to the distinction between logical and conceptual levels of data modeling. A satisfactory metadata framework must be able to represent any concept that may be used to describe any aspect of an information object's structure, content, or context.
Central to the proposed framework is the formal concept of an R-structure, or representational structure, a very general formalization that can be applied to any form of human knowledge. At the core of an R-structure are three concepts. First, the domain defines the structure's syntax; for example, the domain of the structure polygon is an ordered sequence of coordinate pairs representing the polygon's vertices. Second, transformations define the operations that are valid on a given domain, and thus give meaning to its elements. For example, the transformations for the domain "polygon" include "area", "centroid", and "circumference". Finally, instances of a domain define actual polygons. R-structures readily accommodate hierarchies, and thus the different levels of abstraction that are inherent in metadata as defined earlier.
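The polygon example can be made concrete with a short sketch. This is not the CMS formalism itself: the dictionary `R_POLYGON` is a hypothetical stand-in for an R-structure, and the functions assume a simple (non-self-intersecting) polygon with non-zero area given in planar coordinates.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # the domain: an ordered sequence of vertex coordinates


def _edges(poly: Polygon):
    """Yield consecutive vertex pairs, closing the ring at the end."""
    n = len(poly)
    for i in range(n):
        yield poly[i], poly[(i + 1) % n]


def area(poly: Polygon) -> float:
    """Shoelace formula; the absolute value ignores vertex orientation."""
    twice_signed = sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in _edges(poly))
    return abs(twice_signed) / 2.0


def circumference(poly: Polygon) -> float:
    """Sum of edge lengths (the polygon's perimeter)."""
    return sum(hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in _edges(poly))


def centroid(poly: Polygon) -> Point:
    """Area-weighted centroid; undefined for zero-area polygons."""
    twice_signed = cx = cy = 0.0
    for (x0, y0), (x1, y1) in _edges(poly):
        cross = x0 * y1 - x1 * y0
        twice_signed += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * twice_signed), cy / (3.0 * twice_signed))


# Hypothetical R-structure: a domain description plus its valid transformations.
R_POLYGON = {
    "domain": "ordered sequence of coordinate pairs",
    "transformations": {
        "area": area,
        "centroid": centroid,
        "circumference": circumference,
    },
}

# An instance of the domain.
unit_square: Polygon = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Looking up a transformation by name, as in `R_POLYGON["transformations"]["area"](unit_square)`, mirrors the idea that the transformations are what give meaning to elements of the domain.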
A catalog can be viewed as a large set of R-structures, representing the structure, content, and context of each object. R-structures can readily accommodate both digital and non-digital objects. Finally, the catalog itself can be viewed as a single R-structure, with associated admissible transformations.
Smith showed how it is possible to place the answering of user queries within this conceptual framework. Each query is first translated from a forms-based or natural language interface into available R-structures. In some cases the match to R-structures will be straightforward, but in other cases there may be a range of interpretations to be resolved. A team of librarians might focus on one domain of applications, define classes of queries, design associated R-structures, and design appropriate graphical interfaces.
Several points were raised in response to Smith's presentation. Did the framework of R-structures require objects to have unique IDs, and thus fixed granularity? Did it assume the existence of some group who would determine the admissibility of a representation? How would it be possible to achieve agreement on the definition and meaning of a concept? Do query languages exist to handle the R-structure framework?
While R-structures provide a consistent framework, some of the workshop participants felt that the process of library search was inherently unstructured, and that the important task was therefore not to seek ways of forcing it into a structure in the digital world, but of supporting its essential fuzziness. For example, the concept "urban blight" is not well defined. But enough agreement exists on its meaning between producers and users of information to make it a useful concept. In the process of searching for information the librarian is in a sense an intermediary or broker, helping to identify terms that are sufficiently well defined to support useful search, but not attempting to synthesize more rigorous definitions or to impose them on producers or users. The effort to find a unified view through R-structures was laudable, but was it practical in the near term? Was a theoretical framework also needed for the less structured aspects of library search?
UNDERLYING TRENDS
The six DLI projects are being conducted in a context of very rapid technological change. Indeed, the anticipation of continued rapid change is implicit in DLI, since this research will only lead to practical application in real digital libraries if computing power, storage, and communication bandwidths continue to improve at least at the currently observed rates.
In his opening presentation Terence Smith drew attention to many of the underlying trends that are driving digital library development, and that provide context for this discussion of metadata:
* The volume of data available in digital form continues to increase rapidly. In the specific area of digital geographic data, we anticipate that terabyte databases will become common when the new generation of earth observation satellites comes on line in the next few years.
* Search based on content is becoming increasingly feasible due to continued increases in computing power and the potential to parallelize many algorithms.
* The volume of metadata and catalog information is growing rapidly, thanks in part to the development of tools for automated metadata extraction, which in turn are driving down the costs of cataloging. The initiative taken by the Federal Geographic Data Committee (FGDC) in promoting its Content Standards for Digital Geospatial Metadata (http://GeoChange.er.USGS.gov/pub/tools/metadata/standard/metadata.html) has also stimulated much production of metadata, particularly among federal agencies, as has the discussion of ISO Technical Committee 287.
* New software technologies are creating better opportunities for the assessment of information, mark-up and annotation, informal linkage of information, and capturing of user views.
* There is a proliferation in the number of formats and standards for metadata, and an increase in their complexity, making interoperability a greater imperative.
In addition to these, discussion during the workshop identified several other trends:
* A general expansion is occurring in the role of libraries. Expectations regarding access are rising; libraries are being expected to provide access to new types of information, in new forms; and they are being asked to take on the role formerly assumed by data centers. But this expansion is occurring at a time of static or diminishing resources. More and more information is being accessed without the intermediation of library staff.
* New technology is making it possible for anyone to be a publisher. This is making the process of collection-building ever more complex and challenging.
* The development of technologies for continuous monitoring, real-time data acquisition, and rapid electronic publication is putting great pressure on traditional library functions.
This analysis of trends led the workshop to consider the general question: "has the problem changed as a result of the transition to digital libraries?" On the one hand, the view was expressed that the digital world extended library services to a broader cross-section of people. It opened the question of granularity, since it was no longer possible to sustain a simple, uniform concept of information object and a one-to-one correspondence between object and catalog record. Finally, it raised issues that had not been solved for geospatial data in the pre-digital era, such as how to catalog maps and images consistently.
If fundamental changes are indeed occurring, then the group should consider how best to move quickly to address them. A Darwinist metaphor was suggested--efforts should be made to stimulate change, and to accommodate new ideas, while the community as a whole would determine which was the most successful. The information communities that appeared to have made the digital transition most easily--medicine, genetics, law--perhaps did so because the community already possessed a shared and to a large extent controlled vocabulary. Would recent efforts at formalizing geographic data models, and notably those of the Open GIS Consortium (http://ogis.org), succeed in creating a more controlled vocabulary for geospatial data, how long would this process take, and how could it be speeded?
On the other hand, several workshop participants felt that all of the issues being raised in the context of metadata for digital libraries were familiar concerns of the cataloging community. While certain emphases might have changed, there were no issues of metadata that the cataloging community was not equipped to handle by adapting its methods appropriately. The operational characteristics of digital libraries were different from those of conventional libraries, but the functional characteristics were essentially the same.
Scientific data sharing
There is a high level of interest in digital library research among the scientific community, driven by the changing nature of science and the increasing need to share data, particularly in the environmental sciences. Science is increasingly multidisciplinary, as we come to recognize that many of the harder problems associated with understanding and managing the earth's environment require study of the linkages between many interacting processes. It is also increasingly data-dependent, because studies at global scales can involve literally terabytes of data. Finally, because of the expense of data collection, particularly satellite-based observation, it is essential that the value of data be realized through its use in as many ways as possible. Successful sharing of data between investigators and across disciplines requires effective approaches to data description and cataloging, and thus to metadata.
On the other hand, the needs of the scientific community for data sharing do not provide a perfect fit to the library metaphor. Metadata is more likely to be defined by the data's producer than by a specialist librarian; as noted earlier, it is likely to serve many purposes in addition to those traditionally associated with library cataloging; it must accommodate new technologies, such as GIS and image processing; its standards are likely to be defined in a bottom-up fashion by the user community rather than top-down by the library community; and there is likely to be heavy emphasis on the use of metadata for search.
Clifford Lynch (UC Office of the President) drew attention to the widespread call for interoperability, and noted its implications for the concept of metadata. An important function of metadata is to permit sharing of scientific information between potentially incompatible computing systems. Metadata plays an essential function in the drive to make systems interoperate, by providing a digital description of the information object to the receiving system. But various levels of interoperability can be defined, depending on the level to which the receiving system understands the incoming object's contents. In a GIS context, interoperability can mean no more than the ability to transfer the geometry of points, lines, and areas; a higher level of sophistication is needed to transfer the attributes of such objects, and the relationships between them. However, full interoperability will only be achieved if the meaning of the object's contents is clear to the receiving system. The Open GIS Consortium (http://ogis.org) is directing its efforts to the achievement of full semantic interoperability in GIS.
The Dublin Core
Stuart Weibel (OCLC) presented a brief review of the outcome of a workshop sponsored in March 1995 by OCLC and the National Center for Supercomputing Applications (NCSA) to attempt to determine essential elements for information object description in a networked environment (http://www.oclc.org:5046/oclc/research/conferences/metadata/metadata.html). Thirteen elements were identified that have since become known as the Dublin Core. A second workshop was held in April 1996 in the UK.
Several ground rules were established for the Dublin Core. First, it should emphasize the elements that authors would assign. Its domain was defined as objects that are "document-like", though not necessarily thought of as documents in the traditional sense. It should emphasize the resource discovery objective of metadata, and not address other issues such as object handling. Finally, it should avoid concern for the detailed syntax of description.
The thirteen elements of the Dublin Core are all optional, and the Core is extensible to meet the USMARC, FGDC, and other metadata element sets. The elements include subject, title, author, publisher, other agents, date, object type, form, identifier, relation to other object(s), source, coverage, and language. Coverage was defined to include geographic extent, but the Core does not include level of geographic detail or its surrogates (e.g., scale) in the basic set.
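A record with thirteen optional elements might be sketched as follows. The class and field names are hypothetical renderings of the element names, and every value is treated as free text because the Core deliberately avoids prescribing a detailed syntax; this is not an official encoding.

```python
from dataclasses import asdict, dataclass, fields
from typing import Dict, Optional


@dataclass
class DublinCoreRecord:
    # All thirteen elements are optional, so each defaults to None.
    subject: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    publisher: Optional[str] = None
    other_agents: Optional[str] = None
    date: Optional[str] = None
    object_type: Optional[str] = None
    form: Optional[str] = None
    identifier: Optional[str] = None
    relation: Optional[str] = None   # relation to other object(s)
    source: Optional[str] = None
    coverage: Optional[str] = None   # includes geographic extent
    language: Optional[str] = None


def populated(record: DublinCoreRecord) -> Dict[str, str]:
    """Return only the elements that have actually been assigned."""
    return {k: v for k, v in asdict(record).items() if v is not None}


# Hypothetical record describing a map; only three elements are filled in.
example = DublinCoreRecord(
    title="Santa Barbara County",
    object_type="map",
    coverage="Santa Barbara County, California",
)
```

Leaving unassigned elements as `None` keeps a sparse record valid, and a community needing more detail could extend the class with its own fields.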
SYNTHESIS
Several broad themes emerged from the discussion, and were used as the basis for break-out discussions in the afternoon of the workshop. The conclusions of each group are briefly summarized below. In addition, a presentation on the workshop's conclusions was made to the Digital Library Initiative plenary session the following day, and that presentation is also summarized here.
Fundamental changes
The workshop concluded that digital libraries present no fundamentally new problems of cataloging and information access; the issues they raise have all been of concern to the cataloging community in the past. On the other hand, it was agreed that certain changes were of sufficient importance to require examination and research, since they would affect the nature of metadata in the digital library. There was consensus that changes such as the following would have a significant impact:
* The emergence of new information types, and new packaging for old information types. The transition to digital information removes many of the barriers that previously defined cataloging domains, and opens new opportunities for integration of data from multiple media, and data that have not been part of the traditional mainstream.
* The need to merge or fuse data from different sources. In the digital world we are likely to be confronted frequently with multiple sources of the same or similar information. In GIS, for example, it will frequently be useful to be able to integrate data from different scenes, taken at different times, with different levels of resolution. In this and other ways the digital library is likely to offer the prospect of integration of data--an opportunity that did not exist in traditional libraries. We do not yet understand the implications of this potential for data description.
* The ability to transform, analyze, and view data in different ways. As with the previous point, digital libraries open opportunities for presenting information objects in many ways, and for transforming content to meet various objectives. Effective metadata must anticipate these opportunities, which are much more extensive than existed for traditional libraries.
Participants also addressed the question "is metadata different from data?"--in other words, will the distinction between information object and catalog record which has worked so well in the traditional library survive the digital transition? In the short term the answer is clearly "yes", if only because early users of digital libraries will need to find their way through these systems using familiar landmarks. But in the longer term, the answer appears to be "no". Metadata is an abstraction of data, and abstraction can be seen as a continuum ranging from the object itself to its ID; the appropriate point along this continuum will depend on the expertise of the user, and other aspects of search and browse. If we follow the card catalog metaphor, metadata should be served separately from data; but the opportunity for content-based search may render this distinction obsolete. In the traditional library the card catalog and the shelves have often been serviced by different staffs, and the production roles are distinct in other ways as well, but this may no longer be valid in the digital world.
Domains
The group concluded that fundamental changes are occurring in the social and economic context within which libraries operate. While it is easy to see the transition to digital technology as the overriding source of change, in reality this transition is embedded within much more sweeping changes occurring in an increasingly information-based society. Information is now a major basis for economies, and a major source of wealth; and the factors that determine the value of information, and the addition of value, are still far from clear. Digital libraries should be proactive in helping to clarify the nature of the emerging information society.
In this environment the creation of communities of interest is very dynamic. Communities are defined by common languages, common problems, and shared knowledge. The emerging domains of digital library metadata are often artifacts of prior technologies, and are often constrained by the physical locations of repositories. Thus, for example, metadata for video may emerge from large video repositories, and the same may be true for maps. Other domains may reflect shared levels of expertise--vocabularies and standards may emerge from the geospatial data production community (e.g., the FGDC standard), or from particular disciplines in the scientific user community, or from distinct groups in the education community.
Adding to the challenge of metadata definition are the uncertainties over different access scenarios. As commercial applications of the Internet develop, metadata may have to accommodate variations in intellectual property rights, security, payment for use, and addition of value.
Users of traditional libraries are typically looking for "a book on..." Users of digital libraries expect to be able to look for "information on...", and to be able to process and analyze the information as well. This opens a host of new opportunities for librarians and computers to add value to information by establishing linkages and cross-references, by automatic metadata generation, and through more sophisticated concepts of "data mining" and pattern recognition. In the long run, the issues of concern regarding metadata, and the associated domains, are thus likely to be very different from those we perceive today as we struggle to replicate traditional ways of doing things in digital worlds, and to begin to take advantage of the opportunities they offer.
Interoperability
The third break-out group discussed issues of interoperability. Support for interoperability in a client-server environment requires a minimum set of metadata elements that provide sufficient information to allow the client to input information from the server. At a more sophisticated level, the client might have access to a richer description of the data including semantics. At the most sophisticated level, the client would be able to provide multiple views of the data.
Interoperability should be interpreted through the services supported--at minimum, a digital library context requires support for search, and for response by the client. In a six month timeframe, it should be possible to agree on a framework for discussion of interoperability, a small set of metadata elements, and the return of certain results. But it is very difficult to speculate on what interoperability might mean in a longer timeframe.
The group felt that the Dublin Core represented a useful first step, and its extensibility is an attractive feature allowing it to grow incrementally. It is also important to note that digital environments provide the option of iterative exchange--an initial transfer that is incomplete or ambiguous can always be followed by a request for more detailed information.
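The iterative exchange described above can be sketched as a client that first asks for a small core record and requests a richer description only when the core lacks a field it needs. This is a minimal illustration, not a protocol from the workshop: the in-memory CATALOG, the "core" and "detail" levels, and the field names are all assumptions made for the example.

```python
# Hypothetical server-side holdings: a small core record plus a richer description.
CATALOG = {
    "obj-1": {
        "core": {"title": "Coastal air photo", "format": "TIFF"},
        "detail": {"projection": "UTM zone 11", "scale": 24000},
    }
}

def fetch(object_id: str, level: str = "core") -> dict:
    """Stand-in for a request to the server; returns a copy of one metadata level."""
    return dict(CATALOG[object_id][level])

def describe(object_id: str, needed: list) -> dict:
    """Start with the core record; follow up for detail only if fields are missing."""
    record = fetch(object_id)
    if not set(needed) <= record.keys():
        record.update(fetch(object_id, "detail"))
    return record
```

A client that needs only the title stops after the first transfer; one that needs the projection triggers the follow-up request.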
Report to the DLI meeting
Michael Goodchild made a short presentation to the main Digital Library Initiative meeting on the outcomes of the workshop. He began by reviewing the approach to metadata currently being taken by the Alexandria Digital Library (ADL), as an example of digital library approaches.
The ADL approach includes approximately 30 elements, selected from the much more elaborate FGDC standard. The elements include the two latitudes and two longitudes of a bounding rectangle defining the object's extent, its data, format, URL, etc. The elements are expressed in a format that is compatible with USMARC, the digital catalog exchange standard of the US library community. Various experiments in interoperability are under way between this ADL design and other comparable projects, primarily involving cross-walks between the appropriate pairs of fields. To date, such experiments have been limited to other projects in which geospatial data is dominant, but ADL is anxious to find ways of achieving interoperability with other catalogs, and to embed its efforts within larger contexts.
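As a rough illustration of the two ideas in this paragraph, the sketch below shows a bounding rectangle supporting a geographic overlap test, and a cross-walk that renames fields for another catalog. It does not reproduce the actual ADL elements or any USMARC encoding: the field names, the CROSSWALK table, and the example URL are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    """Extent given by two latitudes and two longitudes, in decimal degrees."""
    south: float
    north: float
    west: float
    east: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Rectangles overlap when their ranges overlap on both axes.
        # Simplification: boxes that cross the 180-degree meridian are not handled.
        return (self.south <= other.north and other.south <= self.north
                and self.west <= other.east and other.west <= self.east)

# Hypothetical cross-walk: local field name -> field name in another catalog.
CROSSWALK = {"title": "Title", "format": "Format", "url": "Identifier"}

def crosswalk(record: dict, mapping: dict) -> dict:
    """Rename fields that have a counterpart; fields without one are dropped."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

record = {
    "title": "Santa Barbara quadrangle",
    "format": "TIFF",
    "url": "http://example.org/sb.tif",
    "extent": BoundingBox(34.25, 34.5, -119.875, -119.625),
}
query = BoundingBox(34.0, 34.3, -120.0, -119.7)
```

Note that the extent has no counterpart in the hypothetical target catalog and is lost in the cross-walk, which is the kind of gap that arises when the other catalog is not geospatial.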
Goodchild then reviewed the six questions posed to the workshop participants, and summarized the consensus on each:
* Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
The Dublin Core identifies a lowest common denominator of metadata elements. From the perspective of ADL, however, and geospatial data libraries in general, it is necessary to extend the Core to include a minimal set of elements needed to support geographic search.
* Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
Several working definitions were discussed at the workshop, each with merits. R-structures provide a robust and theoretically sound framework, including the essential element of a hierarchy of abstraction levels.
* If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
The consensus of the group was that domains exist in part as legacies of earlier technologies; that sound bases for domains can be identified in the short to medium term, but that in the long term digital libraries offer the opportunity to integrate rather than to segment information.
* How do the information objects of special interest to ADL fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
The workshop concluded that this issue is best addressed by building a convergence between the current ADL metadata implementation and the Dublin Core.
* Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that may help bridge the distinctions between classes of information objects?
Yes, in the long term metadata in digital libraries is likely to bear little resemblance to the traditional card catalog, since it will evolve to support a hierarchy of different levels of abstraction, and different degrees of expertise on the part of the user. Moreover, the traditional concept of the map sheet or image scene, which currently dominates the granularity of information in GIS databases, is likely to be replaced by geographical seamlessness.
* What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
The group felt that integration with efforts such as the Dublin Core provided excellent mechanisms for reaching the broader community.
Tensions
Goodchild ended by describing three fundamental tensions that emerged from the workshop discussion and that appear in various forms in the earlier sections of this report.
1. The frameworks to address these issues already exist in the cataloging community.
vs
Fundamental changes are on the way that will require new approaches.
2. For geospatial data, new concepts of information granularity, the blurring of traditional domain boundaries, and the need to deal with new classes of use and users will require novel approaches.
vs
Numerous niches are being exploited; their survival will be determined.
3. Interoperability requires a more structured approach, and a finer level of granularity, than currently exists, in which library, data producer, and data user share a rigorous vocabulary.
vs
We must devise limited objectives that reflect the inherently fuzzy nature of library search and resource discovery.
Position Papers
Each participant was invited to prepare a short position paper in advance of the workshop. The submitted papers are reproduced below.
We propose to structure the day in two parts, around two sets of questions. The morning will be opened by Terry Smith, Director of ADL, and will focus on the following questions:
1. Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
I believe this is difficult at best. I think of meta-data as contextual information which helps define and find information. There can be clear instances of meta-data as in parameters which support query optimization for document retrieval, but very often the distinction is based upon a user's perspective. Spatial characteristics may be meta-data or data. These distinctions will also change over time.
2. Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
I believe that this problem is approachable only within classes of data objects and defined access methods. See below. But even so, in our experience, we are processing document information to be accessed with a variety of interfaces and each requires distinctive metadata. And more and more access methods are coming.
3. If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
I do think that this is the correct approach--that metadata can be usefully defined for particular contexts of information and access methods. This is almost a matrix but I'm not sure what's in the axes.
4. How do the information objects of special interest to Alexandria fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
As in the above, there is a continuum of spatially indexed objects to spatial data (e.g., coordinates describing the project boundary of an EIR vs. the spatial data upon which the report's conclusions are based). But also spatial qualities have a similar structure which argues for a common approach. e.g., all spatial data require definition of coordinate and projection information; also many spatial objects have a structure which would generate a common schema, e.g., all air photo schemas will relate flight line, photo, photo center coordinates, altitude/scale, lens parameters, etc.
5. Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that might help bridge the distinctions between classes of information objects?
Yes. Meta-data and access methods are linked. In particular some approaches atomize spatial data in a very structured way (OGIS), which requires peculiar meta-data to organize and optimize.
6. What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
Results!--particularly interoperability.
The Alexandria project is focused primarily on spatially referenced objects, defined as information objects embedded in a spatial frame of reference, and therefore potentially addressable via that frame. For most purposes, spatial can be equated to geographic, and certainly the initial focus of Alexandria is on information objects that are either geographic in content (each item of information contained in the object is referenced to a location on the Earth's surface) or geographically referenced (for example, text objects may be associated with some more or less well-defined "footprint" on the Earth's surface). For such objects, the geographic frame provides a rich and powerful means for finding the object, but for various reasons this frame has not been exploited in traditional libraries.
Metadata is usually defined somewhat loosely as "data about data". It addresses such questions as "what would one want to know about an information object to make it easier to 1) assess its usefulness, 2) find it, and 3) retrieve it?" The traditional library card catalog provides a convenient metaphor for metadata; unfortunately, the metaphor is so familiar and convenient that it is difficult to think beyond it, and to ask how one might best address questions of access and retrieval in a digital library. Moreover, the first generation of digital libraries will have to preserve familiar metaphors if they are to be accepted as useful by the community, and will have to provide effective access paths to non-digital information objects as well as digital ones.
It is easy to see how metadata must address issues in a digital library that have only minimal equivalents in a conventional card catalog, such as data format, or an object's path or URL. At a higher level one can see how metadata should support unfamiliar forms of search that become possible in a digital environment, such as search by geographic location. But at the highest level it is difficult to address the more general question: What kinds of "data about data" are needed to deliver the full potential of the digital library?
Two issues seem paramount in devising a new approach to metadata. The first is that the card catalog, the familiar metaphor, is "flat"--all objects in the traditional library are instances of a single class, the book. Cataloging works well for instances of other classes such as records or musical scores where there is a similarly rigid concept of granularity. It works much less well in cases where rigid granularity breaks down, as with a photograph in a book, or a plate in an atlas. Such instances are already common in the digital world, where the physical granularity of storage medium (the volume) likely has little to do with the logical granularity of information. In a digital world a user should be able to access information at many levels in a hierarchy of organization; it should be as easy to find a map embedded in a CD encyclopedia as it is to find the CD.
Geographic information objects present particular problems of granularity that are reflected in debates over geographic data models. At the most primitive level, a geographic information object contains either a collection of digital representations of discrete features of homogeneous topological dimension (points, lines, or areas) or a digital representation of a field. All more complex objects are aggregations of these basic primitives, but the level of aggregation can be high, even for a simple topographic map. Unfortunately, then, the digital world opens far more options with respect to granularity than did the analog world of paper maps and images.
Digital metadata must include instructions needed for handling the information object, as well as information which is true of all of the contents of the object uniformly, such as author, or publisher, or subject. One can also see content-based searches resolved at the level of metadata if the concept is generalized to include other forms of data abstraction, such as information on particularly important items within the object, or items of interest in particular domains.
The second major issue arises when metadata is defined as data needed to help the user assess the object's fitness for some specific use. Now metadata is no longer a function of the object alone, but of the object, the user's understanding of the domain, and the intended use. The card catalog metaphor, which maps an information object into a single record intended for all users and uses, is clearly inappropriate, as are proposed metadata standards that rely heavily on this metaphor.
Suppose, for example, that I am planning a trip to the Huntington Library in San Marino, a suburb of Los Angeles, from Santa Barbara. With no understanding of the domain, I might request data on locations of libraries, streets, and highways within a large area of Southern California. With a modest understanding of the domain, I might transform this task into a request for a geographic information object created from a map at an appropriate scale, knowing that the necessary highways, streets, and location of the Huntington Library will likely be included in such objects. With a higher level of understanding, I might request the Rand McNally Road Atlas.
In reality, the approaches to metadata being developed for geographically referenced objects seem heavily compromised with respect to these two issues. First, they preserve a concept of information granularity that is inherited from traditional maps, atlases, images, books, and photographs, and is thus unable to deal with more elegant concepts of hierarchical organization of information, and geographical seamlessness. Second, they force a single level of domain understanding, with the result that information is inaccessible to those users whose knowledge of the domain is too low; and frustrating to those whose knowledge is too high.
I would like to discuss two models that address these issues at least partially. In the first, metadata is generated both by the library, as data about each information object, and by the user, as a representation of the information that would be ideally suited to his or her particular use. Search then becomes a problem of finding the optimum match between the user's metadata, and that of each information object in the library. The model assumes that suitable metrics have been defined, with or without user involvement, to structure the search space.
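The first model can be sketched as a scoring problem: the user describes an ideal object, and the library ranks its holdings by how closely their metadata matches. The metrics below (keyword overlap and a ratio of scale denominators), the equal weights, and the sample records are assumptions chosen only to make the idea concrete; the text leaves the choice of metrics open.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two keyword sets, in [0, 1]; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def scale_closeness(wanted: int, offered: int) -> float:
    """Ratio of the smaller to the larger scale denominator, in (0, 1]."""
    return min(wanted, offered) / max(wanted, offered)

def match_score(user: dict, obj: dict, w_theme: float = 0.5, w_scale: float = 0.5) -> float:
    """Weighted combination of the two metrics; the weights are arbitrary here."""
    return (w_theme * jaccard(user["themes"], obj["themes"])
            + w_scale * scale_closeness(user["scale"], obj["scale"]))

def best_match(user: dict, library: list) -> dict:
    """Return the library object whose metadata best matches the user's ideal."""
    return max(library, key=lambda obj: match_score(user, obj))

# Hypothetical library metadata and a user's description of the ideal object.
library = [
    {"id": "road-atlas", "themes": {"highways", "streets", "libraries"}, "scale": 100_000},
    {"id": "geology-sheet", "themes": {"geology", "faults"}, "scale": 250_000},
]
ideal = {"themes": {"highways", "streets"}, "scale": 100_000}
```

Letting the user adjust the weights is one way the metrics could be defined "with user involvement".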
The second model defines metadata as a hierarchy of abstraction. At the lowest level are the most mechanical and detailed descriptions, likely fully understandable only by data producers and publishers, and linked to a low level of information granularity, such as a single class of items. Above this are a range of higher levels of abstraction, whose effective use requires a more and more elaborate sharing of language and experience between data custodian and user. At higher levels descriptions may span larger and larger numbers of primitive items, such as the Rand McNally Road Atlas, or the USGS National Topographic Series. Such keys are very efficient within an appropriately trained community, but meaningless to others.
It seems clear that information at higher levels of abstraction cannot be structured or codified in any effective way. However it does seem possible to structure the linkages between it and the more structured information at lower levels of abstraction.
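One way to picture the second model is as a tree in which each higher-level description carries explicit links to the lower-level descriptions it spans. In this sketch the high-level label is free text, in keeping with the point that it resists codification, while the links are structured and can be followed mechanically. The class name, level numbers, and labels are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Description:
    label: str                                   # free text at any level
    level: int                                   # 0 = most detailed; larger = more abstract
    parts: list = field(default_factory=list)    # links to lower-level descriptions

    def primitives(self) -> list:
        """Follow the links downward and return the descriptions that have no parts."""
        if not self.parts:
            return [self]
        found = []
        for part in self.parts:
            found.extend(part.primitives())
        return found

# Hypothetical hierarchy: an atlas spans a sheet, which spans two feature classes.
roads = Description("road centerlines, sheet 12", 0)
towns = Description("populated places, sheet 12", 0)
sheet = Description("sheet 12", 1, [roads, towns])
atlas = Description("road atlas", 2, [sheet])
```

A user who knows only the atlas by name can still reach the detailed descriptions through the links, without the atlas label itself being codified.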
1. Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
My basic theory of metadata is that it is intended to serve three functions: identification of potentially useful or relevant information on the network, based on some keyword, subject, or other query constraint; evaluation of found information to determine appropriateness or suitability, based on additional factors such as scope, detail, currency, etc.; and interpretation of selected information based on correct knowledge of format, representation, numeric parameters, and a description of semantic content. I think this basic layered model is applicable to any type of information object, but any attempt to codify specific query elements or attributes that spans information domains or communities is unlikely to succeed.
2. Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
This is the crux of the problem--one person's data is another's metadata. It is fair to say a set of information is only "metadata" in some particular context, and is simply a convenient handle to group such a set of information into a package for searching, browsing, and operation. When issues of data granularity are considered, this becomes even more apparent--information about a collection can be an abstraction or summary of elements within the collection or can be universally applicable across the collection. So the required functions for metadata include everything from documentation of individual data elements to catalog keywords.
3. If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
There is no alternative. The open question is how broad or narrow the domains need be. Is there a geographic domain, or FGDC plus DIGEST plus ??? domains, or environmental resource management plus transportation plus demographic plus ??? domains, and so on. The domains of information are not even hierarchical--instead there are information communities who share some definitions, aggregate or disaggregate others, and redefine still others. It becomes a web of metadata types and definitions.
4. How do the information objects of special interest to Alexandria fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
Geospatial metadata as defined by FGDC and used in Alexandria and other initiatives has already been defined as insufficient for some domains of geographic information, such as complex satellite sensor data structured by orbital coordinate systems. It is clear that any definition must be both extensible in and of itself, and interoperable with other metadata dictionaries. But this cannot be done using the global metadata schema in the sky (as has been established by numerous global schema efforts), nor by explicitly defining mappings between every schema. Research needs to be done to determine a method for publishing a collection's or community's metadata schema and underlying definitions that can be summarily used by a naive browser and which can somehow inform a more knowledgeable browser with the metadata content, organization, semantics, and definitions.
5. Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that might help bridge the distinctions between classes of information objects?
First, we should work against the strict delineation of data from metadata. Instead, attributes of information objects should include all types/methods necessary to completely understand and interpret the object. Then there must be approaches to aggregating attributes as information objects are grouped into classes sharing certain attribute elements (i.e., the attribute is uniform across the collection). This might be strictly defined in the metadata dictionary for a community, but ideally would somehow be inferred from the information content. Otherwise, we just have a straightforward relational schema that works at only a single level of information collection.
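The idea of inferring collection-level attributes from the content, rather than declaring them in advance, can be sketched as follows: any attribute that has the same value on every object is promoted to the collection, and the rest stay with the individual objects. The function name and the air photo attributes are assumptions for illustration.

```python
def split_uniform(objects: list) -> tuple:
    """Return (collection_attrs, per_object_attrs) for a list of attribute dicts."""
    if not objects:
        return {}, []
    first = objects[0]
    # An attribute is uniform if every object has it with the same value.
    shared = {k: v for k, v in first.items()
              if all(k in o and o[k] == v for o in objects)}
    remaining = [{k: v for k, v in o.items() if k not in shared} for o in objects]
    return shared, remaining

# Hypothetical air photo records from a single flight line.
photos = [
    {"flight_line": "A", "frame": 1, "datum": "NAD27", "lens_mm": 152},
    {"flight_line": "A", "frame": 2, "datum": "NAD27", "lens_mm": 152},
]
collection_attrs, per_object = split_uniform(photos)
```

Applying the same step again to a group of collections would yield attributes at the next level up, which is what distinguishes this from a single-level relational schema.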
6. What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
The main point is to ensure involvement of CS, IT, and GIS researchers and allow cross-pollination of ideas.
7. Should metadata be defined as being bound to a collection, or is the information a special type of attribute on each element of a collection that may be implemented as collection or object metadata?
A focussed technical question that may raise general issues: Is the spatial reference system (coordinate system, datum, projection, etc.) of a geospatial feature or coverage an attribute of the geometry that should be tightly coupled to the object, or is it metadata about a collection of one or more objects?
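One possible answer treats the two options as layers rather than alternatives: the reference system is stored once on the collection, and an individual object carries its own value only when it differs. The sketch below uses Python's standard ChainMap to resolve the effective value; the attribute names and datum values are illustrative assumptions, and this is not offered as the position paper's own answer.

```python
from collections import ChainMap

# Collection-level reference system, stored once.
collection_meta = {"datum": "NAD27", "projection": "UTM zone 11"}

feature_a = {"id": "a"}                      # inherits the collection's reference system
feature_b = {"id": "b", "datum": "NAD83"}    # overrides the datum on the object itself

def effective(obj: dict, collection: dict) -> ChainMap:
    """Look up attributes on the object first, then fall back to the collection."""
    return ChainMap(obj, collection)
```

Whether the override should be permitted at all is exactly the policy question the paragraph raises.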
Background: spatially indexed information at MBARI
The Monterey Bay Aquarium Research Institute (MBARI) is a multidisciplinary, ocean research institute with a high technology emphasis in ocean platforms (Remotely Operated Vehicles, Moorings, Autonomous Underwater Vehicles), instrumentation and sensors (in-situ nitrate sensors, broadcast quality video, deep ocean video cameras), and integrated systems (data collection, real time data telemetry, data management, data analysis and visualization).
A prime element of the MBARI mission is technology innovation in support of ocean science objectives and the smooth transfer of technology into operational use. The interdisciplinary nature of MBARI's research, conducted in a dynamic technological environment, creates a challenging data management problem designed to support contemporary and future access to scientific information artifacts.
Our scientific information is encapsulated in all classes of data: numbers, text, sets of numbers, sound, image, and video. In addition, there is an information hierarchy in which each level adds a new layer of value-added interpretation. We characterize these levels as measurements, observations, interpretations, and knowledge. Information artifacts in each of these classes must be "linked" to any source artifacts and transformation (i.e., interpretive) processes that were used in their creation. We refer to this as "data lineage". Essentially all the data collected and used by scientists at MBARI require spatial indexing.
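Data lineage as described here amounts to a graph in which each derived artifact points to its sources and to the process that produced it. The sketch below traces such links back to the raw material. The artifact names, process names, and the dictionary layout are hypothetical and do not describe MBARI's actual data model.

```python
# Each artifact maps to a list of (source artifact, transformation process) pairs.
LINEAGE = {
    "abundance-map": [("annotations", "spatial aggregation")],
    "annotations": [("video-tape", "expert annotation")],
    "video-tape": [],          # a raw measurement has no sources
}

def trace(artifact: str, lineage: dict) -> list:
    """Return (derived, source, process) steps leading back to the raw artifacts."""
    steps = []
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        if current in seen:    # guard against revisiting shared sources
            continue
        seen.add(current)
        for source, process in lineage.get(current, []):
            steps.append((current, source, process))
            stack.append(source)
    return steps
```

Here the map is an interpretation, the annotations are observations, and the tape is a measurement, so the trace walks down the information hierarchy named in the paragraph.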
For example, our video data resource is useless on its own. Researchers use these data to determine descriptive and quantitative attributes of the environment (visibility, particulate distribution, etc.) and of objects in the environment (organisms, geologic features, etc.). Integration of the scientific information extracted from video; associated spatial/temporal data; and collocated measurements of physical, chemical, biological, and geological properties allow the researcher to determine patterns of abundance and to focus on processes that determine observed patterns.
To date, MBARI's video data stream has originated from 35 different chief scientists on over 700 ROV dives. Over 4000 video tapes have been viewed and annotated, spanning over 2000 hours of video data collection time. Intense efforts of several individuals have resulted in approximately 300,000 annotated video frames accessible through the MBARI Observations Data Base. Almost daily someone queries this wealth of information to identify video of features, events, and organisms and to correlate it with other environmental data by space, time and other criteria.
The critical element of our approach to this problem is enterprise wide information modeling. This approach provides a data architecture for all pertinent information created and used throughout the scientific process. This process essentially eliminates the metadata/data conceptual divide and enables a coherent, end-to-end data management strategy.
Assessment of effectiveness
This is a difficult issue in a community where data sharing beyond the creator is not highly valued. In the most fundamental sense, if the creator uses the system to store, analyze, and display their data and information, then we are being effective. However, sustainable success requires that the data is accessible and assessable to primary (the creator and her colleagues), secondary (other scientists, future scientists), and tertiary users (educators, students, the public). Effective access requires that users can acquire all, and only, the information which is pertinent and significant to their needs. In addition, the user must be able to assess the quality, authority, and relevance of that information which is retrieved.
If this is the ultimate measure of success, then a much wider effort is required, both technically and culturally. The information architecture must consider the needs not only of scientists but of other users of scientific information, such as resource managers, educators, students, data managers, and librarians. If the architecture can accommodate this need and we can create tools to support the creators in populating the data repository, then we are on the right track.
Vision for digital library information
The digital library of the 21st century will not be a monolithic, centralized facility with stores of electronic monographs and serials that people have remote access to via the collections catalog. In fact, to be successful, the library of the 21st century will be a critical hub in a completely reengineered process of knowledge transfer. Rather than be the receptacle for information artifacts at the end of the publication pipeline, the new age library will harness information technology to become an active agent in the creative processes of information creation, publication, dissemination, value added reuse of data and information, interactive collaboration between people, databases, and knowledgebases (spanning time and geography differences between participants), etc.
In addition, it seems clear that the 21st century library will be something different to each enterprise and/or community that it serves. It will enable each user to have his or her own view of collections and services that are relevant to the knowledge tasks of that individual. In fact, the library infrastructure should allow individuals and communities to create their own "virtual" library of collections and services with added value specific to a client base that that individual wants to serve.
Issues and impediments
1. Digital library information artifacts will have a much higher level of granularity than today's physical library artifacts (e.g., client may want to retrieve multimedia pieces from books and articles rather than the whole package). How does this affect cataloging practice and economic compensation models for authors? It seems that digital library projects that move forward technically without considering these issues are in danger of becoming no more than a nice proof of technical concept.
2. Accepting the premise described in item 1 requires consideration of a completely new look at the cataloging process in terms of what types of artifacts get cataloged, how artifacts which are aggregates of other cataloged items are cataloged, and how usage is tracked and creators protected/compensated while equitable and secure access is provided to the client. This new look can be framed in terms of the metadata problem for spatial data, as long as a broad view is taken in terms of all the users of data and the classes of data and information which require spatial access.
Each enterprise comprised of knowledge workers depending on spatial information should be modeled based on the following definitions:
Metadata: A class of data used to describe the content, structure, representation, and context of some well defined data artifact. Content metadata defines data items in terms of description, units, and legal value domains. Structure metadata defines the groupings of data items into logical aggregates (e.g., fields for some record type) which typically correspond to real world entities. Representation metadata defines the value representation for each data item and the physical format for the whole data artifact. Context metadata defines all ancillary information associated with the creation and use of the data artifact. Note that context metadata typically has content, structure, and representation metadata associated with it as well.
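The four kinds of metadata in this definition can be laid out as a simple data structure. The sketch below is one possible reading, with content metadata carrying description, units, and a legal value range, structure metadata grouping items into aggregates, and so on. All class names, field names, and sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    """Content metadata for one data item."""
    name: str
    description: str
    units: str
    legal_values: tuple          # (minimum, maximum) of the legal value domain

    def accepts(self, value: float) -> bool:
        low, high = self.legal_values
        return low <= value <= high

@dataclass
class ArtifactMetadata:
    content: list                # ContentItem entries
    structure: dict              # aggregate name -> names of the items it groups
    representation: dict         # per-item value representation and the file format
    context: dict = field(default_factory=dict)   # ancillary creation and use information

# Hypothetical example: a record of temperature against depth.
cast = ArtifactMetadata(
    content=[
        ContentItem("temp", "water temperature", "deg C", (-2.0, 40.0)),
        ContentItem("depth", "depth below surface", "m", (0.0, 11000.0)),
    ],
    structure={"cast_record": ["depth", "temp"]},
    representation={"temp": "float32", "depth": "float32", "file_format": "CSV"},
    context={"platform": "ROV", "purpose": "illustrative example"},
)
```

The context entry could itself be described by a nested ArtifactMetadata, reflecting the note that context metadata has its own content, structure, and representation.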
Information Model: A specification of the objects (things, people, events, concepts) in the real world about which one needs to maintain information. The specification should identify and define the objects, important attributes of the objects, and inherent relationships between the objects. The model may be expressed in a formal language and/or narrative text with a graphical depiction which follows well defined notation standards.
The information model should be a technology independent description of the enterprise being modeled; however, the role of technology should also be modeled.
The metadata mechanism, driven by the enterprise information model, must provide a consistent framework which accomplishes the following objectives: provide meaningful selection criteria for accessing pertinent data; support the translation of logical concepts between communities; support the exchange of data stored in differing physical formats; and support the assessment of data artifacts by consumers.
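The four classes of metadata defined above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names, and the sea-surface temperature example are hypothetical and do not come from any ADL or MBARI system.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Content metadata for one data item: description, units, legal values."""
    name: str
    description: str
    units: str
    legal_range: tuple  # (minimum, maximum), inclusive


@dataclass
class ArtifactMetadata:
    """Metadata for one data artifact, split into the four classes above."""
    content: list         # ContentItem objects
    structure: dict       # logical aggregate name -> list of item names
    representation: dict  # item name -> value representation
    context: dict = field(default_factory=dict)  # ancillary creation/use info

    def is_legal(self, item_name, value):
        """Check a value against the legal value domain of a data item."""
        for item in self.content:
            if item.name == item_name:
                low, high = item.legal_range
                return low <= value <= high
        raise KeyError(item_name)


# Hypothetical artifact: a file of sea-surface temperature observations.
sst = ArtifactMetadata(
    content=[ContentItem("temp", "sea surface temperature", "deg C", (-2.0, 40.0))],
    structure={"observation": ["temp"]},
    representation={"temp": "ASCII decimal, one value per line"},
    context={"creator": "hypothetical cruise log"},
)
```

A consumer assessing the artifact could then call sst.is_legal("temp", 15.0) to test a value against the documented domain.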
3. Each library user community must do a better job in understanding and specifying metadata needs and in establishing the policies, procedures and technical infrastructure required to implement the needed solution. There are several issues surrounding this challenge:
* Full documentation of data artifacts must be valued by the creators and they must be given the tools to ease the burden of this process.
* Creators, digital librarians, and users need clearly defined information standards and enforcement mechanisms assuring:
Minimal Consistency
Minimal Completeness
Minimal Assessability (QA and Lineage)
Standard Format: Content, Structure, Representation
* Since the 21st century digital library will be based on distributed holdings, an underlying compatibility is needed between data management activities at the point of origin, added value access centers, and general archive centers. Some mechanism must be defined to coordinate these activities over long periods of time while allowing for autonomous control of technology assimilation and data management approach. Policies, procedures, and technology must be put into place and coordinated across all phases of the data artifact life cycle.
1. Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
Probably the first thing we need to settle is the definition of metadata. The term is used by computer scientists to mean one thing and librarians to mean something else, with many other flavors beside. To quote from a recent IEEE paper:
To a computer systems engineer metadata means physical level information like file names and formats, data types, and hash tables, i.e., what is necessary to decode a sequence of bytes into basic elements recognized by a general purpose programming language. To a database manager metadata may mean the contents of a schema, i.e., names for all the classes of data objects in the database, a precise statement of all their attributes, and of the relationships between them, and a characterization of the questions that can be asked of the database. It may also mean a collection of rules and heuristics modeling standard operating procedures in some disciplinary domain, which can be used to frame and interpret interactions with users and other databases. To a physical scientist metadata may be a critical calibration constant, i.e., a number to be placed in a formula used to transform the data, or it may mean a natural language description of the measurement process of which the data was the outcome. To an intelligent novice exploring a new domain, it may simply be a guide to where to find more information.
My bias is toward metadata as a description of "content, quality, condition and other characteristics of data. Metadata help a person to locate and understand data" and to Bruce Gritton's definition in his paper for the Alexandria Design Review meeting: "A class of data used to describe the content, structure, representation, and context of some well defined data artifact." I am therefore closer to the last meaning in the above quote but from the point of view of the creation of interoperable content metadata so that data and information can be located and evaluated appropriately in an inter-networked environment.
The other critical contextual orientation that needs to be understood in this discussion is the level of metadata application: whether it is a local implementation, a domain or community implementation, or a network interoperability issue.
So, with that preamble, I say that it is possible to develop agreements on metadata for network interoperability across all of the DLI projects, but that this interoperability may be at a generalized, relatively shallow level. More specific metadata standards may be agreed upon between individual projects, possibly linked to the use of compatible software for discovery and use of the data and information.
2. Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
An understanding of the components of metadata for the description of data/information artifacts is the key to this discussion. Existing standards, such as MARC and the FGDC Content Standard, contain a mixture of these components and too often the mixtures are presented and considered as a whole to the dismay of those who are considering their use in new situations. Problems are created when it is not clear what the purposes of the standards are and how the components are put together to serve those purposes.
One breakdown of metadata components might be:
Purpose: exchange, guide for the creation of descriptions,...
Data elements: primitive items of data
Structure: tagging, hierarchy or nesting, required or not required, etc.
Entry guidelines: formats of entry, sources of the information, etc.
Authorized (controlled) data values: subject heading lists, thesauri, category lists, etc.
This is a list of components that I developed to try to understand why grappling with metadata issues is so confusing and frustrating. I suspect that through discussion this list will be modified. But the point I am trying to make is that these components can be dealt with separately. Data elements can be and should be discussed separately from entry guidelines (cataloging rules). Authorized vocabularies for describing specific data attributes need to be dealt with separately. Specific implementations of metadata schemes can choose from different options for each component to meet particular purposes.
To illustrate what I mean, I would like to describe briefly a recent project that I headed for the NASA Scientific and Technical Information (STI) Program. Their RECON storage and retrieval system is being replaced by a COTS software product. This offered the chance to redesign the record structure. NASA and other federal STI programs have used the COSATI standard for their record format and their cataloging guidelines. COSATI is a standard only among these agencies and is not consistently implemented anyway; conversion programs were still needed to import records from other agencies. The design team looked to the USMARC record format to see if the data elements contained in it would handle the descriptive fields needed for the RECON technical report and open literature database. These data elements worked very well; relatively few local adaptations had to be made and these could be made with USMARC's local field options. Subsequently, the CENDI group of federal STI programs formally adopted the USMARC record format for the exchange standard among the federal agencies. Each agency will map its bibliographic record structure to USMARC for export and in the process they will gain experience with the USMARC format which they may decide to adopt internally. NASA did not use the USMARC leader or fixed fields internally. They did not adopt AACR2 cataloging rules which libraries use with the USMARC record. But the data elements proved to be very useful. I could go on talking about the advantages of going to the USMARC standard for the NASA STI program but won't do so here. The point is that it was very effective to adopt the data elements from USMARC for the purposes of the NASA STI Program and many advantages for data sharing resulted from this decision.
Metadata standards that combine data elements, entry guidelines, required and optional designations, and controlled value lists (or combinations of these) are best suited as profiles for particular communities of users. As such, they serve to guide the creation of metadata records; they lead to consistency of presentation and to a level of interoperability within the community that is greater than that available more generally through the network. But such standards inevitably need to be modified when they are adopted by other interest groups. A case in point is the recent effort by the National Biological Service to adapt the Federal Geographic Data Committee's Content Standards for Digital Geospatial Metadata for their use. The required and optional data elements are different for the biological scientists; some of the FGDC field labels are too specific to the geospatial applications; some needed descriptive elements have to be added; some category lists have to be expanded; etc.
I propose that we should work toward a standard at the network discovery and use level that defines sets of data elements from which domain groups can choose for their own specific metadata profiles. Mappings from local or domain metadata formats to the "registered data elements" would be the means by which metadata is shared and made available through the Internet. This would not substitute for the MARC exchange format or for any other exchanges that require a specific data structure. The way in which we construct, manage, and maintain a list of "registered data elements" needs to be reviewed. I believe that only a fully supported effort in this direction will result in a standard that is of long term use. Existing activities along this line include the Z39.50 registered attribute sets (e.g., bib-1, gils, stas, and the proposed geo attribute sets).
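The idea of mapping local or domain formats onto "registered data elements" might look something like the following sketch. The registry contents, profile names, and field labels are all invented for illustration; they are not taken from any Z39.50 attribute set or existing standard.

```python
# Hypothetical registry of shared element names.
REGISTERED = {"title", "originator", "publication_date", "bounding_box"}

# Hypothetical crosswalks: local field label -> registered element name.
CROSSWALKS = {
    "local_geo": {"Title": "title", "Originator": "originator"},
    "local_bib": {"ttl": "title", "auth": "originator", "pubdate": "publication_date"},
}


def to_registered(record, profile):
    """Translate a local record into registered element names.

    Fields without a mapping are left out of the shared view; they remain
    available in the local record.
    """
    crosswalk = CROSSWALKS[profile]
    shared = {}
    for label, value in record.items():
        element = crosswalk.get(label)
        if element in REGISTERED:
            shared[element] = value
    return shared
```

Under this arrangement, two communities with different local labels ("Title" and "ttl") both expose a "title" element to network discovery tools, while keeping their own profiles internally.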
With an open, extensible metadata approach, the needs of individual communities will be met while maintaining the ability to access one another's data holdings through network discovery and use tools. There will be great variety in the way records are constructed and populated. The metadata structure needs to include modifiers or identifiers to indicate the particular entry guidelines followed in the creation of the record, the particular controlled vocabulary scheme used to indicate subject content, the language of the entry, etc. An example of this approach is the so-called "Dublin Core" metadata format that was developed under the sponsorship of NCSA and OCLC. Provision is made for "each element to be modified by an optional qualifier. If no qualifier is present, the element has its common-sense meaning; otherwise, the definition of the element is modified by the value of the qualifier." Examples include the use of the modifier "scheme=" to identify the subject heading list used, etc. Search tools for network discovery (and for local systems that house records based on multiple schema) will have to develop ways to take advantage of whatever control has been applied to the data elements, adapting to multiple implementations across distributed or combined data and information sources.
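A minimal sketch of how a search tool might take advantage of an optional qualifier follows. The Element class, the match_subject function, and the sample record are assumptions made for illustration; the scheme value "LCSH" is just an example label, and none of this reproduces the actual Dublin Core syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    name: str
    value: str
    scheme: Optional[str] = None  # qualifier; None means common-sense meaning


def match_subject(elements, term, scheme=None):
    """Return subject elements whose value equals term (case-insensitive).

    If a scheme is requested, only elements tagged with that scheme match;
    otherwise any subject element matches regardless of qualifier.
    """
    hits = []
    for e in elements:
        if e.name != "subject" or e.value.lower() != term.lower():
            continue
        if scheme is None or e.scheme == scheme:
            hits.append(e)
    return hits


# Hypothetical record mixing a controlled and an uncontrolled subject entry.
record = [
    Element("title", "Coastal erosion survey"),
    Element("subject", "Erosion", scheme="LCSH"),
    Element("subject", "erosion"),
]
```

A broad search ignores the qualifier and finds both subject entries; a search restricted to one scheme finds only the controlled entry.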
Terry Winograd of Stanford University included a list of metadata components and a categorization of "styles of standards" in his keynote address to the ACM SIGIR International Conference, Seattle, July 9-13, 1995. I took notes but didn't get a copy of his text, so I quote from the talk with caution. He included a list of metadata components (12 of them in my notes) and three styles of metadata standards: closed (e.g., email headers with unregistered extensions), kitchen sink (e.g., the Scientific and Technical Attribute Set (STAS)), and extensible components (units of element sets and an integrating framework). The STAS set may be a kitchen sink of data elements but it does represent the real world complexity and variety of existing information attributes. Whatever integrating framework is possible for an overall metadata standard, it must accommodate this great, changing variety.
For an "integrating framework" for metadata elements, I propose that elements of thesaurus structure be used to provide relationships between data elements. This would include broad facets or top terms that group categories of data elements together; broad term and narrow term relationships to relate broad data elements (e.g., Title) to more specific data elements (e.g., Series Title); related data elements to provide links between data elements not linked by the broad term, narrow term link; and "lead-in" terminology to lead someone from synonymous wording (or nearly synonymous) to the registered name for the data element. This structure is supported by NISO standard Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Thesauri. The word "Monolingual" in the title of Z39.19 is a reminder that multilingual aspects of data element definition also need to be accommodated.
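The thesaurus relationships proposed above (broad term, narrow term, related term, and lead-in terminology) could be represented roughly as follows. The element names and the lead-in phrases are hypothetical; Z39.19 describes the relationship types but not this code.

```python
# Hypothetical registry of data elements with thesaurus-style relationships.
BROADER = {"series_title": "title", "uniform_title": "title"}  # narrow -> broad
RELATED = {"title": {"originator"}}                            # related elements
LEAD_IN = {"name of work": "title", "heading": "title"}        # synonym -> registered


def resolve(term):
    """Map a lead-in (synonymous) term to its registered element name."""
    key = term.strip().lower()
    return LEAD_IN.get(key, key)


def broader_chain(element):
    """List the element followed by each successively broader element."""
    chain = [element]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def narrower(element):
    """List the elements whose broad term is the given element."""
    return sorted(n for n, b in BROADER.items() if b == element)


def related(element):
    """List elements linked by a related-term relationship."""
    return sorted(RELATED.get(element, ()))
```

A discovery tool could use broader_chain to fall back from a specific element such as Series Title to the broader Title element when a collection does not record the narrower one.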
3. If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
My view, as stated above, is that domain communities must develop their own metadata profiles to meet their own unique needs. These profiles should be built on/from existing data element definitions, structures, entry guidelines, and controlled vocabularies and category schemes, with an open process of proposing and approving new data elements. Metadata templates or models can be used as starting points. Domain profiles will include guides for the creation of metadata, training workbooks, and possibly the maintenance of controlled vocabularies. The bridges or coordination of these domain profiles should be through international or ad hoc standards that address the components of metadata.
4. How do the information objects of special interest to Alexandria fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
Since I am not familiar with the work on metadata that Alexandria has done, I can't comment on any specifics. I should think, however, that others would want to take advantage of the work you have done and possibly negotiate with you on any adjustments needed to fit their special circumstances.
I am concerned that other collection builders become more aware of the geospatial aspects of all types of materials and describe these aspects according to the bounding box coordinate description (at a minimum) that can be used by retrieval systems that have geospatial search capability. This extends to the SGML coding of technical reports and articles where the study areas, if present, should be described spatially and coded in such a way that they can be recognized as spatial descriptions. Alexandria can influence this awareness through the DLI projects and through public promotion of the power of geospatial retrieval across all types of data and information.
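As a rough illustration of why a bounding box is a useful minimum, the sketch below shows the overlap test a retrieval system with geospatial search capability might apply. It assumes coordinates in decimal degrees and boxes that do not cross the 180-degree meridian; the class name and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other):
        """True if the two boxes share any area or edge."""
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


# Hypothetical query region and two described study areas.
query = BoundingBox(-120.5, 34.0, -119.0, 35.0)
report = BoundingBox(-119.9, 34.3, -119.6, 34.5)
far_away = BoundingBox(-75.0, 40.0, -73.0, 41.0)
```

A technical report whose study area is coded this way can be retrieved by the same spatial query that finds maps and images.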
5. Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that might help bridge the distinctions between classes of information objects?
The notion of "novel" is probably different in this case depending on one's point of view and experience base. A solution from computer science may look novel to the library side, for example. Novel approaches are also considered suspect when proposed from the outside to existing operating cultures. I believe that the solution lies in looking for the bridges and generalizations among the current approaches rather than in striving for novel approaches.
6. What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
The National Information Standards Organization (NISO) is reviewing the metadata issue at the next meeting of their Standards Development Committee. They should be encouraged to establish a committee for a metadata standard and the committee should have active representation from the DLI community.
DLI should also participate in the coming IEEE Metadata Workshop, April 16-18, 1996, in Silver Spring, MD. The deadline for paper abstracts, poster-demo abstracts, and panel proposals is December 10, 1995; acceptance/rejection notification is January 25, 1996; final hard copies of the papers (4-6 pages) are due March 27, 1996.
CERES overview
The California Environmental Resources Evaluation System (CERES) has as one of its primary goals the exchange of environmental and cultural information between CA state government agencies and the public sector. Our current undertakings in this task have been to try to design a strategy in which individual organizations maintain their own datasets and distribute metadata regarding those datasets by a standard method. The effort has been awkward in the sense that the methods of data description and dissemination have outpaced our ability to provide tools using these methods.
CERES metadata efforts have been concentrated on the creation of a set of metadata standards. The standards are Standard Generalized Markup Language (SGML) document type definitions (DTDs). The metadata standard consists of four DTDs, used to document datasets. The main basis for this is the FGDC's Content Standards for Digital Geospatial Metadata, and the NBS's Content Standards for Non-Geospatial Metadata. Modifications have been made to both, however, to include other features, including:
* Explicit HyTime linking between metadata documents
* Identification of individual metadata through the citation reference
* Some rearrangement and simplification of hierarchical arrangement
The other standard relates to data providers describing themselves and links to their metadata. The goal of this standard is to allow for the creation of a distributed set of metadata catalogs, which is freely available in a standard manner.
These standards are all described in the CERES Metadata Standard Page ( http://ceresweb.ucdavis.edu/standards/index.html ). The standards are currently not complete.
Questions
In answering these questions I can already anticipate some of the problems in reaching common ground with respect to metadata. The relationship of metadata to data, to the DLI as a whole, and even the word choice, differ from my usual way of thinking about metadata. The question I have is how can efforts in metadata like that of the DLI affect the practical creation and maintenance of metadata records for other organizations wishing to share their data more effectively.
1. Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
I think that any definition of metadata will end up as having a number of functions or elements that can be thought of as common to all information objects, and then another set that are clearly related only to a subset of objects. The further into commonality metadata can be pushed, the better, but some grand federation of all metadata aspects, I think, is not a practically obtainable goal.
2. Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
No one seems to agree on any definition of metadata more elaborate than "data about data". Here's a pessimistic but fairly accurate definition from the CNI White Paper on Networked Information Discovery and Retrieval (http://www.cni.org:80/projects/nidr/www/outline.all.html):
Metadata, literally, means "data about data"; our research has traced the earliest use of this term back to about 1976...The origins of the term are murky; it seems to have been used to describe a range of concepts that evolved in areas such as the scientific data management, information management, archival, distributed/federated database, and artificial intelligence research communities during the 1970s and 1980s...However, our feeling is that at this point "metadata" as a descriptive term has become so debased by overuse (and means so many different things in different communities and contexts) that it is now virtually meaningless without extensive qualification; unfortunately, it has also become a very fashionable term. The very vagueness of the term metadata today makes it easy to offer sophisticated-sounding proposals about using metadata in various ways which seem to be almost impossible to reduce to practice, or which are extremely pedestrian when actually implemented.
It would probably be difficult even to come to general agreement on the functions that metadata is supposed to serve. Here's a list of most of the functions that I would see metadata as serving. The perspective here is basically on geospatial digital metadata.
Metadata is a document: One major function of metadata is as a document that can be shared among prospective users, so that they can get a handle on the content, quality, etc. of a dataset, and judge its usefulness to them.
Metadata is a database: Another major function of metadata is its use in information discovery and retrieval.
Metadata is a format description: Another use of metadata is to describe the machine representation of the dataset. This could include application descriptions (e.g., a WordPerfect file), geospatial framework descriptions, or descriptions of character and number representations within the dataset.
Metadata is a history: Certainly for geospatial data, there is a clear need for the lineage of a dataset to be well documented.
Metadata is a link: Metadata can also be used to point to other pertinent information sources, including documents, or points of contact.
3. If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
4. How do the information objects of special interest to Alexandria fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
There is certainly a need for better interoperability. An optimistic viewpoint is that any metadata definition regarding, say, geospatial datasets, will be about 75-90% compatible with another metadata definition of geospatial datasets. However, there's currently pretty much no chance of interoperability, at least in the sense that one could retrieve metadata from one source and easily plug it into a metadata collection from another source. CERES's metadata standard effort is an attempt at describing an interoperable transfer scheme.
5. Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that might help bridge the distinctions between classes of information objects?
I really don't see DLI's requirement for a novel approach to metadata. I think the most pressing need is for a metadata concept that allows an inheritance method between larger information objects and their associated objects, and their objects...and their granules.
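One simple way to picture such an inheritance method is a lookup that walks from a granule up through its containing objects until it finds a value. The sketch below is only that; the InfoObject class and the atlas example are hypothetical and not part of any CERES or DLI design.

```python
class InfoObject:
    """An information object whose metadata falls back to its parent's."""

    def __init__(self, name, metadata=None, parent=None):
        self.name = name
        self.metadata = dict(metadata or {})
        self.parent = parent

    def lookup(self, key):
        """Return the nearest value for key, walking up toward the root."""
        node = self
        while node is not None:
            if key in node.metadata:
                return node.metadata[key]
            node = node.parent
        raise KeyError(key)


# Hypothetical hierarchy: an atlas, one sheet in it, and an inset on the sheet.
atlas = InfoObject("atlas", {"publisher": "hypothetical press", "datum": "NAD27"})
sheet = InfoObject("sheet 12", {"scale": "1:24000"}, parent=atlas)
inset = InfoObject("inset A", {"scale": "1:6000"}, parent=sheet)
```

Here the inset records only what differs from its parents (its scale) and inherits everything else, so metadata need not be repeated at every level of granularity.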
I guess I feel that once you get into the ideas of content-based searching, you quickly leave the realm of metadata and enter the realm of data mining. Finding Interstates, for example, might mean looking for I-??? in text, and long straight lines in photos. None of this information would I see as associated with metadata.
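The text half of the Interstate example can be made concrete with a pattern match over document content. This sketch assumes route designations of the form "I-" followed by one to three digits; the function name is hypothetical, and it says nothing about the much harder problem of finding long straight lines in photos.

```python
import re

# "I-" followed by one to three digits, bounded on both sides by word breaks.
INTERSTATE = re.compile(r"\bI-\d{1,3}\b")


def find_interstates(text):
    """Return Interstate designations found in free text, in order."""
    return INTERSTATE.findall(text)
```

The search runs against the content itself rather than against any descriptive record, which is the distinction drawn above between metadata and data mining.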
6. What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
Certainly, it is in everyone's best interest to start getting a broader use of a single scheme for metadata.
I think the best mechanism for promoting the results, and promulgating DLI metadata concepts, is probably the creation and distribution of applications that practically demonstrate the utility of those concepts.
Best to make my confession at the start! I am a mathematician, not a librarian. And I am not a metadata expert. But I have wrestled with some problems of constructing indexes of Internet resources for the High Performance Computing and Communications Program. And the company I represent, Interconnect Technologies Corporation, has worked with government and commercial clients to develop catalog records as a step in the process of indexing information assets, such as technical reports and software.
So I appreciate the importance of metadata as an extension of the concept of a catalog and the need for standardization. I am eager to hear the views of the experts at the metadata workshop and to participate in discussions (and, hopefully, solutions!) of the problems that we face as practitioners. I believe that computer technologists and librarians working together to define and standardize metadata structures and categories can help make the Internet a much better place for finding information.
A little background will help make my answers to your questions more intelligible. The word metadata is typically used in a more general sense than the one we use. At Interconnect we think of each catalog record as a sort of proxy for an information asset, composed of two parts: the (normally static) asset-description part that would be typical of, say, a USMARC record, and a (dynamic) part that documents the progress of the asset throughout its life cycle. This second part we call the "transactional metadata". Viewed in this way, a catalog record can be thought of as a simplified model of an asset along with its history and current state within the controlling organization.
Here, then, are my answers to your questions:
1. Is it possible to devise an approach to metadata that spans all information objects, and is valid for DLI as a whole?
The only experience I am aware of that could be used as an empirical basis for answering this very broad question is that of the libraries and their use of MARC. Certainly MARC is a complicated system, but its complexity may be necessary in view of the amazing variety of objects MARC has drawn within its scope. Moreover, (US)MARC is a standard. I suspect that MARC can serve as a foundation for a solution to the problem, but the result might be a much transformed MARC.
Another way to ask the question might be: Can the Internet community assimilate MARC and can MARC be adapted to the full scope of DLI needs? If the requirement is to devise a standard that embraces all information objects, the alternative seems to be to create an equally complex standard that is parallel to MARC but called something else. I think it remains to be seen whether we can devise one approach to metadata that can be stretched to fit all; but it seems that a good starting point would be to begin with the demonstrably stretchable MARC. The success of MARC thus far seems to imply that a solution can be achieved, but MARC's complexity implies that it won't be easy.
2. Can we devise a clear definition of metadata, its component parts and functions, and its relationship to information granularity, that is robust and theoretically sound? What theoretical frameworks already exist?
At Interconnect, we are beginning to address the question of how well MARC can deal with hierarchically structured assets, such as programs and subroutines within a software package. We do not have an answer to this yet, much less an answer to whether MARC can deal with graph structures more general than hierarchies. But if MARC is not satisfactory, is it better to start over with something new (and face the problem of constructing a new standard complex enough to deal with objects that MARC deals with now--as posed in the answer to the previous question)? Or is it better to work on MARC to stretch it still further?
This raises another related question: If you buy into the idea that metadata should include life-cycle transactions, on what theoretical grounds can we proceed? Interconnect has had some encouraging experience in this area. I mentioned above our proclivity to think of a catalog record as consisting of a MARC-like descriptive part and a dynamic transactional-metadata part.
We have worked under government contract to design a directory-based system for administering a NASA software repository: COSMIC, NASA's point of sale for NASA-funded software; we worked closely with a consulting Cornell librarian experienced in cataloging digital media. We used MARC 96X locally-defined fields for incorporating the transactional metadata within a USMARC framework. Our theoretical starting point for organizing the transactional metadata was Bearman's Reference Model for Business Acceptable Communications. We have not had enough experience to say how generally applicable this approach is, but we have had favorable initial results working with COSMIC, using catalog records to document the progress of software, from the time of its submission through review, evaluation, editing, acceptance, sales, and eventual retirement.
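The two-part catalog record described here, a static descriptive part plus a dynamic transactional part, might be modeled along the following lines. This is a sketch under assumptions: the class, the field names, and the date strings are hypothetical, and it does not reproduce the MARC 96X encoding or Bearman's reference model. Only the stage names are taken from the text.

```python
from dataclasses import dataclass, field

# Life-cycle stages named in the text.
STAGES = ["submission", "review", "evaluation", "editing",
          "acceptance", "sales", "retirement"]


@dataclass
class CatalogRecord:
    description: dict                                 # static, MARC-like part
    transactions: list = field(default_factory=list)  # dynamic part

    def record(self, stage, date, note=""):
        """Append one life-cycle transaction to the record."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.transactions.append({"stage": stage, "date": date, "note": note})

    def state(self):
        """Current state is the stage of the latest transaction, if any."""
        return self.transactions[-1]["stage"] if self.transactions else None
```

The descriptive part changes rarely, while the transaction list grows as the asset moves through the organization, so the record carries both the asset's description and its history.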
3. If necessary, can we identify domains of information and digital library function, and associated approaches to metadata definition?
In view of the cost of full-bore MARC cataloging or its future functional equivalent under some name other than MARC, it may be that a tiered approach to DLI information will evolve in which most assets are "scantily" cataloged (by some sort of MARC-Lite, if you will), but others will be more richly cataloged, depending on their importance to users in government and industry, for which a charge-back mechanism (or other form of support) could be devised to defray the costs of cataloging.
An important start at creating a simplified "pre-MARC" catalog record has already been made. The "Dublin Group" was formed earlier this year at the OCLC/NCSA Metadata Workshop in Dublin, Ohio. The Dublin Group, headed by Stuart Weibel of OCLC, has begun devising the Dublin Reference Model for a Core Metadata Elements Set as a shortcut for authors for reducing the initial cost of MARC-like cataloging. (Weibel is listed among the attendees at the DLI Metadata Workshop; it will be interesting to hear what he has to say in this regard.)
So perhaps we could think in terms of a graded continuum of mutually consistent cataloging schemes, going from the simplest and most inexpensive forms, all the way to the most detailed and complete. In this way, small organizations pressed for funds could offer users "entry-level" metadata, but larger more prosperous organizations with valued data in great demand could afford (and profit from) making their data more easily browsed and searched by means of richer metadata.
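"Mutually consistent" tiers could be read as meaning that an entry-level record is a projection of the richer one. The sketch below illustrates that reading only; the core element names are invented and are not the Dublin Core set.

```python
# Hypothetical entry-level element set.
CORE_ELEMENTS = ("title", "author", "subject", "date")


def to_core(rich_record):
    """Project a richly cataloged record down to the entry-level elements."""
    return {k: rich_record[k] for k in CORE_ELEMENTS if k in rich_record}


def is_consistent(core, rich):
    """True if every value in the core record agrees with the rich record."""
    return all(rich.get(k) == v for k, v in core.items())
```

An organization could start with the core record and enrich it later without invalidating searches that relied on the simpler form.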
Here are briefer answers to your last three questions:
4. How do the information objects of special interest to Alexandria fit within broader frameworks? Can the progress already made on defining metadata for such objects be made more broadly compatible and interoperable?
I am sorry that I am not familiar enough with Alexandria to comment.
5. Does a digital library, with its associated concepts of information granularity and content-based search, require an entirely novel approach to metadata for these objects? Are there elements of such a novel approach that might help bridge the distinctions between classes of information objects?
Above I expressed optimism that this could be done.
6. What mechanisms exist for promoting the results of DLI research on metadata within the broader community?
Students graduating from DLI projects and working in industry may prevail on management to improve practice. This has been a traditional path for high-tech innovation. Another is that industrial partners may commercialize aspects of the DLI approaches learned through participation in the DLI program. We hope that we, along with our colleagues in the DLI projects, may serve as examples.
In conclusion, we at Interconnect appreciate the importance of well thought-out catalogs and welcome the opportunity of working within the DLI community to advance the process of metadata standardization.
Workshop Participants
Anderson, Jean T. Programmer ICESS/UCSB 8544 Cliffridge Ave. La Jolla, CA 92037-2110 (619) 453-7315 jta@cts.com
Beard, Kate Associate Professor - NCGIA University of Maine 344 Boardman Hall UMaine, Orono, ME 04469 207 581-2147 beard@spatial.maine.edu
Carver, Larry Map and Imagery Lab Director University of California, Santa Barbara Davidson Library Santa Barbara, CA 93106 (805) 893-4049 carver@sdc.ucsb.edu
Cohen, Doreen Manager, Library Services (SCSC Contract) NASA Ames Research Center M.S. 202-3 (415) 604-6325 doreen_cohen@qmgate.arc.nasa.gov
Delcambre, Dr. Lois Computer Science and Engineering Department P.O. Box 91000 Oregon Graduate Institute (503) 690-1689 lmd@cse.ogi.edu
Drucker, Marilyn Physical Scientist Defense Mapping Agency Mailstop D85 4600 Sangamore Road Bethesda, MD 20816-5003 (301) 227-5045 druckerm@dma.gov
Fischer, Christoph Project Coordinator Alexandria Digital Library Santa Barbara, CA 93106 (805) 893-8589 fischer@alexandria.sdc.ucsb.edu
Folk, Mike HDF Project Manager NCSA Software Development Group 605 E. Springfield Ave, Champaign, IL 61820 (217)-244-0647 mfolk@ncsa.uiuc.edu
Foster, Howard UC Berkeley 483 Soda Hall (510) 642-8234 hfoster@cs.berkeley.edu
Frew, James Specialist ICESS University of California, Santa Barbara, CA 93106-3060 (805) 893 7356 frew@icess.ucsb.edu
Gardels, Kenn University of CEDR 390 Wurster Hall Berkeley, CA 94720-1839 +1 (510) 642-9205 gardels@regis.berkeley.edu kgardels@ogis.org
Gilbert, Mike Central Imagery Office 8401 Old Courthouse Rd. Vienna, VA 22182-3820 (703) 275-5653 mbgilbert@aol.com
Goodchild, Mike UCSB NCGIA Santa Barbara, CA 93106 good@ncgia.ucsb.edu (805) 893-8049
Gottsegen, Jonathan Department of Geography National Center for Geographic Information and Analysis (NCGIA) University of California, Santa Barbara Santa Barbara, CA 93106-4060 (805) 893-8652 Jgotts@geog.ucsb.edu
Gritton, Bruce Manager Monterey Bay Aquarium Research Institute Computer & Information Services 160 Central Avenue Pacific Grove, CA 93950 (408) 647-3700 grbr@mbari.org
Hart, Quinn Dept. Land, Air, and Water Resources UC Davis qjhart@ucdavis.edu
Hill, Linda Global Change Data Management Working Group College of Lib. and Info. Services, Univ. of Maryland Ctr of Excellence in Space Data and Info Sci. (USRA) (301) 286-8875 lhill@usra.edu
Johnston, Doug Director, Geographic Modeling Systems Lab Research Scientist, National Center for Supercomputing Applications University of Illinois at Urbana-Champaign 220 Davenport Hall 607 S. Mathews Ave Urbana, IL 61801 (217)244-5995 johnston@stumpy.gis.uiuc.edu
Lal, Nand Manager, Digital Library Technology Project NASA/Goddard Space Flight Center, Code 935 Greenbelt, MD 20771 (301) 286-7350 nand@voyager.gsfc.nasa.gov
Larsgaard, Mary Assistant Head, Map and Imagery Lab Davidson Library University of California, Santa Barbara 93106 (805) 893-4049 mary@sdc.ucsb.edu
Mangan, Elizabeth Head Geography and Map Division Data Preparation Unit Library of Congress Washington, DC 20540 (202) 707-8520 manga@mail.loc.gov
Marcus, Craig Sr. Research Programmer Informedia Project, CMU 5000 Forbes Ave, Pgh, PA 15213 (412)268-8970 neek@cs.cmu.edu
Miller, Eric Associate Research Scientist Online Computer Library Center, Inc. 6565 Frantz Rd. Dublin, Oh., 43017 (614) 764-6081 emiller@oclc.org
Nagele, Paul A. (703) 285-9238 nagelep@dma.gov
Nebert, Doug Clearinghouse Coordinator USGS Federal Geographic Data Committee MS 590 National Center, Reston, VA 22092 (703) 648-5691 ddnebert@usgs.gov
Overman, Dr. Ron SBER, Room 995 National Science Foundation 4201 Wilson Blvd. Arlington, VA 22230 roverman@nsf.gov
Raugh, Mike Vice President Interconnect Technologies Corporation P. O. Box 4158, Mountain View, CA 94040-0158 (415) 691-4022 raugh@interconnect.com
Smith, Terry UCSB CCSE/Alexandria Digital Library Santa Barbara, CA 93106 (805) 893-2966 smithtr@cs.ucsb.edu
Stevens, Scott scott.stevens@sei.cmu.edu
Stoms, Dr. David M. Manager, Biogeography Lab Department of Geography University of California, SB 93106-4060 (805) 893-7655 stoms@geog.ucsb.edu
Su, Jianwen UCSB Computer Science Department Santa Barbara, CA 93106 su@cs.ucsb.edu
Tolar, Billy FGDC Metadata Standard Coordinator 590 National Center Reston, VA 22092 (703) 648-7759 btolar@usgs.gov
Weibel, Stuart Senior Research Scientist OCLC Office of Research (614) 764-6081 weibel@oclc.org
Wilder, Rayette Project Librarian Center for Scholarly Technology University of Southern California Los Angeles, CA 90089-0182 (213) 740-7123 wilder@usc.edu
Zemankova, Maria National Science Foundation 4201 Wilson Blvd. Arlington, VA 22230 (703) 306-1930 mzemankova@nsf.gov