NGDA First-year Roadmap

This roadmap describes the general direction and scope of the technical development UCSB and Stanford will undertake in the first year of the NGDA project. It reflects some initial decisions about which issues will be addressed and how they will be addressed, and equally importantly, which issues the project will not address, at least not in the first year. During and immediately after development of an initial prototype architecture and system, these decisions will be re-evaluated as experience is gained and new issues come to light.

Goal

The ultimate goal of the project is to answer the question:

How can we preserve geospatial data on a national scale and make it available to future generations?

Notice that the focus here is on a particular type of information—geospatial data—as opposed to other types of information, structured or unstructured. Notice, too, that the project is looking at preservation on a national scale, as opposed to larger or smaller (e.g., campus-wide) scales. The scale is suggestive of the quantity and types of geospatial data to be preserved as well as likely providers and users of such data. Finally, notice the emphasis on future generations, i.e., on long-term preservation.

General direction

To begin answering the above question, we will develop two artifacts.

First, we will develop a prototype archive for geospatial data that addresses the issues related to long-term preservation of digital information in general as well as those specific to geospatial data. The prototype will be a complete, end-to-end system providing both ingest and access interfaces; will be capable of storing low-order millions of items occupying several terabytes; and will be populated both internally by the project and by at least one external provider using the archive's "push" interface (cf. requirement 3 below). The archive will provide two access mechanisms: ADL, to support end user access; and bulk metadata harvesting via OAI to support third-party, value-added services. The archive implementation will define an internal interface that separates the archive's functionality from that of the underlying storage system, thereby facilitating commoditization of storage systems.
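As an illustration of that internal interface, the following sketch shows one way the separation between archive functionality and the underlying storage system might be expressed. The class and method names are hypothetical, chosen here only to make the idea concrete, and are not part of any existing system.

    from abc import ABC, abstractmethod
    from typing import Iterator

    class ArchivalStore(ABC):
        """Minimal abstraction of the underlying storage system: simple files in an
        archive-local hierarchical namespace.  Any storage product able to implement
        these operations could, in principle, be swapped in behind the archive."""

        @abstractmethod
        def put(self, path: str, data: bytes) -> None:
            """Store a file at the given path, creating parent levels as needed."""

        @abstractmethod
        def get(self, path: str) -> bytes:
            """Return the file's contents; raise KeyError if the path is unknown."""

        @abstractmethod
        def delete(self, path: str) -> None:
            """Remove the file; replication, verification, and media migration remain the store's concern."""

        @abstractmethod
        def list(self, prefix: str) -> Iterator[str]:
            """Enumerate the paths stored under a given prefix."""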

Second, we will develop a federated archive architecture that will allow multiple, independently-developed and operated, distributed archives to participate in a larger, unifying federation. The architecture will specify the conditions for participation in the form of programmatic interfaces that archives must implement and other implementation requirements and guarantees that must be satisfied. The architecture will also provide to end users a single, unified view of distributed archival content. The aforementioned prototype archive will serve as the exemplary node in the federation. The Stanford Digital Repository will serve as a second node.
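The precise conditions for participation remain to be specified; purely as an illustrative sketch, the interface below lists the kinds of operations a federation node might be required to expose. All names are placeholders, not a commitment to a particular design.

    from abc import ABC, abstractmethod

    class FederationNode(ABC):
        """Hypothetical interface an independently developed and operated archive
        would implement in order to participate in the federation."""

        @abstractmethod
        def describe_collections(self) -> list:
            """Return collection-level descriptions used for federated collection discovery."""

        @abstractmethod
        def oai_base_url(self) -> str:
            """Return the node's OAI-PMH base URL for bulk metadata harvesting."""

        @abstractmethod
        def resolve(self, item_id: str) -> str:
            """Map a persistent item identifier to a retrievable location at this node."""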

The following requirements and scope limitations further characterize the first-year technical development.

Requirements

  1. The archive will support long-term preservation and access.
    The project is focused on making geospatial data accessible and usable now and arbitrarily far into the future, certainly beyond the time when the applications that created the data are still in active use or even exist. Therefore, the archive will address:
    • persistence, uniqueness, and scoping of item identifiers;
    • characteristics of and policies on acceptable data and metadata formats;
    • metadata requirements; and
    • explicit representation of item semantics.
    A minimal sketch of an item record covering these elements appears after this list.
  2. The archive will be highly scalable.
    Given its national scale, the archive architecture must scale to tens of millions of items occupying hundreds of terabytes.
  3. The archive will provide a "push" interface.
    It is not feasible for any one organization to both operate and populate a nation-wide archive, even given the presence of multiple archives; partner institutions must be enlisted and given incentives to do the bulk of the work in preparing, ingesting, and managing data in the archive. To this end, the archive will support a "push" interface that allows data providers to directly ingest and manage items in the archive.
  4. Archive content will be online.
    To be both immediately useful and politically and financially sustainable, archive content must be online and available to end users. Here, "online" specifically means meeting the expectations of Web users: content must be available readily enough to avoid falling back on an "order"-type interface, in which users place orders and are later notified, asynchronously, that the orders have been filled. Such an interface would represent a qualitative change in how the archive could be used and is incompatible with the data delivery services the archive will need to provide.
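As a concrete, if deliberately simplified, illustration of requirement 1, the record below shows the kinds of information an archive item might carry. The field names are illustrative only and do not reflect a settled data model.

    from dataclasses import dataclass, field

    @dataclass
    class ArchiveItem:
        """Illustrative item record covering the elements listed under requirement 1."""
        item_id: str                  # persistent identifier, unique within a stated scope
        data_format: str              # must satisfy the archive's acceptable-format policy
        structural_metadata: dict     # how the bits are organized (e.g., a GeoTIFF profile)
        descriptive_metadata: dict    # discovery- and evaluation-oriented metadata
        semantics_ref: str            # reference to an explicit definition of what the data means
        data_paths: list = field(default_factory=list)   # files held in archival storage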

Desirable features

  1. The archive should address intellectual property issues.
    Limiting the archive to data in the public domain is unnecessarily restrictive and hurts the usefulness and sustainability of the archive. To accommodate data not in the public domain, the archive should support formal modeling of intellectual property constraints (required attributions, commercial use restrictions, etc.) and, of course, obey them to the extent required.
  2. The archive should support 3rd-party-only data delivery.
    Also in support of data not in the public domain, the archive should allow data to be placed in the archive with the proviso that, until some well-defined point in time, data access requests will be redirected to the original provider or some other party that may wish to apply additional access constraints.
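A minimal sketch of this second feature, assuming a per-item delivery policy recorded at ingest time (the field names are hypothetical):

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class DeliveryPolicy:
        """Hypothetical per-item delivery policy recorded at ingest time."""
        archive_url: str                       # where the archive's own copy is served from
        provider_url: Optional[str] = None     # party handling requests during the redirect period
        redirect_until: Optional[date] = None  # end of the 3rd-party-only delivery period

    def resolve_data_request(policy: DeliveryPolicy, today: date) -> str:
        """Return the URL to which a data access request should be directed."""
        if policy.redirect_until and policy.provider_url and today < policy.redirect_until:
            return policy.provider_url         # 3rd-party-only period: redirect to the provider
        return policy.archive_url              # otherwise the archive serves the data directly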

Scope limitations

  1. The archive will accept only structured items that are defined by metadata.
    To support the identification, storage, and migration of geospatial data, as well as access to it, the archive (and in particular its "push" interface) will accept only items defined by archive-approved structural and descriptive metadata. Semi-structured and unstructured items, such as hypertext and text documents, will be accommodated only to the extent that sufficient metadata for such data types either already exists or can be constructed automatically. A sketch of the ingest-time check this implies appears after this list.
  2. The archive will accept only items for which metadata is freely redistributable.
    The archive architecture described below is fundamentally predicated on using item metadata to advertise the existence of items, search for items, and evaluate item appropriateness. Metadata that is encumbered in any way breaks all these uses. At some point in the future the project may want to distinguish between metadata that supports discovery and evaluation versus "other" metadata, but for the first year at least, the project will confine itself to public domain metadata. Put another way, the burden is on providers to provide suitable and sufficient publicly-available metadata.
  3. The project will not develop an archival storage system.
    Secure, long-term storage requires that bits be verified, replicated at physically distant locations, synchronized, and regularly migrated to new storage media. Both commercial and non-commercial systems are readily available for performing these tasks, and the project will build on them.
  4. The archive will not support fine-grained authentication.
    Given that the archive will be storing mostly public-domain data, and will be operating on national data for the benefit of the nation at large, there is no immediate need for the archive to support fine-grained (i.e., item- and user-level) authentication and access controls. This does not preclude archive-wide authentication or, as mentioned above, modeling of intellectual property rights.
  5. No data will be transferred to the Library of Congress.
    Any such data transfers will be addressed in a subsequent project year, by which time it is anticipated that the Library and the other NDIIPP awardees will have agreed on the details of such transfers, the conditions under which they occur, and the technical infrastructure by which they're accomplished.
  6. The archive will not check for duplicate items.
    What to do about duplicate items in the archive—group them? conflate them? consolidate them? nothing?—can be considered only after the problem has been encountered in practice, if it ever is to any significant extent.
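To make scope limitations 1 and 2 concrete, the sketch below shows the kind of ingest-time check they imply: a submission is rejected unless its formats and metadata satisfy an archive-approved policy and the metadata is freely redistributable. The policy lists and field names are hypothetical.

    APPROVED_DATA_FORMATS = {"GeoTIFF", "Shapefile", "NetCDF"}   # illustrative policy only
    APPROVED_METADATA_SCHEMAS = {"FGDC-CSDGM", "ISO 19115"}      # illustrative policy only

    def validate_submission(submission: dict) -> list:
        """Return the reasons a submission would be rejected (empty list if acceptable)."""
        problems = []
        if submission.get("data_format") not in APPROVED_DATA_FORMATS:
            problems.append("data format is not on the approved-format list")
        if submission.get("metadata_schema") not in APPROVED_METADATA_SCHEMAS:
            problems.append("descriptive metadata does not use an approved schema")
        if not submission.get("structural_metadata"):
            problems.append("structural metadata is missing")
        if not submission.get("metadata_redistributable", False):
            problems.append("metadata is not freely redistributable")
        return problems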

Open issues

  1. Handling of services associated with archive content.
    The Web's (i.e., HTTP's) only access mechanism, downloading data in its entirety, is in many cases insufficient for geospatial data. In some cases, downloading is merely impractical because of the data's size; in other cases, the data takes the form of a complex database that is neither intended to be downloaded in toto nor useful when it is (cf. next open issue). For these reasons, geospatial data is often made available via programmatic services (map services such as WMS, gridded query services such as OPeNDAP, progressive image delivery services such as MrSID, etc.). But in the context of an archive, who runs such services? If the services are hosted externally (e.g., by the original data provider), how are archive data and services associated, and how is synchronization maintained between the archive's copy of the data and the data offered by the services? At a minimum, the archive will need the ability to easily manage associated, changing services; an illustrative service association record appears after this list.
  2. Handling of database-like geospatial data.
    Some geospatial data takes the form of a relatively large number (millions) of relatively small (kilobyte-size) pieces of data tied together through an access service. Examples include long-term instrument observations made available through a protocol such as OPeNDAP and perhaps described by a THREDDS catalog. Neither treating each observation as a separate archive item nor, at the opposite extreme, somehow packaging the data into a single download is likely to be a satisfactory approach to archiving such data, because neither model conforms to how the data is described or used in practice.
  3. Defining and representing item semantics.
    Requirement 1 above states that the archive must explicitly represent item semantics in order to support use of archive items arbitrarily far into the future. But an item's "semantics" can be difficult to delimit and define. Consider an archive item that is a Landsat image in TIFF format. It is clear that knowledge of the TIFF format (what some would call the item's "structural metadata") is necessary to use the item. But that level of "semantics" only describes how the item can be unpacked and interpreted as an abstract two-dimensional image, not what the pixels mean. How much information about the Landsat sensors, the Landsat program, the satellite itself, etc., is also necessary to interpret the item? How much of this information should be stored in the archive? And how should it be represented?
  4. Representing change over time.
    Does the archive need to be able to represent and group together multiple versions of items?
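These questions remain open. The first, at minimum, suggests that each item would need to record which external services offer it and when the archive last confirmed that the service and the archived copy agree; the record below is purely illustrative.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ServiceAssociation:
        """Illustrative link between an archive item and an externally hosted access service."""
        item_id: str
        service_type: str        # e.g., "WMS", "OPeNDAP", "MrSID"
        endpoint_url: str        # where the service is hosted (possibly by the original provider)
        data_fingerprint: str    # checksum of the archive's copy when the link was last verified
        last_verified: datetime  # when the archive copy and the service content last matched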

Preliminary architecture

A preliminary archive architecture is illustrated below. Principal areas of development are indicated in yellow.

[architecture diagram]

An archive consists of a suite of ingest/management services sitting in front of an archival storage system. The ingest/management services present a programmatic interface that data providers can use to "push" data into the archive; these services may work cooperatively with other services such as gazetteer, thesaurus, metadata mapping, and file validation and conversion services. The archival storage system is presumed to provide all functionality related to secure, long-term, distributed storage of simple files in an archive-local hierarchical namespace. On top of this basic functionality, the ingest/management services define and enforce a data model appropriate for geospatial data; assign and resolve persistent names; perform any necessary validations and mappings; store semantic definitions and explicitly associate archive content with those definitions; and associate data with access services.
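To make the division of labor concrete, the sketch below layers a hypothetical "push" interface over the storage abstraction sketched earlier. Every name here is illustrative; the actual interface will be defined during development.

    class IngestManagementService:
        """Hypothetical programmatic "push" interface offered to accredited providers."""

        def __init__(self, store, validate, assign_identifier):
            self.store = store                            # an ArchivalStore (see the earlier sketch)
            self.validate = validate                      # a policy check such as validate_submission above
            self.assign_identifier = assign_identifier    # persistent-identifier minting function

        def ingest(self, submission: dict, data: bytes) -> str:
            """Validate a submission, mint a persistent identifier, and store data and metadata."""
            problems = self.validate(submission)
            if problems:
                raise ValueError("; ".join(problems))
            item_id = self.assign_identifier(submission)
            self.store.put(f"items/{item_id}/data", data)
            self.store.put(f"items/{item_id}/metadata.xml",
                           submission["metadata_record"].encode("utf-8"))  # provider-supplied serialized metadata
            return item_id

        def withdraw(self, item_id: str) -> None:
            """Let the responsible provider remove an item it manages in the archive."""
            for path in list(self.store.list(f"items/{item_id}/")):
                self.store.delete(path)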

We emphasize that the ingest/management services are, as currently envisioned, entirely automated: they simply give an accredited provider the ability to ingest and manage data within the archive. Human-mediated ingest functionality (quality control, manual cleanup and preprocessing of metadata and data, curatorial approval and other workflow management, etc.) is outside the scope of the current architecture, but would fit naturally as additional tools in front of the ingest/management services.

On the other side of the archive, mapping tools make archive content accessible. Mapping to ADL involves mapping archive metadata to the ADL metadata views, grouping content into collections, and creating appropriate search indices.
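As a rough illustration only (the actual ADL metadata views are not reproduced here), the mapping step might look like the following, with placeholder field and bucket names:

    def to_adl_view(descriptive_metadata: dict) -> dict:
        """Map archive descriptive metadata into a discovery-oriented view.
        The target keys are placeholders, not the actual ADL bucket definitions."""
        return {
            "title": descriptive_metadata.get("title"),
            "originator": descriptive_metadata.get("originator"),
            "date_range": descriptive_metadata.get("temporal_coverage"),
            "geographic_footprint": descriptive_metadata.get("bounding_box"),
            "object_type": descriptive_metadata.get("type"),
            "format": descriptive_metadata.get("data_format"),
            "identifier": descriptive_metadata.get("item_id"),
        }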

The federation architecture is provided by ADL and, in particular, by ADL's collection discovery services. Federation over OAI is easily accomplished using off-the-shelf OAI aggregation tools.
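For example, an aggregator can walk a node's OAI-PMH ListRecords responses with nothing more than HTTP and XML parsing; the base URL in the usage note is a placeholder.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url: str, metadata_prefix: str = "oai_dc"):
        """Yield every <record> element from an OAI-PMH repository, following resumption tokens."""
        query = f"verb=ListRecords&metadataPrefix={metadata_prefix}"
        while query:
            with urllib.request.urlopen(f"{base_url}?{query}") as response:
                root = ET.fromstring(response.read())
            for record in root.iter(f"{OAI_NS}record"):
                yield record
            token = root.find(f".//{OAI_NS}resumptionToken")
            if token is not None and token.text:
                query = "verb=ListRecords&resumptionToken=" + urllib.parse.quote(token.text)
            else:
                query = ""

    # Usage (placeholder base URL):
    # for record in harvest("https://archive.example.edu/oai"):
    #     index(record)   # hand off to whatever aggregation step follows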

created 2004-12; last modified 2009-01-12 15:01