NGDA First-year Roadmap
This roadmap describes the general direction and scope of the
technical development UCSB and Stanford will undertake in the first
year of the NGDA project. It
reflects some initial decisions about which issues will be addressed
and how they will be addressed, and equally importantly, which issues
the project will not address, at least not in the first year.
During and immediately after development of an initial prototype
architecture and system, these decisions will be re-evaluated as
experience is gained and new issues come to light.
The ultimate goal of the project is to answer the question:
How can we preserve geospatial data on a national scale and make
it available to future generations?
Notice that the focus here is on a particular type of
information—geospatial data—as opposed to other types of
information, structured or unstructured. Notice, too, that the
project is looking at preservation on a national scale, as opposed to
larger or smaller (e.g., campus-wide) scales. The scale is suggestive
of the quantity and types of geospatial data to be preserved as well
as likely providers and users of such data. Finally, notice the
emphasis on future generations, i.e., on long-term preservation.
To begin answering the above question, we will develop two
artifacts.
First, we will develop a prototype archive for geospatial
data that addresses the issues related to long-term preservation
of digital information in general as well as those specific to
geospatial data. The prototype will be a complete, end-to-end system
providing both ingest and access interfaces; will be capable of
storing low-order millions of items occupying several terabytes; and
will be populated both internally by the project and by at least one
external provider using the archive's "push" interface
(cf. requirement 3 below). The archive will provide two access
mechanisms: ADL (the Alexandria Digital Library), to support end-user
access; and bulk metadata harvesting via OAI (the Open Archives
Initiative Protocol for Metadata Harvesting) to support third-party,
value-added services. The archive
implementation will define an internal interface that separates the
archive's functionality from that of the underlying storage system,
thereby facilitating commoditization of storage systems.
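As a rough illustration of the intended separation, the internal
storage interface might resemble the following sketch (Python; the
names and signatures are assumptions made for illustration, not the
project's actual interface):

    # Illustrative sketch only: the archive proper would talk to storage
    # exclusively through a small, replaceable contract like this one.
    from abc import ABC, abstractmethod

    class ArchivalStorage(ABC):
        @abstractmethod
        def put(self, path: str, data: bytes) -> None:
            """Store a file under an archive-local hierarchical path."""

        @abstractmethod
        def get(self, path: str) -> bytes:
            """Retrieve a previously stored file."""

        @abstractmethod
        def exists(self, path: str) -> bool:
            """Report whether a file is present."""

        @abstractmethod
        def list_paths(self, prefix: str) -> list[str]:
            """Enumerate stored paths under a prefix."""

Any storage system, commercial or otherwise, that can satisfy such a
contract could then be swapped in underneath the archive.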
Second, we will develop a federated archive architecture
that will allow multiple, independently-developed and operated,
distributed archives to participate in a larger, unifying federation.
The architecture will specify the conditions for participation in the
form of programmatic interfaces that archives must implement and other
implementation requirements and guarantees that must be satisfied.
The architecture will also provide to end users a single, unified view
of distributed archival content. The aforementioned prototype archive
will serve as the exemplary node in the federation. The Stanford
Digital Repository will serve as a second node.
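By way of illustration only, the conditions for participation might be
expressed as a small programmatic contract along these lines (the names
below are hypothetical, not part of the architecture specification):

    # Hypothetical sketch of a federation-participation interface; none
    # of these names are defined by the project.
    from abc import ABC, abstractmethod
    from typing import Iterator

    class FederationNode(ABC):
        @abstractmethod
        def describe_collections(self) -> list[dict]:
            """Advertise collection-level metadata describing the node's holdings."""

        @abstractmethod
        def harvest_metadata(self, since: str = "") -> Iterator[dict]:
            """Yield item-level metadata records, optionally restricted by date."""

        @abstractmethod
        def retrieve(self, identifier: str) -> bytes:
            """Return the archived data for a given item identifier."""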
The following requirements and scope limitations further
characterize the first-year technical development.
1. The archive will support long-term preservation and access.
The project is focused on making geospatial data accessible and usable
now and arbitrarily far into the future, certainly beyond the time
when the applications that created the data are still in active use or
even exist. Therefore, the archive will address:
- persistence, uniqueness, and scoping of item identifiers;
- characteristics of and policies on acceptable data and metadata
formats;
- metadata requirements; and
- explicit representation of item semantics.
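To make the preceding points concrete, an item record addressing them
might look roughly like the following sketch (the identifier scheme and
field names are illustrative assumptions, not a project specification):

    # Hypothetical item record; identifiers, fields, and values are
    # invented for illustration only.
    from dataclasses import dataclass, field

    @dataclass
    class ArchiveItem:
        identifier: str                 # persistent and unique; scoping rules to be defined
        data_format: str                # an archive-approved data format
        descriptive_metadata: dict = field(default_factory=dict)
        semantics: str = ""             # reference to an explicit definition of item semantics

    item = ArchiveItem(
        identifier="ngda:example-0001",              # hypothetical identifier
        data_format="GeoTIFF",
        descriptive_metadata={"title": "Example Landsat scene"},
        semantics="ngda:semantics/landsat-geotiff",  # hypothetical semantics record
    )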
2. The archive will be highly scalable.
Given its national scale, the archive architecture must scale to tens
of millions of items occupying hundreds of terabytes.
3. The archive will provide a "push" interface.
It is not feasible for any one organization to both operate and
populate a nation-wide archive, even given the presence of multiple
archives; partner institutions must be enlisted and given incentives
to do the bulk of the work in preparing, ingesting, and managing data
in the archive. To this end, the archive will support a "push"
interface that allows data providers to directly ingest and manage
items in the archive.
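A provider-side call to such a push interface might look something like
this sketch (the endpoint layout and metadata header are assumptions
made purely for illustration, not the archive's actual ingest protocol):

    # Illustrative only: the URL structure and header are invented.
    import json
    import urllib.request

    def push_item(archive_url: str, identifier: str,
                  metadata: dict, data: bytes) -> None:
        """Deposit one item (metadata plus data) into the archive."""
        request = urllib.request.Request(
            f"{archive_url}/items/{identifier}",
            data=data,
            method="PUT",
            headers={
                "Content-Type": "application/octet-stream",
                "X-Item-Metadata": json.dumps(metadata),  # hypothetical header
            },
        )
        urllib.request.urlopen(request)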
4. Archive content will be online.
To be both immediately useful and politically and financially
sustainable, archive content must be online and available to end
users. Here, "online" specifically means meeting the expectations of
Web users, i.e., being available readily enough that the transition to
an "order"-type interface (i.e., an interface in which users place
orders and asynchronously receive notification of order satisfaction)
can be averted. Such an interface would represent a qualitative
change in how the archive could be used and is incompatible with the
data delivery services the archive will need to provide.
5. The archive should address intellectual property issues.
Limiting the archive to data in the public domain is unnecessarily
restrictive and hurts the usefulness and sustainability of the
archive. To accommodate data not in the public domain, the archive
should support formal modeling of intellectual property constraints
(required attributions, commercial use restrictions, etc.) and, of
course, obey them to the extent required.
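As a sketch of what "formal modeling" could mean here, an intellectual
property statement might be reduced to a handful of machine-checkable
fields (the fields below are assumptions, not a rights-expression
standard):

    # Hypothetical rights statement; fields are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class RightsStatement:
        public_domain: bool = True
        required_attribution: str = ""      # text that must accompany any use
        commercial_use_allowed: bool = True
        redistribution_allowed: bool = True

    def permits_commercial_use(rights: RightsStatement) -> bool:
        """One example of a constraint the archive could check automatically."""
        return rights.public_domain or rights.commercial_use_allowed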
6. The archive should support 3rd-party-only data delivery.
Also in support of data not in the public domain, the
archive should allow data to be placed in the archive with the proviso
that, until some well-defined point in time, data access requests will
be redirected to the original provider or some other party that may
wish to apply additional access constraints.
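The behavior described above might reduce to something as simple as the
following sketch (the names and date handling are assumptions for
illustration):

    # Illustrative only: access requests are redirected to the original
    # provider until an agreed-upon date has passed.
    from datetime import date

    def resolve_access(item_id: str, redirect_until: date,
                       provider_url: str, archive_url: str) -> str:
        """Return the URL from which the item's data should be served."""
        if date.today() < redirect_until:
            # Before the agreed date, the provider (or another designated
            # party) serves the data and may apply its own constraints.
            return f"{provider_url}/{item_id}"
        return f"{archive_url}/items/{item_id}/data"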
7. The archive will accept only structured items that are defined by metadata.
To support identification, storage, and migration of, and access to,
geospatial data, the archive—and in particular, its "push"
interface—will accept only items defined by archive-approved
structural and descriptive metadata. Semi-structured and unstructured
items, such as hypertext and text documents, will be accommodated only
to the extent that sufficient metadata for such data types either
already exists or can be automatically constructed.
8. The archive will accept only items for which metadata is freely redistributable.
The archive architecture described below is fundamentally predicated
on using item metadata to advertise the existence of items, search for
items, and evaluate item appropriateness. Metadata that is encumbered
in any way breaks all these uses. At some point in the future the
project may want to distinguish between metadata that supports
discovery and evaluation versus "other" metadata, but for the first
year at least, the project will confine itself to public domain
metadata. Put another way, the burden is on providers to supply
suitable and sufficient, publicly available metadata.
9. The project will not develop an archival storage system.
Secure, long-term storage requires that bits be verified,
replicated at physically distant locations, synchronized, and
regularly migrated to new storage media. Both commercial and
non-commercial systems are readily available for performing these
tasks, and the project will build on them.
10. The archive will not support fine-grained authentication.
Given that the archive will be storing mostly public-domain data, and
will be operating on national data for the benefit of the nation at
large, there is no immediate need for the archive to support
fine-grained (i.e., item- and user-level) authentication and access
controls. This does not preclude archive-wide authentication or, as
mentioned above, modeling of intellectual property rights.
11. No data will be transferred to the Library of Congress.
Any such data transfers will be addressed in a subsequent project
year, by which time it is anticipated that the Library and the other
NDIIPP awardees will have agreed on the details of such transfers, the
conditions under which they occur, and the technical infrastructure by
which they're accomplished.
12. The archive will not check for duplicate items.
What to do about duplicate items in the archive—group them?
conflate them? consolidate them? nothing?—can be considered only
after the problem has been encountered in practice, if it ever is to
any significant extent.
In addition to the requirements and scope limitations above, the
following open issues have been identified but not yet resolved.
- Handling of services associated with archive content.
The Web's (i.e., HTTP's) only access mechanism—downloading data
in its entirety—is, when applied to geospatial data, in many
cases insufficient. In some cases, it is merely impractical to
download geospatial data because of its large size; in other cases,
geospatial data takes the form of a complex database that is neither
intended to be, nor useful when, downloaded in toto (cf. the next open
issue). For these reasons, geospatial data is often made available
via programmatic services (map services such as WMS,
gridded query services such as OPeNDAP, progressive image delivery
services such as MrSID,
etc.). But in the context of an archive, who runs such services? If
the services are hosted externally (e.g., by the original data
provider), how are archive data and services associated, and how is
synchronization maintained between the archive's copy of the data and
the data offered by the services? At a minimum, the archive will need
the ability to easily manage associated, changing services.
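At minimum, managing associated, changing services suggests recording
something like the following for each association (a sketch with
invented field names, not a project data model):

    # Hypothetical record associating an archived dataset with an externally
    # hosted access service, with enough information to detect drift between
    # the archive's copy of the data and the data behind the service.
    from dataclasses import dataclass

    @dataclass
    class ServiceAssociation:
        item_identifier: str       # the archived dataset
        service_type: str          # e.g., "WMS", "OPeNDAP", "MrSID"
        endpoint_url: str          # where the service is hosted (possibly external)
        data_checksum: str         # checksum of the archive's copy of the data
        last_verified: str         # when the service was last checked against that copy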
- Handling of database-like geospatial
data.
Some geospatial data takes the form of a relatively large number
(millions) of relatively small (kilobyte-size) pieces of data tied
together through an access service. Examples include long-term
instrument observations made available through a protocol such as
OPeNDAP and perhaps described by a THREDDS
catalog. Neither treating each observation as a separate archive item
nor, at the opposite extreme, somehow packaging the data into a single
download is likely to be a satisfactory approach to archiving such
data, because neither model conforms to how the data is described or
used in practice.
- Defining and representing item
semantics.
Requirement 1 above states that the archive must
explicitly represent item semantics in order to support use of archive
items arbitrarily far into the future. But an item's "semantics" can
be difficult to delimit and define. Consider an archive item that is
a Landsat image in TIFF format. It is clear that knowledge of the
TIFF format (what some would call the item's "structural metadata") is
necessary to use the item. But that level of "semantics" only
describes how the item can be unpacked and interpreted as an abstract
two-dimensional image, not what the pixels mean. How much knowledge of
Landsat sensor characteristics, the Landsat program, the satellite
itself, etc., is also necessary to interpret the item? How much of this
information should be stored in the archive? And how should it be
represented?
- Representing change over time.
Does the archive need to be able to represent and group together
multiple versions of items?
A preliminary archive architecture is illustrated below. Principal
areas of development are indicated in yellow.
An archive consists
of a suite of ingest/management services sitting in front of an
archival storage system. The ingest/management services
present a programmatic interface that data providers can use to "push"
data into the archive; these services may work cooperatively with
other services such as gazetteer, thesaurus, metadata mapping, and
file validation and conversion services. The archival storage
system is presumed to provide all functionality related to
secure, long-term, distributed storage of simple files in an
archive-local hierarchical namespace. On top of this basic
functionality, the ingest/management services define and enforce a
data model appropriate for geospatial data; assign and resolve
persistent names; perform any necessary validations and mappings;
store semantic definitions and explicitly associate archive content
with those definitions; and associate data with access services.
We emphasize that the ingest/management services are, as currently
envisioned, entirely automated: they simply give an accredited
provider the ability to ingest and manage data within the archive.
Human-mediated ingest functionality (quality control, manual cleanup
and preprocessing of metadata and data, curatorial approval and other
workflow management, etc.) is outside the scope of the current
architecture, but would naturally fit as additional tools in front of
the ingest/management services.
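A minimal sketch of how the automated ingest/management layer might sit
on top of the storage system follows; the required-field policy and
naming are assumptions of this sketch, not the project's design:

    # Illustrative only. `storage` is assumed to expose put(path, data) over
    # an archive-local hierarchical namespace, as described above.
    import json

    def ingest(storage, identifier: str, metadata: dict, data: bytes) -> str:
        """Validate, name, and store one item; return its persistent identifier."""
        required = ("title", "format")                  # assumed minimal policy
        missing = [f for f in required if f not in metadata]
        if missing:
            raise ValueError(f"metadata missing required fields: {missing}")
        # Byte-level persistence is delegated entirely to the storage system.
        storage.put(f"{identifier}/metadata.json", json.dumps(metadata).encode())
        storage.put(f"{identifier}/data", data)
        return identifier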
On the other side of the archive, mapping tools make
archive content accessible. Mapping to ADL involves mapping archive
metadata to the ADL metadata views, grouping content into collections,
and creating appropriate search indices.
The federation architecture is provided by ADL and, in particular,
by ADL's collection discovery services. Federation over OAI is easily
accomplished using off-the-shelf OAI aggregation tools.
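For example, a third-party aggregator would harvest the archive's
metadata with ordinary OAI-PMH requests such as the following (the base
URL is hypothetical; the verb and parameters are standard OAI-PMH):

    # Standard OAI-PMH ListRecords request against a hypothetical endpoint.
    import urllib.parse
    import urllib.request

    base_url = "https://archive.example.org/oai"        # hypothetical endpoint
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    with urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params)) as resp:
        records_xml = resp.read()   # one page of records; follow resumptionToken for more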
created 2004-12; last modified
2009-01-12 15:01