Resources, Versions, and URIs

Summary recommendation

Introduction

Maintaining persistent associations between URIs and web resources becomes increasingly challenging over time as resources get moved, renamed, and reorganized. Unfortunately, there appears to be no simple technical solution to this problem: persistence is an outcome, a byproduct, more than anything else, of awareness of the problem and a commitment to honoring old URIs. Nevertheless, choices about how resources are named and how URIs are assigned can make the associations more resilient to change.

For some types of resources that themselves change over time, maintenance of distinct, named resource versions and version history is necessary. Versioning is particularly important for science data, where knowledge of data versions is necessary to support provenance, and accessibility of specific versions is necessary to support reprocessing. The presence of resource versions, particularly if it is possible that older versions may cease to exist, compounds the general problem of persistence.

Two traditions, predating the Web, have emerged in the management of file-oriented science datasets such as those found in the Earth sciences: embedding version indicators in filenames, and putting files online via FTP servers. The filenames, being unique within the containing dataset, program, and provider, and typically being derived from intrinsic properties of the data and metadata, persist quite well, but the locations of the files inevitably change over time. These traditions support an informal kind of provenance: given a filename and knowledge of the containing dataset and provider, it is generally possible, with some web searches and detective work, to locate the file and/or related versions and relevant metadata. Provenance tracking in this environment is thus akin to locating a scholarly article given a textual citation found at the end of a printed publication: it can be done, but not automatically or reliably.

This document gives requirements for managing versioned resources and assigning URIs to support resource persistence and persistent, online provenance. If these requirements are implemented for file-oriented science data as recommended below, science data citations would resemble DOIs, which have supplanted textual citations as the means of easily, unambiguously, and persistently citing scholarly literature.

Caveat: there is no universally agreed-upon definition of what constitutes a "version" of a resource, or of whether and when two resources represent two versions of the same conceptual resource, nor is there a standard data model for versions. As a consequence, the requirements given here can be viewed equally well as defining a model of versions.

Requirements

Given a web resource having distinct, named versions, here are the minimum requirements to support persistence and provenance:

The rationale for these requirements is as follows.

Apache implementation

The simplest way to publish web resources is to expose a filesystem hierarchy to an HTTP server such as Apache, letting the server map filesystem directories to URL path components and file suffixes to MIME types. To this basic technical approach must be added two types of URL rewriting rules: one to redirect versionless URLs to the most recent version, and another to redirect URLs of nonexistent resource versions to a metadata page.

For example, suppose we are publishing a collection of HDF files with URLs of the form http://host/path/granule.date.version.hdf, where date is an 8-digit integer that changes between granules and version is a single-digit integer. Suppose further that the current version is 5, and that versions 1-3 no longer exist. Then the following rewrite rules suffice:

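RewriteEngine on
RewriteBase /path

# Versionless URL: internally rewrite to the current version (5).
RewriteRule ^(granule\.\d{8})\.hdf$ $1.5.hdf

# Versions 1-3 no longer exist: redirect permanently to the metadata page.
RewriteRule ^granule\.\d{8}\.[1-3]\.hdf$ \
  http://otherpath/metadata.html [redirect=permanent]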

The first rule adds the most recent version number (5) to a versionless URL in a manner that is entirely transparent to the client. Alternatively, the rewriting could be exposed to the client by returning a 302 Found or 307 Temporary Redirect status code. The second rule specifies that clients requesting versions 1-3 be permanently redirected to the web resource metadata.html.
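The combined effect of the two rules can also be sketched outside Apache. The following Python function is only an illustration of the logic, not a substitute for the server configuration: the names resolve, CURRENT_VERSION, RETIRED_VERSIONS, and METADATA_URL are our own, the input is assumed to be the request path relative to /path, and no check is made that the resulting file actually exists.

```python
import re

# Values taken from the example in the text; the names are illustrative.
CURRENT_VERSION = 5
RETIRED_VERSIONS = {1, 2, 3}
METADATA_URL = "http://otherpath/metadata.html"

VERSIONLESS = re.compile(r"(granule\.[0-9]{8})\.hdf")
VERSIONED = re.compile(r"granule\.[0-9]{8}\.([0-9])\.hdf")


def resolve(name):
    """Return (status, target) for a request path relative to /path."""
    m = VERSIONLESS.fullmatch(name)
    if m:
        # Rule 1: silently map a versionless name to the current version.
        name = f"{m.group(1)}.{CURRENT_VERSION}.hdf"
    m = VERSIONED.fullmatch(name)
    if m and int(m.group(1)) in RETIRED_VERSIONS:
        # Rule 2: retired versions redirect permanently to the metadata page.
        return 301, METADATA_URL
    # Anything else is served as-is (existence is not checked in this sketch).
    return 200, name
```

Under these assumptions, resolve("granule.20090101.hdf") yields the version 5 file with status 200, while resolve("granule.20090101.2.hdf") yields a 301 to the metadata page.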

Persistent identifier implementations

The previous section described a redirection implementation in which the resource URIs were simple URLs. For greater persistence it is often desirable to draw URIs from some kind of persistent identifier scheme that employs lookups and/or redirections to achieve resilience to location changes. The ability to add version-related redirections to such URIs is dependent on the identifier system and its capabilities and restrictions.

Resolving systems that operate on identifier prefixes only, leaving suffixes intact, provide no support for version-related redirections, but neither do they hinder them. The Name-to-Thing (N2T) Resolver is an exemplar of this type of service. It redirects naming authorities to hosts, e.g., redirecting http://n2t.info/13030/anything to http://www.cdlib.org/anything. Thus any additional redirections can be implemented by the local server or the next resolving system down the line.
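To make the prefix-only behavior concrete, here is a minimal Python sketch of such a resolver. It is not N2T's implementation: the AUTHORITY_HOSTS table and the resolve_prefix function are hypothetical, populated only with the single mapping from the example above, and query strings are ignored for simplicity.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: naming authority -> host that serves it.
AUTHORITY_HOSTS = {"13030": "http://www.cdlib.org"}


def resolve_prefix(url):
    """Swap the naming-authority prefix for a host; pass the suffix through."""
    path = urlsplit(url).path.lstrip("/")          # e.g. "13030/anything"
    authority, _, suffix = path.partition("/")
    base = AUTHORITY_HOSTS.get(authority)
    if base is None:
        return None                                # unknown naming authority
    # The suffix, including any version indicator, is left untouched.
    return f"{base}/{suffix}"
```

Because the suffix is copied verbatim, a versioned name such as granule.20090101.2.hdf reaches the downstream host unchanged, and that host remains free to apply its own version-related redirections.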

Persistent Uniform Resource Locators, or PURLs, similarly resolve prefixes only; specifically, the PURL resolving system operates on the longest matching prefix in its URL database. The HTTP status code to be returned during resolution can be specified as well, and hence our version-related redirections can be directly implemented within the PURL system. Creating PURLs for a collection of files as in the example in the previous section would require that each file (and perhaps each version) be individually registered (if the version-related redirections are to be performed by the PURL system and not a downstream local server). While not necessarily a performance issue, this may prove to be a significant management burden. However, we note that a regular-expression-based registration system for PURLs (i.e., the ability to define PURLs for URI patterns) has recently been proposed.
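Longest-matching-prefix resolution, and the registration burden it implies, can be sketched as follows. This is an illustration under stated assumptions, not the PURL software's actual interface: the REGISTRY table, its /data/ paths, and the resolve_purl function are hypothetical, and we assume that the unmatched remainder of the path is appended to the registered target.

```python
# Hypothetical registrations: PURL path prefix -> (HTTP status, target).
REGISTRY = {
    # One prefix registration covers every file that still exists.
    "/data/": (302, "http://host/path/"),
    # A retired version needs its own, longer registration.
    "/data/granule.20090101.2.hdf": (301, "http://otherpath/metadata.html"),
}


def resolve_purl(path):
    """Resolve a path using the longest registered prefix that matches it."""
    matches = [prefix for prefix in REGISTRY if path.startswith(prefix)]
    if not matches:
        return None
    prefix = max(matches, key=len)
    status, target = REGISTRY[prefix]
    # Append whatever part of the path the registered prefix did not cover.
    return status, target + path[len(prefix):]
```

In this sketch a single registration handles all current files, but every retired version of every granule requires an entry of its own, which is the management burden noted above.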

Digital Object Identifiers, or DOIs, resolve entire identifiers only. (The Handle System, on which the DOI system is based, does define a mechanism for passing along URI suffixes, but it is rather funky, and it is not clear that the DOI system has implemented it.) Nor is it possible to specify the HTTP status code to be returned during URI resolution. As a result, it is likely to be difficult to implement our version-related redirections using DOIs.

The ARK identifier system supports the apparency and aggregation requirements by defining a syntactic marker for version indicators and by mandating implicit relationships between versioned and versionless identifiers. The Noid software, which can be used as a resolver for ARK identifiers (and other types of identifiers as well), does not, as of this writing, support version-related redirections as described above, but such redirections can be set up as part of installing Noid as an Apache external redirection service.

Examples

Managing versioned resources is a common problem. Here's how it's been handled by others.

Further reading

In order of decreasing relevance...

Bruce R. Barkstrom (2003). Data Product Configuration Management and Versioning in Large-Scale Production of Satellite Scientific Data. Bernhard Westfechtel and André van der Hoek (eds.), Software Configuration Management (Springer LNCS 2649): 118-133. doi:10.1007/3-540-39195-9_9

Curt Tilmes (2009). Persistent Identifiers for Earth Science Provenance.

Curt Tilmes and Albert J. Fleig (2008). Provenance Tracking in an Earth Science Data Processing System. Second International Provenance and Annotation Workshop (IPAW 2008) (Salt Lake City, UT; June 17-18, 2008): 221-228. doi:10.1007/978-3-540-89965-5_23

David Adams, Srini Rajagopalan, and Paolo Calafiura (2001). Data history.

Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau (2007). The Requirements of Using Provenance in e-Science Experiments. Journal of Grid Computing 5(1) (March 2007): 1-25. doi:10.1007/s10723-006-9055-3

Geoffrey Clemm, Jim Amsden, Tim Ellison, Christopher Kaler, and Jim Whitehead (2002). Versioning Extensions to WebDAV (Web Distributed Authoring and Versioning). IETF RFC 3253.

Reidar Conradi and Bernhard Westfechtel (1998). Version Models for Software Configuration Management. ACM Computing Surveys 30(2) (June 1998): 232-282. doi:10.1145/280277.280280

created 2009-10-20; last modified 2012-05-07 11:31