Greg Janée
Resources, and Versions, and Identifiers! Oh, my!
The only constant is change.
—Heraclitus
Data publication, management, and citation would all be so much
easier if data never changed, or at least, if it never changed after
publication. But as the Greeks observed so long ago, change is here
to stay. We must accept that data will change, and given that fact,
we are probably better off embracing change rather than avoiding it.
Because the very essence of data citation is identifying what was
referenced at the time it was referenced, we need to be able to put a
name on that referenced quantity, which leads to the requirement of
assigning named versions to data. With versions we are providing the
x that enables somebody to say, "I used version x of
dataset y."
Since versions are ultimately names, the problem of defining
versions is inextricably bound up with the general problem of
identification. Key questions that must be asked when addressing data
versioning and identification include:
- What is being identified by a version? This can be a surprisingly
subtle question. Is a particular set of bits being identified? A
conceptual quantity (to use FRBR
terms, an expression or manifestation)? A location? A conceptual
quantity at a location? For a resource that changes rapidly or
predictably, such as a data stream that accumulates over time, it will
probably be necessary to address the structure of the stream
separately from the content of the stream, and to support versions
and/or citation mechanisms that allow the state of the stream to be
characterized at the time of reference. In any case, the answer to
the question of what is being identified will greatly impact both what
constitutes change (and therefore what constitutes a version) and the
appropriateness of different identifier technologies for identifying
those versions.
- When does a change constitute a new version? Always? Even when
only a typographical error is being corrected? Or, in a hypertext
document, when updating a broken hyperlink? (This is a particularly
difficult case, since updating a hyperlink requires updating the
document, of course, but a URL is really a property of the identifiee,
not the identifier.) In the case of a science dataset, does changing
the format of the data constitute a new version? Reorganizing the
data within a format (e.g., changing from row-major to column-major
order)? Re-computing the data on different floating-point hardware?
Versions are often divided into "major" versions and "minor" versions
to help characterize the magnitude and backward compatibility of
changes.
- Is each version an independent resource? Or is there one resource
that contains multiple versions? This may seem a purely semantic
distinction, but the question has implications for how the resource is
managed in practice. The W3C
struggled with this question in identifying the HTML specification.
It could have created one HTML resource with many versions (3.2, 4.01,
5, ...), but for manageability it settled on calling HTML3 one
resource (with versions such as 3.2), HTML4 a separate resource
(with versions 4.0 and 4.01), and continuing on to HTML5
as yet another resource.
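To make the first two questions concrete, here is a minimal sketch of how the answer to "what is being identified?" decides whether a reorganization counts as a change. It assumes a small table of floating-point values; the function names (serialize, bit_fingerprint, content_fingerprint) and the choice of SHA-256 are ours, for illustration only, and are not part of any identifier standard.

```python
import hashlib
import struct


def serialize(table, order):
    """Pack a 2-D table of floats as little-endian doubles in the given order."""
    lines = table if order == "row-major" else zip(*table)
    return b"".join(struct.pack("<d", v) for line in lines for v in line)


def bit_fingerprint(blob):
    """Identify a particular set of bits: any byte-level change alters it."""
    return hashlib.sha256(blob).hexdigest()


def content_fingerprint(blob, order, n_rows, n_cols):
    """Identify the values themselves: decode, re-serialize in a canonical
    (row-major) order, then hash. Storage order no longer matters."""
    flat = struct.unpack(f"<{n_rows * n_cols}d", blob)
    if order == "row-major":
        table = [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    else:
        # Column-major: element (r, c) sits at index c * n_rows + r.
        table = [flat[r::n_rows] for r in range(n_rows)]
    return bit_fingerprint(serialize(table, "row-major"))


table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
row_blob = serialize(table, "row-major")
col_blob = serialize(table, "column-major")

# Same values, different bits: a new version under one definition only.
bits_changed = bit_fingerprint(row_blob) != bit_fingerprint(col_blob)
content_changed = (content_fingerprint(row_blob, "row-major", 2, 3)
                   != content_fingerprint(col_blob, "column-major", 2, 3))
```

If versions identify bits, the column-major copy is a new version; if they identify the conceptual content, it is not. Neither answer is wrong, but the choice has to be made and documented.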
So far we have only raised questions, and that's the nature of
dealing with versions: the answers tend to be very
situation-specific. Fortunately, some broad guidelines have
emerged:
- Assign an identifier to each version to support identification and
citation.
- Assign an identifier to the resource as a whole, that is, to the
resource without considering any particular version of the resource.
There are many situations where it is desirable to be able to make a
version-agnostic reference. Consider that, in the text above, we were
able to refer to something called "HTML4" without having to name any
particular version of that resource. What if that were not
possible?
- Provide linkages between the versions, and between the versions
and the resource as a whole.
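One way to picture these three guidelines together is as a small data model. The sketch below is only an illustration under our own assumptions: the class names, field names, and the "id:..." identifiers are hypothetical and deliberately opaque, so that nothing about how identifiers are assigned is implied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    identifier: str                    # identifier for this specific version
    label: str                         # human-readable version name, e.g. "1"
    resource_id: str                   # link up to the resource as a whole
    previous_id: Optional[str] = None  # link to the preceding version, if any
    next_id: Optional[str] = None      # link to the following version, if any


@dataclass
class Resource:
    identifier: str                    # version-agnostic identifier
    versions: List[Version] = field(default_factory=list)

    def add_version(self, identifier: str, label: str) -> Version:
        """Append a version and wire up the links in both directions."""
        previous = self.versions[-1] if self.versions else None
        version = Version(
            identifier,
            label,
            self.identifier,
            previous_id=previous.identifier if previous else None,
        )
        if previous is not None:
            previous.next_id = version.identifier
        self.versions.append(version)
        return version

    def latest(self) -> Optional[Version]:
        """Return the most recently added version, or None if there is none."""
        return self.versions[-1] if self.versions else None


# Hypothetical identifiers, chosen to carry no visible structure.
dataset = Resource("id:dataset-y")
v1 = dataset.add_version("id:a1b2", "1")
v2 = dataset.add_version("id:c3d4", "2")
```

A citation can then point at dataset.identifier when it means the dataset in general, or at v1.identifier when it means exactly what was used.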
These guidelines still leave the question of how to actually assign
identifiers to versions unanswered. One approach is to assign a
different, unrelated identifier to each version. For example,
doi:10.1234/FOO might refer to version 1 of a resource and
doi:10.5678/BAR to version 2. Linkages, stored in the resource
versions themselves or externally in a database, can record the
relationships between these identifiers. This approach may be
appropriate in many cases, but it should be recognized that it places
a burden on both the resource maintainer (every link that must be
maintained represents a breakage point) and the user (there is no easily
visible or otherwise obvious relationship between the identifiers).
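As a rough sketch of what this first approach implies, suppose the linkages live in an external table. The table layout, the resource name "dataset-y", and the function names below are assumptions made for illustration; the two DOIs are the placeholder identifiers from the text, not real registrations.

```python
# Hypothetical external link table. Every row is a relationship that
# somebody has to create and keep correct.
LINKS = {
    "doi:10.1234/FOO": {"resource": "dataset-y", "version": 1},
    "doi:10.5678/BAR": {"resource": "dataset-y", "version": 2},
}


def versions_of(resource, links=LINKS):
    """Return the identifiers of all known versions, oldest first."""
    found = [
        (record["version"], identifier)
        for identifier, record in links.items()
        if record["resource"] == resource
    ]
    return [identifier for _, identifier in sorted(found)]


def same_resource(id_a, id_b, links=LINKS):
    """True only if both identifiers are in the table and share a resource."""
    a, b = links.get(id_a), links.get(id_b)
    return a is not None and b is not None and a["resource"] == b["resource"]
```

Nothing in the identifiers themselves relates the two versions: if a row is missing or stale, same_resource simply returns False and the relationship is lost.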
Another approach is to syntactically encode version information in the
identifiers. With this approach, we might start with doi:10.1234/FOO
as a base identifier for the resource, and then append version
information in a visually apparent way. For example,
doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to
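version 2, and so forth. And in a logical extension we could then
treat the version-less identifier doi:10.1234/FOO as identifying the
resource as a whole. The arXiv preprint service follows essentially this
pattern with its own identifiers: for example, arXiv:0706.0001v1 and
arXiv:0706.0001v2 name specific versions of a paper, while the
version-less arXiv:0706.0001 refers to the paper as a whole and
resolves to its latest version.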
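With a syntactic convention, the relationship between identifiers can be computed rather than looked up. The sketch below assumes the "/vN" suffix used in the examples above; that suffix is a convention of this illustration, not something the DOI system itself defines, and the function names are ours.

```python
import re

# Assumed convention: a version is a trailing "/v" followed by a positive integer.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+)/v(?P<version>[1-9][0-9]*)$")


def split_identifier(identifier):
    """Return (base, version); version is None for a version-less identifier."""
    match = _VERSION_SUFFIX.match(identifier)
    if match is None:
        return identifier, None
    return match.group("base"), int(match.group("version"))


def with_version(base, version):
    """Build the identifier for a specific version of a resource."""
    return f"{base}/v{version}"


def same_resource(id_a, id_b):
    """Two identifiers name the same resource if their bases match."""
    return split_identifier(id_a)[0] == split_identifier(id_b)[0]
```

For instance, split_identifier("doi:10.1234/FOO/v2") yields the base doi:10.1234/FOO and the version number 2, with no link table involved. The price is that the convention must be applied consistently, since any identifier that happens to end in "/vN" will be read as a version.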
Resources, versions, identifiers, citations: the issues they
present tend to get bound up in a Gordian knot. Oh, my!
Further reading
- ESIP: Interagency Data Stewardship/Citations/Provider Guidelines
- DCC: "Cite Datasets and Link to Publications" How-to Guide
- Resources, Versions, and URIs
created 2012-05-15; last modified 2012-05-15 22:33