An Experiment in Metadata Mapping

The sample Bucket99 configuration files represent an experiment in forming an ADL collection by directly mapping existing metadata to the ADL buckets. That is, given a set of items having item-level metadata, we form the collection by mapping the metadata to the ADL buckets with little to no manipulation; the only work required is deciding which metadata fields map to which buckets. In this particular experiment the mapping and configuration were done manually, but we anticipate future components that will build collections through entirely automated metadata harvesting, mapping, and ingest processes.

The collection in this experiment was a set of 2,851 USGS DRGs having FGDC metadata. The item-level FGDC metadata was derived from a single, comprehensive series-level FGDC record for the DRGs combined with a database of 13 small, item-level fields. This is perhaps an unfair experiment, since other collections such as the DLESE collections feature metadata that is truly item-level. Nevertheless, this is the type of collection that ADL has historically targeted and will continue to target.

The following are some problems encountered.

1. Mapping entire metadata text can lead to false hits

Directly mapping entire metadata fields to textual buckets can cause false hits because the text may contain words that, in the context of discovery, are misleading. For example, the FGDC Abstract field for the DRGs contains this sentence:

The DRG can be used to collect, review, and revise other digital data, especially digital line graphs (DLG).

This is certainly appropriate metadata for a DRG, but when the text is mapped in its entirety to the adl:subject-related-text bucket, the result is that a search for the phrase "digital line graphs" will return every DRG. This problem would seem to be a fundamental limitation of automated metadata mapping.
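
The effect can be reproduced with a minimal Python sketch. It assumes, purely for illustration, that the textual bucket is searched by case-insensitive substring matching and that every item carries the same series-derived abstract. The item identifiers and the search helper are hypothetical and are not part of ADL.

```python
# Hypothetical illustration: every DRG inherits the same abstract text,
# so a phrase search over the full mapped text matches every item.
ABSTRACT = (
    "The DRG can be used to collect, review, and revise other digital data, "
    "especially digital line graphs (DLG)."
)

# Three made-up item identifiers standing in for the 2,851 DRGs.
items = {ocode: ABSTRACT for ocode in ("drg-0001", "drg-0002", "drg-0003")}


def search(phrase, bucket_text):
    """Return the identifiers whose mapped text contains the phrase."""
    needle = phrase.lower()
    return sorted(o for o, text in bucket_text.items() if needle in text.lower())


hits = search("digital line graphs", items)
print(len(hits), "of", len(items), "items matched")
```

Because the matching text is identical for every item, no ranking or filtering applied after the mapping can tell a true hit from a false one.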

2. Mapping series-level metadata text can lead to false hits

As mentioned above, the metadata for this collection was largely derived from series-level metadata, not metadata that is truly specific to the individual items. This leads to another kind of inappropriate text. The FGDC Abstract field also contains the sentence:

The USGS is producing DRG's of the 1:24,000-, 1:24,000/1:25,000-, 1:63,360- (Alaska), 1:100,000-, and 1:250,000-scale topographic map series.

Mapping this sentence to the adl:subject-related-text bucket means that a search for "Alaska" will return every DRG. This is a specific instance of a more general problem: there is no mechanism in ADL for representing series of items.

3. Poor support for metadata field URIs

In bucket mappings, source metadata fields are identified by URIs. Dublin Core has assigned URIs to its metadata fields (e.g., http://purl.org/dc/elements/1.1/creator for the Creator element), but it's probably safe to say that Dublin Core is the exception and not the rule. The URIs for the FGDC fields in this experiment were our invention. They use the tag URI scheme and refer to FGDC fields by number (1.4.2, for example), which is not an FGDC-sanctioned practice.
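
The convention used in this experiment can be captured in a small helper. This is only a sketch: the function name fgdc_field_uri is hypothetical, and the prefix is copied from the URIs that appear in the configuration examples in this document, not from any FGDC specification. Joining several field numbers with a slash mirrors the form of the 1.1/8.4 URI used for titles; that reading of the form is an assumption.

```python
# Prefix copied from the URIs used in this experiment's configuration;
# it is our invention, not an FGDC-sanctioned identifier.
FGDC_TAG_PREFIX = "tag:fgdc.gov,2003:csdgm/"


def fgdc_field_uri(*field_numbers):
    """Build the experiment's tag URI from one or more dotted field numbers."""
    if not field_numbers:
        raise ValueError("at least one FGDC field number is required")
    for number in field_numbers:
        # Each number must look like "1.5.1": digits separated by dots.
        if not all(part.isdigit() for part in number.split(".")):
            raise ValueError("not a dotted FGDC field number: %r" % number)
    return FGDC_TAG_PREFIX + "/".join(field_numbers)


print(fgdc_field_uri("1.5.1"))       # tag:fgdc.gov,2003:csdgm/1.5.1
print(fgdc_field_uri("1.1", "8.4"))  # tag:fgdc.gov,2003:csdgm/1.1/8.4
```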

4. Configuration problems

Duplication of information. A general problem with ADL's configuration mechanism (not specific to metadata mapping) is the overlap and outright duplication between the bucket configuration related to report generation and that related to query translation. Consider the configuration for just the adl:geographic-locations bucket. Here's the report generation query and template:

SELECT n_b_coord, s_b_coord, e_b_coord, w_b_coord
  FROM
    ingest
  WHERE
    ocode = ?

<bucket name="adl:geographic-locations">
  <spatial-value>
    <field uri="tag:fgdc.gov,2003:csdgm/1.5.1"
      name="[FGDC] Bounding Coordinates"/>
    <box>
      <north>$main.n_b_coord$</north>
      <south>$main.s_b_coord$</south>
      <east>$main.e_b_coord$</east>
      <west>$main.w_b_coord$</west>
    </box>
  </spatial-value>
</bucket>

And here's the query translator configuration:

"adl:geographic-locations" : UT.Bucket(
    "spatial",
    UT.standardSpatialOperators,
    P.Adaptor_Constant(
        "tag:fgdc.gov,2003:csdgm/1.5.1",
        P.Spatial_BoxCoordinatesNoCrossing(
            "ingest",
            "ocode",
            "n_b_coord",
            "s_b_coord",
            "e_b_coord",
            "w_b_coord",
            UT.Cardinality("1"))))

Sub-buckets cause duplication problems of a different kind. Because the Python-based universal query translator has no intrinsic support for sub-buckets, such bucket relationships must be implemented manually. In particular, the configurations for the adl:titles and adl:assigned-terms buckets must be repeated (but not exactly duplicated, due to slight syntactic differences) in the configuration for the adl:subject-related-text bucket. A partial example of this misery is shown below.

"adl:titles" : UT.Bucket(
    "textual",
    UT.standardTextualOperators,
    P.Adaptor_Constant(
        "tag:fgdc.gov,2003:csdgm/1.1/8.4",
        P.Textual_LikeSubstring(...)))

"adl:subject-related-text" : UT.Bucket(
    "textual",
    UT.standardTextualOperators,
    P.Adaptor_Concatenation({
        "tag:fgdc.gov,2003:csdgm/1.1/8.4" :
        P.Textual_LikeSubstring(...),
        ...}))

These duplication problems will be solved (or rather, hidden) by the universal collection driver because, at least for the collections under its control, configuration will be largely automated and therefore shielded from users. A solution for manually configured collections could involve some kind of meta-configuration file.
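
As a rough sketch of what such a meta-configuration might look like, the Python below keeps the facts about the adl:geographic-locations bucket in one dictionary and renders both hand-written configurations from it as text. The dictionary layout and the three render functions are assumptions made for this illustration. The UT and P names are only reproduced as text from the query translator example above, and nothing here is part of ADL or of the universal collection driver.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical single description of one spatial bucket. The values come
# from the adl:geographic-locations example; the layout is invented.
SPATIAL_BUCKET = {
    "bucket": "adl:geographic-locations",
    "field_uri": "tag:fgdc.gov,2003:csdgm/1.5.1",
    "field_name": "[FGDC] Bounding Coordinates",
    "query_name": "main",  # the template refers to columns as $main.<column>$
    "table": "ingest",
    "id_column": "ocode",
    # Order matters: north, south, east, west.
    "columns": ["n_b_coord", "s_b_coord", "e_b_coord", "w_b_coord"],
}


def render_report_query(spec):
    """Emit the report generation SQL as text."""
    return "SELECT %s\n  FROM\n    %s\n  WHERE\n    %s = ?" % (
        ", ".join(spec["columns"]), spec["table"], spec["id_column"])


def render_report_template(spec):
    """Emit the report generation bucket template as text."""
    lines = [
        "<bucket name=%s>" % quoteattr(spec["bucket"]),
        "  <spatial-value>",
        "    <field uri=%s" % quoteattr(spec["field_uri"]),
        "      name=%s/>" % quoteattr(spec["field_name"]),
        "    <box>",
    ]
    for tag, column in zip(("north", "south", "east", "west"), spec["columns"]):
        lines.append("      <%s>$%s.%s$</%s>" % (tag, spec["query_name"], column, tag))
    lines += ["    </box>", "  </spatial-value>", "</bucket>"]
    return "\n".join(lines)


def render_translator_entry(spec):
    """Emit the query translator entry as text (it is not evaluated here)."""
    args = [spec["table"], spec["id_column"]] + spec["columns"]
    lines = [
        '"%s" : UT.Bucket(' % spec["bucket"],
        '    "spatial",',
        "    UT.standardSpatialOperators,",
        "    P.Adaptor_Constant(",
        '        "%s",' % spec["field_uri"],
        "        P.Spatial_BoxCoordinatesNoCrossing(",
    ]
    lines += ['            "%s",' % arg for arg in args]
    lines.append('            UT.Cardinality("1"))))')
    return "\n".join(lines)


print(render_report_query(SPATIAL_BUCKET))
print(render_report_template(SPATIAL_BUCKET))
print(render_translator_entry(SPATIAL_BUCKET))
```

With this arrangement the table, column, and field URI names are each written once, so the report and query translation configurations cannot drift apart. It does nothing for the sub-bucket repetition, which would need the description to record bucket relationships as well.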

Poor support for vocabularies. Supporting hierarchical buckets (specifically, adl:types and adl:formats) will almost always be complicated by the fact that vocabulary terms used in the native metadata must be mapped to the terms used by the bucket. For heterogeneous collections, dynamic mapping mechanisms such as the Adaptor_TermMapping query translation paradigm must be used. The collection of DRGs in this experiment is an example of an entirely homogeneous collection, and hence represents a slightly easier configuration case because type and format terms can be specified as collection-wide constants. Nevertheless, configuration support is lacking because there is no validation of terms (i.e., no verification that the terms used in the collection configuration are indeed vocabulary terms) and no check that the terms used in bucket reports agree with those used in query translation. Furthermore, in configuring query translation, a term must be listed with all its broader terms. That is, thesaurus relationships must be replicated inside configuration files, and there is no enforcement by the thesaurus that its relationships are respected.
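
The two missing checks, term validation and expansion to broader terms, are simple to state in code. The sketch below assumes a thesaurus represented as a mapping from each term to its single broader term. The term names are invented for illustration and are not the actual ADL vocabulary, and the function is hypothetical rather than an existing ADL facility.

```python
# Hypothetical fragment of a type thesaurus: each term maps to its broader
# term, or to None at the top of the hierarchy.
BROADER = {
    "cartographic works": None,
    "maps": "cartographic works",
    "topographic maps": "maps",
}


def with_broader_terms(term, broader=BROADER):
    """Validate a term and return it followed by all of its broader terms."""
    chain = []
    while term is not None:
        if term not in broader:
            raise ValueError("not a vocabulary term: %r" % term)
        if term in chain:
            raise ValueError("cycle in thesaurus at %r" % term)
        chain.append(term)
        term = broader[term]
    return chain


print(with_broader_terms("topographic maps"))
```

Deriving both the bucket report terms and the query translation term lists from one such function would let the thesaurus, rather than the configuration author, decide which broader terms accompany a term.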

created 2004-10-01; last modified 2009-11-20 09:39