The sample Bucket99 configuration files represent an experiment in forming an ADL collection by directly mapping existing metadata to the ADL buckets. That is, given a set of items having item-level metadata, we form the collection by mapping the metadata to the ADL buckets with little to no manipulation; the only work required is deciding which metadata fields map to which buckets. In this particular experiment the mapping and configuration were done manually, but we anticipate future components that will build collections through fully automated metadata harvesting, mapping, and ingest.
The collection in this experiment was a set of 2,851 USGS DRGs having FGDC metadata. The item-level FGDC metadata was derived from a single, comprehensive series-level FGDC record for the DRGs combined with a database of 13 small, item-level fields. This is perhaps an unfair experiment, since other collections, such as the DLESE collections, feature metadata that is truly item-level. Nevertheless, this is a type of collection that ADL has historically targeted and will continue to target.
The following are some problems encountered.
Directly mapping entire metadata fields to textual buckets can cause false hits because the text may contain words that, in the context of discovery, are misleading. For example, the FGDC Abstract field for the DRGs contains this sentence:
The DRG can be used to collect, review, and revise other digital data, especially digital line graphs (DLG).
This is certainly appropriate metadata for a DRG, but when the text
is mapped in its entirety to the adl:subject-related-text
bucket, the result is that a search for the phrase "digital line
graphs" will return every DRG. This problem would seem to be
a fundamental limitation of automated metadata mapping.
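The effect can be reproduced in miniature. The following is a sketch only: the item identifiers and titles are invented, and the substring search merely stands in for whatever matching the real query translator performs.

```python
# Every item inherits the same series-level abstract (quoted from the text
# above), so a phrase search on the mapped text cannot discriminate.
SERIES_ABSTRACT = (
    "The DRG can be used to collect, review, and revise other digital data, "
    "especially digital line graphs (DLG)."
)

# Illustrative items; identifiers and titles are hypothetical.
items = {
    "drg-001": {"title": "Quadrangle A", "abstract": SERIES_ABSTRACT},
    "drg-002": {"title": "Quadrangle B", "abstract": SERIES_ABSTRACT},
}


def subject_related_text(item):
    """Whole-field mapping: title and abstract enter the bucket unmodified."""
    return " ".join([item["title"], item["abstract"]])


def search(phrase):
    """Case-insensitive substring match over the mapped bucket text."""
    phrase = phrase.lower()
    return sorted(
        item_id
        for item_id, item in items.items()
        if phrase in subject_related_text(item).lower()
    )
```

Here `search("digital line graphs")` matches every item, while `search("Quadrangle A")` matches only the one item whose own metadata contains the phrase.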
As mentioned above, the metadata for this collection was largely derived from series-level metadata, not metadata that is truly specific to the individual items. This leads to another kind of inappropriate text. The FGDC Abstract field also contains the sentence:
The USGS is producing DRG's of the 1:24,000-, 1:24,000/1:25,000-, 1:63,360- (Alaska), 1:100,000-, and 1:250,000-scale topographic map series.
Mapping this sentence to the adl:subject-related-text
bucket means that a search for "Alaska" will return every
DRG. This is a specific instance of a more general problem: there is
no mechanism in ADL for representing series of items.
In bucket mappings, source metadata fields are identified by URIs.
Dublin Core has assigned URIs to its metadata fields (e.g.,
http://purl.org/dc/elements/1.1/creator for the Creator
element), but it's probably safe to say that Dublin Core is the
exception and not the rule. The URIs for the FGDC fields in this
experiment were our invention. They use the tag URI
scheme and refer to FGDC fields by number (1.4.2, for example),
which is not an FGDC-sanctioned practice.
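As an illustration of the naming convention, a URI of this form could be built by a small helper. The function name and the validation pattern are assumptions made for this sketch; only the `tag:fgdc.gov,2003:csdgm/` prefix and the numbering style come from the configuration in this experiment.

```python
import re

# Prefix as it appears in this experiment's configuration files.
TAG_PREFIX = "tag:fgdc.gov,2003:csdgm/"

# Dotted numbers, optionally chained with "/" (e.g. "1.5.1" or "1.1/8.4").
# This pattern is an assumption inferred from the URIs shown in this document.
_FIELD_NUMBER = re.compile(r"\d+(\.\d+)*(/\d+(\.\d+)*)*")


def fgdc_field_uri(field_number):
    """Build a tag URI for an FGDC field number (hypothetical helper)."""
    if not _FIELD_NUMBER.fullmatch(field_number):
        raise ValueError("not an FGDC-style field number: %r" % field_number)
    return TAG_PREFIX + field_number
```

For example, `fgdc_field_uri("1.5.1")` yields `tag:fgdc.gov,2003:csdgm/1.5.1`, the URI used for the bounding coordinates field.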
Duplication of information. A general problem with
ADL's configuration mechanism (not specific to metadata mapping) is
the overlap and outright duplication between the bucket configuration
related to report generation and that related to query translation.
Consider the configuration for just the
adl:geographic-locations bucket. Here's the report
generation query and template:
SELECT n_b_coord, s_b_coord, e_b_coord, w_b_coord
FROM ingest
WHERE ocode = ?

<bucket name="adl:geographic-locations">
  <spatial-value>
    <field uri="tag:fgdc.gov,2003:csdgm/1.5.1"
           name="[FGDC] Bounding Coordinates"/>
    <box>
      <north>$main.n_b_coord$</north>
      <south>$main.s_b_coord$</south>
      <east>$main.e_b_coord$</east>
      <west>$main.w_b_coord$</west>
    </box>
  </spatial-value>
</bucket>
And here's the query translator configuration:
"adl:geographic-locations" : UT.Bucket(
    "spatial",
    UT.standardSpatialOperators,
    P.Adaptor_Constant(
        "tag:fgdc.gov,2003:csdgm/1.5.1",
        P.Spatial_BoxCoordinatesNoCrossing(
            "ingest",
            "ocode",
            "n_b_coord",
            "s_b_coord",
            "e_b_coord",
            "w_b_coord",
            UT.Cardinality("1"))))
Sub-buckets cause duplication problems of a different kind.
Because the Python-based universal query translator has no intrinsic
support for sub-buckets, such bucket relationships must be implemented
manually. In particular, the configurations for the
adl:titles and adl:assigned-terms buckets
must be repeated (but not exactly duplicated, due to slight syntactic
differences) in the configuration for the
adl:subject-related-text bucket. A partial example of
this misery is shown below.
"adl:titles" : UT.Bucket(
    "textual",
    UT.standardTextualOperators,
    P.Adaptor_Constant(
        "tag:fgdc.gov,2003:csdgm/1.1/8.4",
        P.Textual_LikeSubstring(...)))

"adl:subject-related-text" : UT.Bucket(
    "textual",
    UT.standardTextualOperators,
    P.Adaptor_Concatenation({
        "tag:fgdc.gov,2003:csdgm/1.1/8.4" :
            P.Textual_LikeSubstring(...),
        ...}))
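One way the repetition might be avoided is to derive the parent bucket's field mappings from those of its sub-buckets. The sketch below does not use the real translator classes: buckets are reduced to plain dictionaries from field URI to a column name, and the column names, the `derive_parent` function, and every field URI other than the title URI are hypothetical.

```python
# Simplified stand-in for translator configuration: field URI -> column name.
# Only the adl:titles URI is taken from the example above; the rest is invented.
child_buckets = {
    "adl:titles": {"tag:fgdc.gov,2003:csdgm/1.1/8.4": "title"},
    "adl:assigned-terms": {"tag:fgdc.gov,2003:csdgm/1.6.1.2": "theme_keyword"},
}


def derive_parent(children, own_fields=None):
    """Merge the field mappings of all sub-buckets into one parent mapping."""
    merged = dict(own_fields or {})
    for bucket_name in sorted(children):
        for uri, column in children[bucket_name].items():
            # Refuse to silently overwrite a field mapped two different ways.
            if uri in merged and merged[uri] != column:
                raise ValueError("conflicting mapping for %s" % uri)
            merged[uri] = column
    return merged


# The parent bucket adds its own fields on top of everything its children map.
subject_related_text = derive_parent(
    child_buckets,
    own_fields={"tag:fgdc.gov,2003:csdgm/1.2.1": "abstract"},  # illustrative
)
```

With this arrangement each field is declared once, in the most specific bucket that uses it, and the parent's configuration is computed rather than retyped.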
These problems of duplication of information will be solved (or rather, hidden) by the universal collection driver because, for at least those collections under its control, configuration will be largely automated and therefore shielded from users. A solution for manually configured collections could involve some kind of meta-configuration file.
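To make the meta-configuration idea concrete, here is a minimal sketch in which the adl:geographic-locations mapping is described once and both the report query and the translator arguments are generated from it. The table, column, and URI values are those of the configuration shown earlier; the dictionary layout and the three generator functions are assumptions, not an existing ADL facility.

```python
# Single-source description of the spatial mapping (layout is hypothetical).
SPATIAL_SPEC = {
    "bucket": "adl:geographic-locations",
    "field_uri": "tag:fgdc.gov,2003:csdgm/1.5.1",
    "field_name": "[FGDC] Bounding Coordinates",
    "table": "ingest",
    "id_column": "ocode",
    "columns": {
        "north": "n_b_coord",
        "south": "s_b_coord",
        "east": "e_b_coord",
        "west": "w_b_coord",
    },
}
SIDES = ("north", "south", "east", "west")


def report_query(spec):
    """Generate the report-generation SQL query from the spec."""
    columns = ", ".join(spec["columns"][side] for side in SIDES)
    return "SELECT %s FROM %s WHERE %s = ?" % (
        columns, spec["table"], spec["id_column"])


def report_box(spec, query_name="main"):
    """Generate the inner lines of the <box> element of the report template."""
    return "\n".join(
        "<%s>$%s.%s$</%s>" % (side, query_name, spec["columns"][side], side)
        for side in SIDES
    )


def translator_args(spec):
    """Table, id column, then N/S/E/W columns, as in the translator example."""
    return (spec["table"], spec["id_column"]) + tuple(
        spec["columns"][side] for side in SIDES)
```

Because both outputs come from `SPATIAL_SPEC`, renaming a column would require a change in one place only.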
Poor support for vocabularies. Supporting
hierarchical buckets (specifically, adl:types and
adl:formats) will almost always be complicated by the
fact that vocabulary terms used in the native metadata must be mapped
to the terms used by the bucket. For heterogeneous collections,
dynamic mapping mechanisms such as the Adaptor_TermMapping
query translation paradigm must be used. The collection of DRGs in
this experiment is an example of an entirely homogeneous collection,
and hence represents a slightly easier configuration case because type
and format terms can be specified as collection-wide constants.
Nevertheless, configuration support is lacking because there is no
validation of terms (i.e., no verification that the terms used in the
collection configuration are indeed vocabulary terms) and no check
that the terms used in bucket reports agree with those used in query
translation. Furthermore, in configuring query translation, a term
must be listed with all its broader terms. That is, thesaurus
relationships must be replicated inside configuration files, and there
is no enforcement by the thesaurus that its relationships are
respected.
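The missing checks could look something like the following sketch. The thesaurus here is a toy dictionary with invented terms (not actual ADL vocabulary), and both functions are hypothetical; the point is only that broader terms can be computed from the thesaurus rather than retyped, and that report and translation terms can be compared mechanically.

```python
# Toy thesaurus: each term maps to its immediate broader term (None = top).
# Terms are invented for illustration.
BROADER = {
    "documents": None,
    "maps": "documents",
    "topographic maps": "maps",
}


def with_broader_terms(term, broader=BROADER):
    """Return the term followed by all of its broader terms, nearest first."""
    if term not in broader:
        raise ValueError("not a vocabulary term: %r" % term)
    chain = []
    while term is not None:
        if term in chain:
            raise ValueError("cycle in thesaurus at %r" % term)
        chain.append(term)
        term = broader[term]
    return chain


def check_agreement(report_term, translation_terms, broader=BROADER):
    """True if translation lists exactly the report term plus its broader terms."""
    expected = set(with_broader_terms(report_term, broader))
    return set(translation_terms) == expected
```

A configuration tool built along these lines could reject an unknown term at configuration time and flag a query translation entry that omits a broader term.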
created 2004-10-01; last modified 2009-11-20 09:39