This is a tutorial introduction to the ADL metadata mapping language.
Syntax definitions look like this:
definition
Examples look like this:
example
ADL_mapper consists of a directory (mm) of Python modules. New mappings can be developed right in the mm directory, or the ADL_mapper modules can be placed in any directory in the Python module path.
Mappings developed using ADL_mapper are written in the Python scripting language. As will be seen, mappings are largely declarative in nature, and in simple cases can involve no procedural code at all. But in all cases some Python knowledge is required of the mapping developer, at minimum the rules of Python syntax. The Python tutorial is recommended background reading.
A mapping is a Python module having the overall structure:
from ADL_mapper import *
input()
statements
output()
The import statement loads the infrastructure that
supports the mapping language. The input statement
processes command-line arguments and loads the source metadata. The
output statement performs all processing (mapping,
conversion, validation, etc.) and then outputs the mapped metadata,
which currently consists of the ADL bucket view only. In between the
input and output statements may be placed,
in addition to arbitrary procedural code, a number of different kinds
of statements that govern the mapping and other processing to take
place. Strictly speaking, these "statements" are Python function
calls (just as the input and output
statements are), but because these calls are declarative in nature and
can generally be placed in any order, we refer to them as
(declarative) statements in this document.
A mapping is invoked from the command line using a command of the form:
python mapping [-t] [-D param=value] [input-file]
The source metadata, which must be an XML document, is read from
input-file if specified or standard input otherwise. The
mapped ADL bucket view is written to standard output. The
-D option can be used as many times as desired to specify
parameter values that can be retrieved from within the mapping using
the getParam statement. The
-t option enables error tracebacks.
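For example, a hypothetical mapping module my_mapping.py might be invoked like this (the file and parameter names here are illustrative only):
python my_mapping.py -t -D collection=geodata records/r1.xml > bucket-view.xml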
Buckets, and the vocabularies associated with hierarchical buckets, must be declared before they can be referenced; details are given in Appendix 1: Declaring buckets and vocabularies. However, most mappings will not need to make such background declarations, but instead will be able to import one or more pre-existing modules containing the necessary declarations, as in:
import ADL_buckets
The namespace statement
associates a prefix with an XML namespace; the prefix can then be
referenced in XPath expressions:
namespace(prefix, uri)
namespace("A", "http://adn.dlese.org")
The map statement is the
principal statement provided by the language. It specifies a query to
be performed against the source metadata. When the query is executed
(recall that all processing is performed at the end of the Python
script; the map statement and other language statements
simply describe the processing to be performed), zero or more values
are produced in the form of a list of tuples. ADL_mapper passes the
tuples through any filter and converter functions specified by the
map statement; any (converted) surviving tuples are
validated; and lastly, the valid tuples are appropriately formatted,
serialized, and included in the output. The map
statement's various functionalities (querying; filtering and
conversion; validation; encoding) can be performed individually using
other language statements and procedural code, but in general mappings
will find it most convenient to use the map
statement.
map(bucket, query [, field] [, prefilters] [, converters]
    [, postfilters] [, strict] [, id])
map("adl:titles", "/record/title")
bucket is the name of the bucket to map to. query is either a single string expression or a list of string expressions that query the source metadata. In the simplest cases, a query can be a single constant or a single XPath expression. The query language is described in Queries, below.
The optional field argument identifies the source metadata
field in the mapping, which is useful both as documentation and to
support field-level searching. A source metadata field is identified
by a 2-tuple (name, uri) where name is a human
readable name for the field and uri is a URI that uniquely
identifies the field. For example, the Dublin Core Title
element has human-readable name "[DC] Title" (by
convention, the field name is prefixed with an abbreviation for the
metadata standard) and URI
"http://purl.org/dc/elements/1.1/title". In this case
the URI for the element has already been assigned by Dublin Core; in
cases where there is no existing URI, it is recommended that Tag URIs be
created.
The remaining optional arguments are discussed under Other mapping features, below.
Putting together everything discussed so far, a complete mapping
that maps the Dublin Core Title
and Creator
elements to the adl:titles and
adl:originators buckets, respectively, is:
from ADL_mapper import *
input()
import ADL_buckets
namespace("M", "http://example.org/myapp/")
namespace("D", "http://purl.org/dc/elements/1.1/")
map("adl:titles",
"/M:metadata/D:title",
("[DC] Title",
"http://purl.org/dc/elements/1.1/title"))
map("adl:originators",
"/M:metadata/D:creator",
("[DC] Creator",
"http://purl.org/dc/elements/1.1/creator"))
output()
Given the source metadata record:
<?xml version="1.0"?>
<metadata xmlns="http://example.org/myapp/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator>Sarah Bellum</dc:creator>
<dc:creator>Sandy Beach</dc:creator>
<dc:title>Neurosurgery for Dummies</dc:title>
</metadata>
the above mapping will produce this ADL bucket view:
<?xml version='1.0'?>
<!DOCTYPE ADL-bucket-report SYSTEM "...">
<ADL-bucket-report>
<identifier>collection:holding</identifier>
<bucket name='adl:titles'>
<textual-value>
<field name='[DC] Title'
uri='http://purl.org/dc/elements/1.1/title'/>
<text>Neurosurgery for Dummies</text>
</textual-value>
</bucket>
<bucket name='adl:originators'>
<textual-value>
<field name='[DC] Creator'
uri='http://purl.org/dc/elements/1.1/creator'/>
<text>Sarah Bellum</text>
</textual-value>
<textual-value>
<field name='[DC] Creator'
uri='http://purl.org/dc/elements/1.1/creator'/>
<text>Sandy Beach</text>
</textual-value>
</bucket>
</ADL-bucket-report>
We glossed over an important detail in the previous section: bucket types and the requirements bucket types place on mappings.
ADL_mapper validates all tuples, and only valid tuples are placed in the output. The bucket type determines what constitutes a valid tuple, and in particular, it determines how many components a tuple may have and the semantics of and any syntactic restrictions on tuple components. Some bucket types, such as the textual type, place little constraint on tuples, but other bucket types, such as the spatial and temporal types, require specific component syntaxes. Thus in creating a mapping, and especially in developing queries and writing filter and converter functions, a good understanding of the bucket type in question is paramount. For background information on bucket types, consult the ADL middleware specifications ADL-bucket-report.dtd and ADL-query.dtd.
The following are the validity requirements of the built-in ADL bucket types:
hierarchical ("ADL Object Formats", "Online")identificationNone, in which
case the 2-tuple is equivalent to a 1-tuple. Example tuple: ("0-201-63274-8", "ISBN")numericNone, in which case the 2-tuple is equivalent to a
1-tuple. Example tuple: ("3.2", "km")relational ("part of", "adl_catalog:314159")spatial ("35.7", "-120.5")temporal ("1997-07-17", "2005-03")textual ("some text")To form tuples from the source metadata, the ADL mapping language provides a simple query language based on the XPath language.
A query consists of either a single term or a list of one or more
terms; each term may be a string constant (prefixed with an equals
sign ("=")), an absolute XPath expression (distinguished
by an initial forward slash ("/")), or a relative XPath
expression (anything else). Executing a query produces zero or more
tuples of (string) values. With one caveat noted below, each query
term contributes one component to the output tuples.
Processing of an absolute XPath expression depends on whether the expression is followed by any relative XPath expressions. In the case when the absolute expression is not, the expression may identify any XML element or attribute, and a tuple will be formed for each such XML element or attribute present in the source metadata. For example:
source metadata:
<metadata>
<author>Warren Peace</author>
<author>Lou Tennant</author>
</metadata>
query:
["/metadata/author"]
result:
[("Warren Peace",),
("Lou Tennant",)]
(The bizarre-looking syntax
(value,) in the result above is
Python's notation for a 1-tuple.)
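The distinction is easy to verify in a Python interpreter:
>>> ("Warren Peace")     # parentheses alone do not make a tuple
'Warren Peace'
>>> ("Warren Peace",)    # the trailing comma does
('Warren Peace',)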
A multi-term query may have more than one absolute XPath expression, although this is unlikely to be encountered in practice. In this case each expression is evaluated and contributes a component to the output tuples. The cardinalities of the expressions (that is, the numbers of values they produce) must be equal, else a fatal error results. For example:
source metadata:
<authors>
<first>Frank</first>
<mi>N</mi>
<last>Stein</last>
<first>Pete</first>
<last>Moss</last>
</authors>
query:
["/authors/first", "/authors/last"]
result:
[("Frank", "Stein"),
("Pete", "Moss")]
query:
["/authors/first", "/authors/mi", "/authors/last"]
result:
FATAL ERROR: incommensurable column lengths
An absolute XPath expression may be followed by one or more
relative XPath expressions. In this case (here's the caveat alluded
to previously), the absolute expression does not contribute a
component to the output tuples, but only serves to provide a set of
contextual nodes for the interpretation of the relative expressions.
For each contextual node the relative expressions are evaluated; each
relative expression must produce zero or one values per contextual
node, else a fatal error results (unless the join
function is used; see below). Analogous to the "outer join" in SQL,
if a relative expression produces zero values, None is
inserted in the tuple. Thus the cardinality of each relative
expression is always equal to the number of contextual nodes. For
example:
source metadata:
<points>
<point>
<lat>51.5</lat>
<lon>-0.1167</lon>
</point>
<point>
<lat>48.8667</lat>
<lon>2.3333</lon>
</point>
<point>
<lat>90</lat>
</point>
</points>
query:
["/points/point", "lat", "lon"]
result:
[("51.5", "-0.1167"),
("48.8667", "2.3333"),
("90", None)]
(The interested reader may like to try rewriting the example query above that resulted in an "incommensurable column lengths" error using relative XPath expressions. An answer can be found in Tips & tricks, below.)
As a slight extension to XPath, the
join(expr, str)
function can be used to concatenate multiple values produced by
relative expression expr into a single value; the values are
separated by string str. For example:
source metadata:
<authors>
<name>
<first>Sherlock</first>
<last>Holmes</last>
</name>
<name>
<honorific>Dr.</honorific>
<first>John</first>
<middle>H.</middle>
<last>Watson</last>
</name>
</authors>
query:
["/authors/name", "join(*, ' ')"]
result:
[("Sherlock Holmes",),
("Dr. John H. Watson",)]
Constant terms, which are recognized by having an equals sign
("=") prefix, can be inserted anywhere in a query, and do
not affect the processing of XPath expressions. Constants are
replicated to match the expressions' cardinality; equivalently, a
Cartesian product is performed between constants and components
derived from XPath expressions. A query that consists entirely of
constants produces a single tuple. For example, here's a revision of
the previous latitude/longitude query that includes a constant:
query:
["/points/point", "lat", "=test", "lon"]
result:
[("51.5", "test", "-0.1167"),
("48.8667", "test", "2.3333"),
("90", "test", None)]
Note that, as part of query processing, XML element and attribute
values are canonized before being placed in tuples. Leading and
trailing whitespace is removed, and empty and all-whitespace values
are converted to None. Tuples consisting entirely of
None values are discarded.
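A minimal Python sketch of this canonization rule (an illustration of the behavior just described, not ADL_mapper's actual code):
def canonize(value):
    # strip leading and trailing whitespace;
    # empty and all-whitespace values become None
    if value is None:
        return None
    value = value.strip()
    return value if value else None

def keepTuple(t):
    # tuples consisting entirely of None values are discarded
    return any(c is not None for c in t)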
Recall the map statement's syntax:
map(bucket, query [, field] [, prefilters] [, converters]
    [, postfilters] [, strict] [, id])
So far we have discussed the map statement's overall
processing and the bucket, query, and field
arguments.
The prefilters, converters, and postfilters arguments specify functions to be applied to tuples before the tuples are submitted for validation and encoding. Each of these arguments may be a single function or a list of zero or more functions. There are two types of functions, converters and filters. Filters are divided into "prefilters," which are called before any converters, and "postfilters," which are called after; within these categories, functions are called in the order specified. Filters and converters have the same profile, but they have slightly different return value semantics and are handled differently by ADL_mapper.
filter(tuple) ⇒ retval
converter(tuple) ⇒ retval
retval ::= tuple | None
A filter is passed a tuple, and it should return either the same or
another tuple, or None. If it returns a tuple, the
returned tuple is passed to the next filter in the sequence
(corollary: a tuple that reaches validation will have passed through
all filters). But if the filter returns None, the
mapping of that tuple is abandoned. Thus filters are useful for
performing transformations on tuples and for rejecting tuples. Here
is an example of a filter that rejects tuples (1-tuples, in this case)
that do not appear to be dates:
import re

def weedOutNonDates (v):
    # a filter: reject values that do not look like ISO dates
    if re.match(r"\d\d\d\d-\d\d-\d\d$", v[0]):
        return v
    else:
        return None

map("adl:dates",
    ...,
    prefilters=weedOutNonDates)
A converter also should return either the same or another tuple, or
None. ADL_mapper passes a tuple under consideration to
each converter in turn. If a converter returns a tuple, the remaining
converters are ignored and the returned tuple is passed to any
postfilters; otherwise, if all converters return None,
the original tuple is passed to any postfilters. Converters are thus
useful as pattern recognizers. Here is an example of a converter that
recognizes dates lacking dashes, and transforms them accordingly:
def insertDashes (v):
    # a converter: recognize dates lacking dashes (YYYYMMDD)
    m = re.match(r"(\d\d\d\d)(\d\d)(\d\d)$", v[0])
    if m:
        return ("%s-%s-%s" %
                (m.group(1), m.group(2), m.group(3)),)
    else:
        return None

map("adl:dates",
    ...,
    converters=insertDashes)
The strict argument, a boolean, determines the handling of
invalid tuples. If True (the default), an invalid tuple
causes a fatal error. If False, invalid tuples are
simply ignored. The latter mode can be thought of as "opportunistic"
mapping.
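For example, to map dates opportunistically, silently dropping any values that fail temporal validation (the query is elided):
map("adl:dates",
    ...,
    strict=False)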
The id argument assigns an integer ID to the mapping, which makes it easy to reference the mapping from derived mappings (see Derived mappings, below). The ID must be unique among the IDs of all mappings to bucket.
The mapConstant
statement is a simplified form of the map statement:
mapConstant(bucket, value [, field] [, id])
value is either a string constant or a tuple of string
constants to be mapped; the constants should not be prefixed
with equals signs ("="), as they are in queries. The
other arguments are as in the map statement. For
example, the following two statements are equivalent:
map("adl:formats",
["=ADL Object Formats", "=Online"])
mapConstant("adl:formats",
("ADL Object Formats", "Online"))
The map statement is the most convenient way to
express mappings, but the mapping language offers several other
statements that break up the map statement's
functionality and make it possible to write more procedurally-oriented
mappings.
The get statement
executes a query and returns a list of zero or more tuples. The
present statement does the same, but returns
True if and only if the query produces at least one
tuple.
get(query) ⇒ [tuple, ...]
present(query) ⇒ True | False
if present("/metadata/citation/geoform"):
    form = get("/metadata/citation/geoform")[0][0]
    ...
The getSource statement
returns the root DOM node of the source metadata document, thus
enabling arbitrary XML processing.
getSource() ⇒ node
root = getSource()
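For example, assuming the returned node supports the standard DOM interface (the element name searched for here is hypothetical):
root = getSource()
# count all <keyword> elements anywhere in the source document
keywordCount = len(root.getElementsByTagName("keyword"))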
The getVocabulary
statement returns a vocabulary:
getVocabulary(name) ⇒ vocabulary
vocabulary ::= (buckets, termAncestorMap)
buckets ::= [bucket, ...]
termAncestorMap ::= {term: [term, ...], ...}
getVocabulary("ADL Object Formats") ⇒
(["adl:formats"],
{ "Online" : [],
"Image" : ["Online"],
"TIFF" : ["Image", "Online"],
...
})
name is the vocabulary's name. If there is no such
vocabulary, getVocabulary returns None;
otherwise, the returned vocabulary is described by a 2-tuple
(buckets, termAncestorMap). buckets is a list of
one or more buckets with which the vocabulary is associated.
termAncestorMap is a dictionary that maps each term in the
vocabulary to a list of all of the term's ancestors (i.e., broader
terms).
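For example, using the "ADL Object Formats" vocabulary shown above, a mapping can test vocabulary membership and look up broader terms (the term values here are taken from the earlier example):
vocab = getVocabulary("ADL Object Formats")
if vocab is not None:
    buckets, termAncestorMap = vocab
    if "TIFF" in termAncestorMap:
        # broader terms of "TIFF", e.g. ["Image", "Online"]
        broader = termAncestorMap["TIFF"]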
The requirement and
expectation statements place additional checks on the
numbers of mappings made to buckets, which can be a useful sanity
check when processing large numbers of metadata records. An
unsatisfied requirement results in a fatal error; an unsatisfied
expectation results in a warning.
requirement(bucket, cardinality)
expectation(bucket, cardinality)
requirement("adl:titles", "1")
bucket is the bucket to apply the check to.
cardinality is the expected number of mappings to the bucket,
and must be "1" (exactly one), "1?" (zero or
one), "1+" (one or more), or "0+" (any
number). For example, if, for a given source metadata document, two
values are mapped to the adl:titles bucket, the above
requirement will generate the following error message:
FATAL ERROR: mapping requirement not satisfied: required
cardinality for bucket 'adl:titles' is '1', got 2 mappings
More generally, the fatal and
warning statements can be used by mappings to generate
arbitrary fatal errors and warning messages, respectively:
fatal(message)
warning(message)
The
consolidateTextualValues statement changes how multiple
textual values mapped from the same metadata field to the same bucket
are handled.
consolidateTextualValues(buckets [, separator])
consolidateTextualValues("adl:assigned-terms")
consolidateTextualValues("adl:titles", " ")
buckets is either a single bucket or a list of buckets to
which the statement applies. separator is the string used to
separate multiple values; if not specified, it defaults to
"; ". For example, given the first of the above
example statements, the following unconsolidated mappings:
...
<bucket name='adl:assigned-terms'>
<textual-value>
<field name='F1'/>
<text>oceanography</text>
</textual-value>
<textual-value>
<field name='F2'/>
<text>technology</text>
</textual-value>
<textual-value>
<field name='F1'/>
<text>satellite data</text>
</textual-value>
</bucket>
...
will be consolidated as follows:
...
<bucket name='adl:assigned-terms'>
<textual-value>
<field name='F1'/>
<text>oceanography; satellite data</text>
</textual-value>
<textual-value>
<field name='F2'/>
<text>technology</text>
</textual-value>
</bucket>
...
The getParam and
setParam statements get and set arbitrary parameters,
respectively:
getParam(param) ⇒ value
setParam(param, value)
x = getParam("collection")
setParam("holding", "1001652")
As mentioned in Overall structure and
invocation, parameter values can be specified on the command line
as well. Two parameters have special meaning to ADL_mapper:
collection and holding are, respectively,
the collection name and holding identifier that are inserted in the
output <identifier> element. By default, these
parameters have values identical to their names.
A mapping can be derived from another mapping (the "parent" mapping) in the sense that the derived mapping inherits and can augment the parent mapping's statements. This is particularly useful in adapting a generic mapping to the idiosyncrasies of a particular dataset. A derived mapping has the overall structure:
from ADL_mapper import *
input()
import parent mapping
additional statements
output()
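For example, a derived mapping module might look like the following sketch (the parent module name and the added query are hypothetical):
from ADL_mapper import *
input()
import generic_mapping       # the parent mapping
# additional statements adapting the parent to this dataset
map("adl:titles", "/record/alt-title")
output()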
In addition to adding new statements, the derived mapping can override, modify, or undo any statement in the parent mapping. As a general rule, this is accomplished simply by repeating the statement with different arguments. Below we detail some exceptional cases.
Consolidation of textual values can be disabled by specifying
None as the separator, as in:
consolidateTextualValues("adl:assigned-terms", None)
Requirements and expectations can be removed by relaxing them. For
example, the following statement effectively removes any expectation
on the adl:geographic-locations bucket:
expectation("adl:geographic-locations", "0+")
The strict, unmap, and prepend/append
prefilter/converter/postfilter family of statements modify existing
mappings:
strict(bucket, newStrict [, field] [, id])
unmap(bucket [, field] [, id])
prependPrefilter(bucket, filter [, field] [, id])
appendPrefilter(bucket, filter [, field] [, id])
prependConverter(bucket, converter [, field] [, id])
appendConverter(bucket, converter [, field] [, id])
prependPostfilter(bucket, filter [, field] [, id])
appendPostfilter(bucket, filter [, field] [, id])
strict("adl:dates", False)
def anotherDateFormat...
appendConverter("adl:dates", anotherDateFormat, id=3)
The strict statement modifies a mapping's strictness
setting. The unmap statement removes a mapping entirely.
The other statements append or prepend filter or converter functions
as suggested by their names. The bucket, field, and
id arguments in these statements govern the mapping(s) to which
the statement applies. If only bucket is specified, the
statement applies to all mappings to that bucket. If field is
specified as well, the statement applies only to mappings to
bucket from field. If id is specified, the
statement applies only to the identified mapping to bucket.
Some tips and tricks...
Remember that filter and converter functions are passed tuples, not bare strings, and must return tuples (or None). So instead of this:
def addPrefix (v):
    return "prefix"+v
write this:
def addPrefix (v):
    return ("prefix"+v[0],)
Joining values at different levels of the source hierarchy can be tricky. Consider the following source metadata:
<metadata>
<subject-keywords>
<thesaurus>T1</thesaurus>
<keywords>
<keyword>K1.1</keyword>
<keyword>K1.2</keyword>
</keywords>
</subject-keywords>
<subject-keywords>
<thesaurus>T2</thesaurus>
<keywords>
<keyword>K2.1</keyword>
<keyword>K2.2</keyword>
</keywords>
</subject-keywords>
</metadata>
Suppose we would like to join keywords with their thesauri to form the
following tuples:
("K1.1", "T1")
("K1.2", "T1")
("K2.1", "T2")
("K2.2", "T2")
The trick is to use an absolute XPath expression that selects the
lower-level nodes (in this case, the keywords) together with two
relative XPath expressions, one of which retrieves the node content
itself (i.e., the keyword) and the other of which traverses upward to
retrieve the associated higher-level information (i.e., the
thesaurus):
["/metadata/subject-keywords/keywords/keyword",
".",
"../../thesaurus"]
Here is the promised answer to the exercise posed in Queries, above. Recall the source metadata:
<authors>
<first>Frank</first>
<mi>N</mi>
<last>Stein</last>
<first>Pete</first>
<last>Moss</last>
</authors>
We can't use three absolute XPath expressions since the expressions
have different cardinalities. Nor can we use "/authors"
as a context-defining absolute XPath expression, together with three
relative XPath expressions for the components, because of the multiple
<first> and <last> subelements.
Instead, we use an absolute XPath expression that selects one of the
components, together with relative sibling expressions. The following
query assumes that <first> and
<last> are required, and that only the
<mi> subelement is optional:
["/authors/first",
".",
"following-sibling::*[1][self::mi]",
"following-sibling::last[1]"]
The bucket statement declares
a bucket:
bucket(name, type)
bucket("adl:geographic-locations", "spatial")
bucket("dlese:grade-ranges", "hierarchical")
name is the bucket's name; to avoid name clashes, standard procedure is to prefix bucket names as shown in the examples above. type is the bucket's type, and must be one of:
hierarchical
identification
numeric
relational
spatial
temporal
textual
Appendix 2: Defining bucket types describes how additional bucket types can be defined.
One or more vocabularies must be
associated with each hierarchical bucket. The vocabulary
statement declares a vocabulary and associates it with one or more
buckets:
vocabulary(name, buckets, terms)
terms ::= [term, ...]
term ::= str | (str, [term, ...])
vocabulary("ADL Object Type Thesaurus", "adl:types",
[("cartographic works",
["maps"]),
("images",
[("photographs",
["aerial photographs"]),
("remote-sensing images",
["aerial photographs"])])])
name is the vocabulary's name. buckets is either the
name of a single bucket or a list of one or more names of buckets with
which the vocabulary is to be associated. terms is a
hierarchical listing of the vocabulary's terms. At the top level, the
vocabulary is described by a list of zero or more terms. A
term is a 2-tuple consisting of the term's name and a list of
zero or more narrower terms, each of which is recursively described by
a term. Terms that are leaves in the hierarchy can be
described by simple names. The example vocabulary above has two
top-level terms, cartographic works and
images. cartographic works has one narrower
term, maps, while images has two narrower
terms, photographs and remote-sensing
images, both of which have aerial photographs as a
narrower term.
The bucketType statement
declares a bucket type:
bucketType(name, validator, encoder)
def _validate...
def _encode...
bucketType("spatial", _validate, _encode)
name is the name of the type. validator and encoder are functions that are called by ADL_mapper to respectively validate and encode values of the type.
The validation function should have the profile:
validator(bucket, field, value, strict) ⇒ retval
field ::= (name, uri) | None
retval ::= (field, value) | None
bucket is the name of the bucket the value is being mapped
to and field is either a 2-tuple describing the source metadata
field or None; both these arguments are passed in just
for error-reporting purposes. value, a tuple, is the value to
be validated; it may have any number of components, any or all of
which may be None. In addition to validating the value,
the function is free to perform any conversions or transformations
desired. If the value is valid, the function should return a 2-tuple
consisting of the field descriptor and the (possibly transformed)
value tuple. If the value is not valid, and if strict, a
boolean, is True, the function should call fatal; otherwise, the function should
return None.
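As an illustration only, a validator for a hypothetical "isbn" bucket type might look like this (the type and its syntax check are invented for this sketch):
import re

def _validate(bucket, field, value, strict):
    # a valid value is a 1-tuple whose component looks like an ISBN
    v = value[0]
    if v is not None and re.match(r"[0-9X-]{10,17}$", v):
        return (field, (v,))
    if strict:
        fatal("invalid value %r for bucket %s" % (value, bucket))
    return None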
The encoding function should have the profile:
encoder(document, field, value) ⇒ node
document is the DOM node representing the entire output document. field and value are exactly as returned from the validation function. The function should return a new DOM node suitable for appending to the output document.
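A matching encoder sketch, using the standard DOM interface (the element names mirror the textual-value encoding shown earlier but are illustrative):
def _encode(document, field, value):
    # build an element for the hypothetical "isbn" type
    node = document.createElement("isbn-value")
    if field is not None:
        f = document.createElement("field")
        f.setAttribute("name", field[0])
        f.setAttribute("uri", field[1])
        node.appendChild(f)
    t = document.createElement("text")
    t.appendChild(document.createTextNode(value[0]))
    node.appendChild(t)
    return node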
The bucketType statement can appear anywhere, but by
convention bucket types are defined by modules in the
bucket_types directory. Inserting an import statement in
bucket_types/__init__.py causes the module (i.e., bucket
type) to be automatically loaded.
appendConverter
appendPostfilter
appendPrefilter
bucket
bucketType
consolidateTextualValues
expectation
fatal
get
getParam
getSource
getVocabulary
input
map
mapConstant
namespace
output
prependConverter
prependPostfilter
prependPrefilter
present
requirement
setParam
strict
unmap
vocabulary
warning
last modified 2009-11-19 22:34