Mapping Language Tutorial

This is a tutorial introduction to the ADL metadata mapping language.

Notation

Syntax definitions look like this:

definition

Examples look like this:

example

Installation

  1. Install Python if it's not already on your system.
  2. Install PyXML.
  3. Download and unpack the ADL_mapper distribution file (mm.tar.gz), yielding a directory mm of Python modules. New mappings can be developed right in the mm directory, or the ADL_mapper modules can be placed in any directory in the Python module path.

Overall structure and invocation

Mappings developed using ADL_mapper are written in the Python scripting language. As will be seen, mappings are largely declarative in nature, and can in simple cases involve no procedural code at all. But in all cases some Python knowledge will be required on the part of the mapping developer, certainly the rules of Python syntax at minimum. The Python tutorial is recommended background reading.

A mapping is a Python module having the overall structure:

from ADL_mapper import *
input()

statements

output()

The import statement loads the infrastructure that supports the mapping language. The input statement processes command-line arguments and loads the source metadata. The output statement performs all processing (mapping, conversion, validation, etc.) and then outputs the mapped metadata, which currently consists of the ADL bucket view only. In between the input and output statements may be placed, in addition to arbitrary procedural code, a number of different kinds of statements that govern the mapping and other processing to take place. Strictly speaking, these "statements" are Python function calls (just as the input and output statements are), but because these calls are declarative in nature and can generally be placed in any order, we refer to them as (declarative) statements in this document.

A mapping is invoked from the command line using a command of the form:

python mapping [-t] [-Dparam=value] [input-file]

The source metadata, which must be an XML document, is read from input-file if specified or standard input otherwise. The mapped ADL bucket view is written to standard output. The -D option can be used as many times as desired to specify parameter values that can be retrieved from within the mapping using the getParam statement. The -t option enables error tracebacks.

Background declarations

Buckets, and the vocabularies associated with hierarchical buckets, must be declared before they can be referenced; details are given in Appendix 1: Declaring buckets and vocabularies. Most mappings, however, will not need to make such background declarations themselves; instead they can import one or more pre-existing modules containing the necessary declarations, as in:

import ADL_buckets

The namespace statement associates a prefix with an XML namespace; the prefix can then be referenced in XPath expressions:

namespace(prefix, uri)
namespace("A", "http://adn.dlese.org")

Mapping fundamentals

The map statement is the principal statement provided by the language. It specifies a query to be performed against the source metadata. When the query is executed (recall that all processing is performed at the end of the Python script; the map statement and other language statements simply describe the processing to be performed), zero or more values are produced in the form of a list of tuples. ADL_mapper passes the tuples through any filter and converter functions specified by the map statement; any (converted) surviving tuples are validated; and lastly, the valid tuples are appropriately formatted, serialized, and included in the output. The map statement's various functionalities (querying; filtering and conversion; validation; encoding) can be performed individually using other language statements and procedural code, but in general mappings will find it most convenient to use the map statement.

map(bucket, query [, field]
    [, prefilters] [, converters] [, postfilters]
    [, strict] [, id])
map("adl:titles", "/record/title")

bucket is the name of the bucket to map to. query is either a single string expression or a list of string expressions that query the source metadata. In the simplest cases, a query can be a single constant or a single XPath expression. The query language is described in Queries, below.

The optional field argument identifies the source metadata field in the mapping, which is useful both as documentation and to support field-level searching. A source metadata field is identified by a 2-tuple (name, uri) where name is a human readable name for the field and uri is a URI that uniquely identifies the field. For example, the Dublin Core Title element has human-readable name "[DC] Title" (by convention, the field name is prefixed with an abbreviation for the metadata standard) and URI "http://purl.org/dc/elements/1.1/title". In this case the URI for the element has already been assigned by Dublin Core; in cases where there is no existing URI, it is recommended that Tag URIs be created.

The remaining optional arguments are discussed under Other mapping features, below.

Putting together everything discussed so far, a complete mapping that maps the Dublin Core Title and Creator elements to the adl:titles and adl:originators buckets, respectively, is:

from ADL_mapper import *
input()

import ADL_buckets

namespace("M", "http://example.org/myapp/")
namespace("D", "http://purl.org/dc/elements/1.1/")

map("adl:titles",
    "/M:metadata/D:title",
    ("[DC] Title",
     "http://purl.org/dc/elements/1.1/title"))

map("adl:originators",
    "/M:metadata/D:creator",
    ("[DC] Creator",
     "http://purl.org/dc/elements/1.1/creator"))

output()

Given the source metadata record:

<?xml version="1.0"?>
<metadata xmlns="http://example.org/myapp/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Sarah Bellum</dc:creator>
  <dc:creator>Sandy Beach</dc:creator>
  <dc:title>Neurosurgery for Dummies</dc:title>
</metadata>

the above mapping will produce this ADL bucket view:

<?xml version='1.0'?>
<!DOCTYPE ADL-bucket-report SYSTEM "...">
<ADL-bucket-report>
  <identifier>collection:holding</identifier>
  <bucket name='adl:titles'>
    <textual-value>
      <field name='[DC] Title'
        uri='http://purl.org/dc/elements/1.1/title'/>
      <text>Neurosurgery for Dummies</text>
    </textual-value>
  </bucket>
  <bucket name='adl:originators'>
    <textual-value>
      <field name='[DC] Creator'
        uri='http://purl.org/dc/elements/1.1/creator'/>
      <text>Sarah Bellum</text>
    </textual-value>
    <textual-value>
      <field name='[DC] Creator'
        uri='http://purl.org/dc/elements/1.1/creator'/>
      <text>Sandy Beach</text>
    </textual-value>
  </bucket>
</ADL-bucket-report>

Bucket types

We glossed over an important detail in the previous section: bucket types and the requirements bucket types place on mappings.

ADL_mapper validates all tuples, and only valid tuples are placed in the output. The bucket type determines what constitutes a valid tuple, and in particular, it determines how many components a tuple may have and the semantics of and any syntactic restrictions on tuple components. Some bucket types, such as the textual type, place little constraint on tuples, but other bucket types, such as the spatial and temporal types, require specific component syntaxes. Thus in creating a mapping, and especially in developing queries and writing filter and converter functions, a good understanding of the bucket type in question is paramount. For background information on bucket types, consult the ADL middleware specifications ADL-bucket-report.dtd and ADL-query.dtd.

The following are the validity requirements of the built-in ADL bucket types:

hierarchical
Accepts a 2-tuple of the form (vocabulary, term) describing a vocabulary term. vocabulary must be one of the vocabularies associated with the bucket and term must be one of the vocabulary's terms. Example tuple:
  ("ADL Object Formats", "Online")
identification
Accepts a 1-tuple of the form (identifier) or a 2-tuple of the form (identifier, namespace). Both identifier and namespace may be arbitrary strings; namespace may be None, in which case the 2-tuple is equivalent to a 1-tuple. Example tuple:
  ("0-201-63274-8", "ISBN")
numeric
Accepts a 1-tuple of the form (value) describing a numeric value in standard floating-point notation, or a 2-tuple of the form (value, unit) describing a numeric value with an associated unit of measure. unit may be None, in which case the 2-tuple is equivalent to a 1-tuple. Example tuple:
  ("3.2", "km")
relational
Accepts a 2-tuple of the form (relation, target) describing a relationship to another collection item. relation may be an arbitrary string. target should be a global ADL object identifier, but this is not validated. Example tuple:
  ("part of", "adl_catalog:314159")
spatial
Accepts a 2-tuple of the form (latitude, longitude) describing a point on the Earth's surface, or a 4-tuple of the form (north, south, east, west) describing a box bounded by the given coordinates. All coordinates must be expressed in signed decimal degrees in standard floating point notation. Latitudes must be expressed as degrees north of the equator; longitudes, as degrees east of the Greenwich meridian and in the range [-180, 180]. A box is considered to cross the ±180° meridian if its east coordinate is less than its west coordinate. The range of all longitudes is described by a west longitude of -180 and an east longitude of 180. Example tuple:
  ("35.7", "-120.5")
temporal
Accepts a 1-tuple of the form (date) describing a single Gregorian calendar date, or a 2-tuple of the form (begin, end) describing a range of calendar dates. Dates must be expressed in ISO 8601 YYYY-MM-DD notation; trailing components may be elided. Example tuple:
  ("1997-07-17", "2005-03")
textual
Accepts any 1-tuple of the form (text). Example tuple:
  ("some text")
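As an illustration of the kind of checking involved, the spatial rules above can be sketched in plain Python. This is a sketch only; the function name is ours, and the actual validation is performed inside ADL_mapper:

```python
def is_valid_spatial(value):
    # Sketch of the spatial tuple rules; not ADL_mapper's code.
    try:
        coords = [float(c) for c in value]
    except (TypeError, ValueError):
        return False
    if len(coords) == 2:               # (latitude, longitude) point
        lat, lon = coords
        return -90 <= lat <= 90 and -180 <= lon <= 180
    if len(coords) == 4:               # (north, south, east, west) box
        n, s, e, w = coords
        # east < west is legal: the box crosses the +/-180 meridian
        return (all(-90 <= x <= 90 for x in (n, s))
                and all(-180 <= x <= 180 for x in (e, w)))
    return False

is_valid_spatial(("35.7", "-120.5"))   # the example point above
```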

Queries

To form tuples from the source metadata, the ADL mapping language provides a simple query language based on the XPath language.

A query consists of either a single term or a list of one or more terms; each term may be a string constant (prefixed with an equals sign ("=")), an absolute XPath expression (distinguished by an initial forward slash ("/")), or a relative XPath expression (anything else). Executing a query produces zero or more tuples of (string) values. With one caveat noted below, each query term contributes one component to the output tuples.

Processing of an absolute XPath expression depends on whether the expression is followed by any relative XPath expressions. In the case when the absolute expression is not, the expression may identify any XML element or attribute, and a tuple will be formed for each such XML element or attribute present in the source metadata. For example:

source metadata:
    <metadata>
      <author>Warren Peace</author>
      <author>Lou Tennant</author>
    </metadata>

query:
    ["/metadata/author"]

result:
    [("Warren Peace",),
     ("Lou Tennant",)]

(The bizarre-looking syntax (value,) in the result above is Python's notation for a 1-tuple.)

A multi-term query may have more than one absolute XPath expression, although this is unlikely to be encountered in practice. In this case each expression is evaluated and contributes a component to the output tuples. The cardinalities of the expressions (that is, the numbers of values they produce) must be equal, else a fatal error results. For example:

source metadata:
    <authors>
      <first>Frank</first>
      <mi>N</mi>
      <last>Stein</last>
      <first>Pete</first>
      <last>Moss</last>
    </authors>

query:
    ["/authors/first", "/authors/last"]

result:
    [("Frank", "Stein"),
     ("Pete", "Moss")]

query:
    ["/authors/first", "/authors/mi", "/authors/last"]

result:
    FATAL ERROR: incommensurable column lengths

An absolute XPath expression may be followed by one or more relative XPath expressions. In this case (here's the caveat alluded to previously), the absolute expression does not contribute a component to the output tuples, but only serves to provide a set of contextual nodes for the interpretation of the relative expressions. For each contextual node the relative expressions are evaluated; each relative expression must produce zero or one values per contextual node, else a fatal error results (unless the join function is used; see below). Analogous to the "outer join" in SQL, if a relative expression produces zero values, None is inserted in the tuple. Thus the cardinality of each relative expression is always equal to the number of contextual nodes. For example:

source metadata:
    <points>
      <point>
        <lat>51.5</lat>
        <lon>-0.1167</lon>
      </point>
      <point>
        <lat>48.8667</lat>
        <lon>2.3333</lon>
      </point>
      <point>
        <lat>90</lat>
      </point>
    </points>

query:
    ["/points/point", "lat", "lon"]

result:
    [("51.5", "-0.1167"),
     ("48.8667", "2.3333"),
     ("90", None)]
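The contextual-node rule can be re-implemented for illustration with Python's standard xml.etree module (a sketch of the semantics only; ADL_mapper's own query engine is based on full XPath):

```python
import xml.etree.ElementTree as ET

source = """<points>
  <point><lat>51.5</lat><lon>-0.1167</lon></point>
  <point><lat>48.8667</lat><lon>2.3333</lon></point>
  <point><lat>90</lat></point>
</points>"""

root = ET.fromstring(source)
tuples = []
for point in root.findall("point"):      # contextual nodes
    row = []
    for rel in ("lat", "lon"):           # relative expressions
        hits = point.findall(rel)
        # zero values -> None, like an SQL outer join; more than
        # one value would be a fatal error in ADL_mapper
        row.append(hits[0].text if hits else None)
    tuples.append(tuple(row))
```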

(The interested reader may like to try rewriting the example query above that resulted in an "incommensurable column lengths" error using relative XPath expressions. An answer can be found in Tips & tricks, below.)

As a slight extension to XPath, the join(expr,str) function can be used to concatenate multiple values produced by relative expression expr into a single value; the values are separated by string str. For example:

source metadata:
    <authors>
      <name>
        <first>Sherlock</first>
        <last>Holmes</last>
      </name>
      <name>
        <honorific>Dr.</honorific>
        <first>John</first>
        <middle>H.</middle>
        <last>Watson</last>
      </name>
    </authors>

query:
    ["/authors/name", "join(*, ' ')"]

result:
    [("Sherlock Holmes",),
     ("Dr. John H. Watson",)]
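The effect of join(*, ' ') can be approximated in plain Python by joining the text of each name element's children (again an xml.etree illustration, not ADL_mapper's implementation):

```python
import xml.etree.ElementTree as ET

source = """<authors>
  <name><first>Sherlock</first><last>Holmes</last></name>
  <name><honorific>Dr.</honorific><first>John</first>
        <middle>H.</middle><last>Watson</last></name>
</authors>"""

root = ET.fromstring(source)
# join(*, ' '): concatenate the values of all child elements,
# separated by a space
tuples = [(" ".join(child.text.strip() for child in name),)
          for name in root.findall("name")]
```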

Constant terms, which are recognized by having an equals sign ("=") prefix, can be inserted anywhere in a query, and do not affect the processing of XPath expressions. Constants are replicated to match the expressions' cardinality; equivalently, a Cartesian product is performed between constants and components derived from XPath expressions. A query that consists entirely of constants produces a single tuple. For example, here's a revision of the previous latitude/longitude query that includes a constant:

query:
    ["/points/point", "lat", "=test", "lon"]

result:
    [("51.5", "test", "-0.1167"),
     ("48.8667", "test", "2.3333"),
     ("90", "test", None)]

Note that, as part of query processing, XML element and attribute values are canonicalized before being placed in tuples: leading and trailing whitespace is removed, and empty and all-whitespace values are converted to None. Tuples consisting entirely of None values are discarded.
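The canonization rule amounts to the following (a sketch; the function name is ours):

```python
def canonize(raw):
    # Sketch of the canonization rule; not ADL_mapper's code.
    t = tuple((c.strip() or None) if isinstance(c, str) else c
              for c in raw)
    # tuples consisting entirely of None values are discarded
    return None if all(c is None for c in t) else t
```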

Other mapping features

Recall the map statement's syntax:

map(bucket, query [, field]
    [, prefilters] [, converters] [, postfilters]
    [, strict] [, id])

So far we have discussed the map statement's overall processing and the bucket, query, and field arguments.

The prefilters, converters, and postfilters arguments specify functions to be applied to tuples before the tuples are submitted for validation and encoding. Each of these arguments may be a single function or a list of zero or more functions. There are two types of functions, converters and filters. Filters are divided into "prefilters," which are called before any converters, and "postfilters," which are called after; within these categories, functions are called in the order specified. Filters and converters have the same profile, but they have slightly different return value semantics and are handled differently by ADL_mapper.

filter(tuple) ⇒ retval
converter(tuple) ⇒ retval

    retval ::= tuple | None

A filter is passed a tuple, and it should return either the same or another tuple, or None. If it returns a tuple, the returned tuple is passed to the next filter in the sequence (corollary: a tuple that reaches validation will have passed through all filters). But if the filter returns None, the mapping of that tuple is abandoned. Thus filters are useful for performing transformations on tuples and for rejecting tuples. Here is an example of a filter that rejects tuples (1-tuples, in this case) that do not appear to be dates:

import re

def weedOutNonDates (v):
    # a filter
    if re.match(r"\d\d\d\d-\d\d-\d\d$", v[0]):
        return v
    else:
        return None

map("adl:dates",
    ...,
    prefilters=weedOutNonDates)

A converter also should return either the same or another tuple, or None. ADL_mapper passes a tuple under consideration to each converter in turn. If a converter returns a tuple, the remaining converters are ignored and the returned tuple is passed to any postfilters; otherwise, if all converters return None, the original tuple is passed to any postfilters. Converters are thus useful as pattern recognizers. Here is an example of a converter that recognizes dates lacking dashes, and transforms them accordingly:

def insertDashes (v):
    # a converter
    m = re.match(r"(\d\d\d\d)(\d\d)(\d\d)$", v[0])
    if m:
        return ("%s-%s-%s" %
            (m.group(1), m.group(2), m.group(3)),)
    else:
        return None

map("adl:dates",
    ...,
    converters=insertDashes)
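Putting the filter and converter rules together, the processing order can be sketched as follows (the helper name is ours; this is an illustration of the semantics, not ADL_mapper's actual code):

```python
def apply_pipeline(value, prefilters=(), converters=(), postfilters=()):
    # Sketch of ADL_mapper's tuple-processing order.
    for f in prefilters:              # filters run in order...
        value = f(value)
        if value is None:             # ...None abandons the tuple
            return None
    for c in converters:              # first converter to return a
        converted = c(value)          # tuple wins; if all return None,
        if converted is not None:     # the original tuple survives
            value = converted
            break
    for f in postfilters:
        value = f(value)
        if value is None:
            return None
    return value                      # would now go on to validation
```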

The strict argument, a boolean, determines the handling of invalid tuples. If True (the default), an invalid tuple causes a fatal error. If False, invalid tuples are simply ignored. The latter mode can be thought of as "opportunistic" mapping.

The id argument assigns an integer ID to the mapping, which makes it easy to reference the mapping from derived mappings (see Derived mappings, below). The ID must be unique among the IDs of all mappings to bucket.

The mapConstant statement is a simplified form of the map statement:

mapConstant(bucket, value [, field] [, id])

value is either a string constant or a tuple of string constants to be mapped; the constants should not be prefixed with equals signs ("="), as they are in queries. The other arguments are as in the map statement. For example, the following two statements are equivalent:

map("adl:formats",
    ["=ADL Object Formats", "=Online"])

mapConstant("adl:formats",
    ("ADL Object Formats", "Online"))

Procedural mappings

The map statement is the most convenient way to express mappings, but the mapping language offers several other statements that break up the map statement's functionality and make it possible to write more procedurally-oriented mappings.

The get statement executes a query and returns a list of zero or more tuples. The present statement does the same, but returns True if and only if the query produces at least one tuple.

get(query) ⇒ [tuple, ...]
present(query) ⇒ True | False
if present("/metadata/citation/geoform"):
    form = get("/metadata/citation/geoform")[0][0]
    ...

The getSource statement returns the root DOM node of the source metadata document, thus enabling arbitrary XML processing.

getSource() ⇒ node
root = getSource()

The getVocabulary statement returns a vocabulary:

getVocabulary(name) ⇒ vocabulary

    
    vocabulary ::= (buckets, termAncestorMap)
    buckets ::= [bucket, ...]
    termAncestorMap ::= { term : [term, ...], ... }
getVocabulary("ADL Object Formats") ⇒
    (["adl:formats"],
     { "Online" : [],
       "Image" : ["Online"],
       "TIFF" : ["Image", "Online"],
       ...
     })

name is the vocabulary's name. If there is no such vocabulary, getVocabulary returns None; otherwise, the returned vocabulary is described by a 2-tuple (buckets, termAncestorMap). buckets is a list of one or more buckets with which the vocabulary is associated. termAncestorMap is a dictionary that maps each term in the vocabulary to a list of all of the term's ancestors (i.e., broader terms).
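Because the ancestors are precomputed, broader-term checks reduce to a dictionary lookup. For example (the helper name is ours, applied to a fragment of the vocabulary above):

```python
def is_narrower_than(term_ancestor_map, term, broader):
    # True if `broader` is an ancestor (broader term) of `term`.
    return broader in term_ancestor_map.get(term, [])

formats = {"Online": [], "Image": ["Online"], "TIFF": ["Image", "Online"]}
is_narrower_than(formats, "TIFF", "Online")   # True
```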

More statements

The requirement and expectation statements place additional checks on the numbers of mappings made to buckets, which can be a useful sanity check when processing large numbers of metadata records. An unsatisfied requirement results in a fatal error; an unsatisfied expectation results in a warning.

requirement(bucket, cardinality)
expectation(bucket, cardinality)
requirement("adl:titles", "1")

bucket is the bucket to apply the check to. cardinality is the expected number of mappings to the bucket, and must be "1" (exactly one), "1?" (zero or one), "1+" (one or more), or "0+" (any number). For example, if, for a given source metadata document, two values are mapped to the adl:titles bucket, the above requirement will generate the following error message:

FATAL ERROR: mapping requirement not satisfied: required cardinality for bucket 'adl:titles' is '1', got 2 mappings
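The four cardinality codes amount to a simple table lookup, sketched here for illustration (not ADL_mapper's code):

```python
def satisfies(cardinality, count):
    # Sketch of the cardinality codes.
    return {"1":  count == 1,        # exactly one
            "1?": count <= 1,        # zero or one
            "1+": count >= 1,        # one or more
            "0+": True}[cardinality] # any number
```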

More generally, the fatal and warning statements can be used by mappings to generate arbitrary fatal errors and warning messages, respectively:

fatal(message)
warning(message)

The consolidateTextualValues statement changes how multiple textual values mapped from the same metadata field to the same bucket are handled.

consolidateTextualValues(buckets [, separator])
consolidateTextualValues("adl:assigned-terms")
consolidateTextualValues("adl:titles", " ")

buckets is either a single bucket or a list of buckets to which the statement applies. separator is the string used to separate multiple values; if not specified, it defaults to "". For example, given the first of the above example statements, the following unconsolidated mappings:

...
<bucket name='adl:assigned-terms'>
  <textual-value>
    <field name='F1'/>
    <text>oceanography</text>
  </textual-value>
  <textual-value>
    <field name='F2'/>
    <text>technology</text>
  </textual-value>
  <textual-value>
    <field name='F1'/>
    <text>satellite data</text>
  </textual-value>
</bucket>
...

will be consolidated as follows:

...
<bucket name='adl:assigned-terms'>
  <textual-value>
    <field name='F1'/>
    <text>oceanography; satellite data</text>
  </textual-value>
  <textual-value>
    <field name='F2'/>
    <text>technology</text>
  </textual-value>
</bucket>
...
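The consolidation rule can be sketched as grouping values by source field and joining them with the separator (a sketch only; the "; " separator below is chosen to match the example output):

```python
def consolidate(values, separator):
    # values: [(field, text), ...] in mapping order (sketch only).
    order, texts = [], {}
    for field, text in values:
        if field not in texts:
            order.append(field)
            texts[field] = []
        texts[field].append(text)
    return [(field, separator.join(texts[field])) for field in order]

consolidate([("F1", "oceanography"), ("F2", "technology"),
             ("F1", "satellite data")], "; ")
# ⇒ [("F1", "oceanography; satellite data"), ("F2", "technology")]
```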

The getParam and setParam statements get and set arbitrary parameters, respectively:

getParam(param) ⇒ value
setParam(param, value)
x = getParam("collection")
setParam("holding", "1001652")

As mentioned in Overall structure and invocation, parameter values can be specified on the command line as well. Two parameters have special meaning to ADL_mapper: collection and holding are, respectively, the collection name and holding identifier that are inserted in the output <identifier> element. By default, these parameters have values identical to their names.

Derived mappings

A mapping can be derived from another mapping (the "parent" mapping) in the sense that the derived mapping inherits and can augment the parent mapping's statements. This is particularly useful in adapting a generic mapping to the idiosyncrasies of a particular dataset. A derived mapping has the overall structure:

from ADL_mapper import *
input()

import parent mapping

additional statements

output()

In addition to adding new statements, the derived mapping can override, modify, or undo any statement in the parent mapping. As a general rule, this is accomplished simply by repeating the statement with different arguments. Below we detail some exceptional cases.

Consolidation of textual values can be disabled by specifying None as the separator, as in:

consolidateTextualValues("adl:assigned-terms", None)

Requirements and expectations can be removed by relaxing them. For example, the following statement effectively removes any expectation on the adl:geographic-locations bucket:

expectation("adl:geographic-locations", "0+")

The strict, unmap, and prepend/append/converter/filter family of statements modify existing mappings:

strict(bucket, newStrict [, field] [, id])
unmap(bucket [, field] [, id])
prependPrefilter(bucket, filter [, field] [, id])
appendPrefilter(bucket, filter [, field] [, id])
prependConverter(bucket, converter [, field] [, id])
appendConverter(bucket, converter [, field] [, id])
prependPostfilter(bucket, filter [, field] [, id])
appendPostfilter(bucket, filter [, field] [, id])
strict("adl:dates", False)

def anotherDateFormat...
appendConverter("adl:dates", anotherDateFormat, id=3)

The strict statement modifies a mapping's strictness setting. The unmap statement removes a mapping entirely. The other statements append or prepend filter or converter functions as suggested by their names. The bucket, field, and id arguments in these statements govern the mapping(s) to which the statement applies. If only bucket is specified, the statement applies to all mappings to that bucket. If field is specified as well, the statement applies only to mappings to bucket from field. If id is specified, the statement applies only to the identified mapping to bucket.

Tips & tricks

Some tips and tricks...

Appendix 1: Declaring buckets and vocabularies

The bucket statement declares a bucket:

bucket(name, type)
bucket("adl:geographic-locations", "spatial")
bucket("dlese:grade-ranges", "hierarchical")

name is the bucket's name; to avoid name clashes, standard procedure is to prefix bucket names as shown in the examples above. type is the bucket's type, and must be one of the declared bucket types; the built-in types are hierarchical, identification, numeric, relational, spatial, temporal, and textual (see Bucket types, above).

Appendix 2: Defining bucket types describes how additional bucket types can be defined.

One or more vocabularies must be associated with each hierarchical bucket. The vocabulary statement declares a vocabulary and associates it with one or more buckets:

vocabulary(name, buckets, terms)

    
    terms ::= [term, ...]
    term ::= str | (str, [term, ...])
vocabulary("ADL Object Type Thesaurus", "adl:types",
    [("cartographic works",
         ["maps"]),
     ("images",
         [("photographs",
              ["aerial photographs"]),
          ("remote-sensing images",
              ["aerial photographs"])])])

name is the vocabulary's name. buckets is either the name of a single bucket or a list of one or more names of buckets with which the vocabulary is to be associated. terms is a hierarchical listing of the vocabulary's terms. At the top level, the vocabulary is described by a list of zero or more terms. A term is a 2-tuple consisting of the term's name and a list of zero or more narrower terms, each of which is recursively described by a term. Terms that are leaves in the hierarchy can be described by simple names. The example vocabulary above has two top-level terms, cartographic works and images. cartographic works has one narrower term, maps, while images has two narrower terms, photographs and remote-sensing images, both of which have aerial photographs as a narrower term.
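The nested terms structure maps directly onto the termAncestorMap returned by getVocabulary. A sketch of the flattening (the function name is ours; a term appearing under more than one parent, like aerial photographs above, has its ancestor lists merged):

```python
def build_ancestor_map(terms, ancestors=None, result=None):
    # Flatten a nested term list into { term : [ancestor, ...] },
    # nearest ancestor first, as returned by getVocabulary.
    if ancestors is None:
        ancestors = []
    if result is None:
        result = {}
    for term in terms:
        name, narrower = term if isinstance(term, tuple) else (term, [])
        entry = result.setdefault(name, [])
        for a in ancestors:
            if a not in entry:          # merge: a term may appear
                entry.append(a)         # under more than one parent
        build_ancestor_map(narrower, [name] + ancestors, result)
    return result

build_ancestor_map([("Online", [("Image", ["TIFF"])])])
# ⇒ {"Online": [], "Image": ["Online"], "TIFF": ["Image", "Online"]}
```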

Appendix 2: Defining bucket types

The bucketType statement declares a bucket type:

bucketType(name, validator, encoder)
def _validate...
def _encode...
bucketType("spatial", _validate, _encode)

name is the name of the type. validator and encoder are functions that are called by ADL_mapper to respectively validate and encode values of the type.

The validation function should have the profile:

validator(bucket, field, value, strict) ⇒ retval

    
    field ::= (name, uri) | None
    retval ::= (field, value) | None

bucket is the name of the bucket the value is being mapped to and field is either a 2-tuple describing the source metadata field or None; both these arguments are passed in just for error-reporting purposes. value, a tuple, is the value to be validated; it may have any number of components, any or all of which may be None. In addition to validating the value, the function is free to perform any conversions or transformations desired. If the value is valid, the function should return a 2-tuple consisting of the field descriptor and the (possibly transformed) value tuple. If the value is not valid, and if strict, a boolean, is True, the function should call fatal; otherwise, the function should return None.

The encoding function should have the profile:

encoder(document, field, value) ⇒ node

document is the DOM node representing the entire output document. field and value are exactly as returned from the validation function. The function should return a new DOM node suitable for appending to the output document.
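A matching encoder can be sketched with xml.dom.minidom (the element and attribute names below are illustrative, not taken from ADL-bucket-report.dtd):

```python
from xml.dom.minidom import getDOMImplementation

def _encode_numeric(document, field, value):
    # Hypothetical encoder for a numeric-like type; the element
    # names are illustrative only.
    node = document.createElement("numeric-value")
    if field is not None:
        f = document.createElement("field")
        f.setAttribute("name", field[0])
        f.setAttribute("uri", field[1])
        node.appendChild(f)
    v = document.createElement("value")
    v.appendChild(document.createTextNode(value[0]))
    node.appendChild(v)
    return node

doc = getDOMImplementation().createDocument(None, "ADL-bucket-report", None)
doc.documentElement.appendChild(_encode_numeric(doc, None, ("3.2", "km")))
```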

The bucketType statement can appear anywhere, but by convention bucket types are defined by modules in the bucket_types directory. Inserting an import statement in bucket_types/__init__.py causes the module (i.e., bucket type) to be automatically loaded.

Statement index

appendConverter
appendPostfilter
appendPrefilter
bucket
bucketType
consolidateTextualValues
expectation
fatal
get
getParam
getSource
getVocabulary
input
map
mapConstant
namespace
output
prependConverter
prependPostfilter
prependPrefilter
present
requirement
setParam
strict
unmap
vocabulary
warning

last modified 2009-11-19 22:34