The ADL Thesaurus Protocol

Greg Janée, Satoshi Ikeda, Linda L. Hill
Alexandria Digital Library Project

Version 1.0

Introduction

This document describes an XML- and HTTP-based protocol for accessing thesauri: structured, controlled vocabularies of words and phrases that represent conceptual categories.

The protocol is intended to give programmatic clients easy access to existing thesauri, and thus the services it offers are oriented around querying thesauri and navigating within them. The protocol does not support creation, maintenance, or sharing of thesauri, or mapping between thesauri.

Definitions

The protocol's model of a thesaurus closely follows that of ANSI/NISO Z39.19-1993: Guidelines for the Construction, Format, and Management of Monolingual Thesauri.

A thesaurus is a set of terms and a set of standardized, reciprocal relations on those terms.

A term is a word or phrase that represents a conceptual category. A term may have an associated human-readable description, or scope note, that defines the concept represented by the term and indicates the term's intended usage. Other, arbitrary information may also be associated with a term, but such information is outside the protocol's scope.

There are two varieties of terms, preferred (or valid) and nonpreferred (or invalid or lead-in). Preferred terms participate in all the relations described below; nonpreferred terms participate in the equivalence relations only.

A pair of reciprocal hierarchical relations is the primary means by which thesauri are structured. The narrower (NT) relation relates a preferred term P to another preferred term C that is in some sense a subset of P: as suggested by Z39.19, the concept represented by C may be more specific than that of P, or C may be a component of the whole represented by P, or C may be an instance of the general class represented by P. The broader (BT) relation is the reciprocal of the narrower relation. A preferred term may be related to any number of broader and narrower terms.

The narrower relation must satisfy the following constraints:

  • Non-reflexive: a term must not be narrower than itself.
  • Non-symmetric: two terms must not be mutually narrower than each other.
  • Non-transitive: the narrower relation is logically transitive (if G is narrower than C and C is in turn narrower than P, then G is logically narrower than P), but transitive closures must not be reflected in the protocol; rather, they must be left to the client to deduce from first-order relations.
  • Acyclic: the directed graph induced by the narrower relation (equivalently, the broader relation) must be acyclic.
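Because only first-order relations are exposed, a client that needs every term logically narrower than a given term has to compute the closure itself. The following is a minimal sketch of that deduction, together with an acyclicity check. The dictionary layout and the function names are assumptions made for illustration; they are not part of the protocol.

```python
def all_narrower(narrower, start):
    """Return every term reachable from `start` via first-order narrower links."""
    seen = set()
    stack = list(narrower.get(start, ()))
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(narrower.get(term, ()))
    return seen


def is_acyclic(narrower):
    """True if no term is (transitively) narrower than itself."""
    return all(term not in all_narrower(narrower, term) for term in narrower)


# Hypothetical first-order NT relations: term -> immediate narrower terms.
NARROWER = {
    "streams": ["rivers"],
    "rivers": ["bends (river)", "rapids", "waterfalls"],
    "rapids": ["roaring rapids"],
}

print(sorted(all_narrower(NARROWER, "streams")))
print(is_acyclic(NARROWER))
```

The same functions apply to the broader relation if the mapping is inverted.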

The related (RT) relation relates a preferred term P to another preferred term Q that in some sense intersects P: the concepts represented by P and Q may overlap, or P and Q may be suggestive of each other. The relation must be non-reflexive (a term must not be related to itself) and symmetric (if P is related to Q then Q must be related to P). It is not transitive: if P is related to Q and Q is in turn related to R, P is not thereby related to R. (A relation that is both symmetric and transitive would relate every term that has a related term to itself, which would contradict non-reflexivity.) A preferred term may be related by the related relation to any number of other preferred terms.
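A client or thesaurus maintainer might verify the non-reflexivity and symmetry constraints along the following lines. This is only a sketch; the mapping from a term to the set of its related terms is an assumed in-memory representation, and the function name is hypothetical.

```python
def check_related(related):
    """Return a list of violations of the RT constraints (empty if none)."""
    problems = []
    for term, others in related.items():
        for other in others:
            if other == term:
                problems.append(f"{term!r} is related to itself")
            elif term not in related.get(other, ()):
                problems.append(f"{term!r} -> {other!r} has no reciprocal")
    return problems


# Hypothetical RT relations: term -> set of related terms.
RELATED = {
    "rivers": {"channels", "guts"},
    "channels": {"rivers"},
    "guts": {"rivers"},
}

print(check_related(RELATED))
```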

A pair of reciprocal equivalence relations ties equivalent terms together. The use-instead (USE) relation maps a nonpreferred term N to a preferred term P that is equivalent to N and that has been designated by the thesaurus as the preferred or canonical term to use in place of N. The used-for (UF) relation is the reciprocal relation that maps P to N. Every nonpreferred term N must be related to at least one preferred term; if more than one, the entire set of N's relations can optionally be designated as a conjunction if N is equivalent to the logical conjunction of the preferred terms.

XML formats

Eight XML formats are utilized by the protocol. The XML elements defined below all reside in namespace "http://www.alexandria.ucsb.edu/thesaurus", but for brevity we elide namespace declarations in this section. Complete examples that include namespace declarations are given under Examples, below. An XML DTD that defines the XML formats can be found in thesaurus-protocol.dtd; thesaurus-protocol.xsd is an equivalent XML schema.

<properties>

Describes overall properties of the thesaurus: the thesaurus's name and version; a human-readable description (which should include the thesaurus's scope and purpose as well as details on how the thesaurus implements the protocol); an indication of which query operators the thesaurus supports; and the URL of the thesaurus's XML schema for extended term descriptions. All properties but the supported query operators are optional.

<!ELEMENT properties (name?, version?, description?,
  query-operators, extended-schema?)>

<!ELEMENT name (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT extended-schema (#PCDATA)>

<!ELEMENT query-operators EMPTY>
  <!ATTLIST query-operators
    equals             (true | false) #REQUIRED
    contains-all-words (true | false) #REQUIRED
    contains-any-words (true | false) #REQUIRED
    matches-regexp     (true | false) #REQUIRED>

For example:

<properties>
  <name>ADL Feature Type Thesaurus</name>
  <version>1.4</version>
  <description>Thesaurus for...</description>
  <query-operators
    equals="true"
    contains-all-words="true"
    contains-any-words="true"
    matches-regexp="false"/>
  <extended-schema>http://...</extended-schema>
</properties>

<term>

Briefly describes a term by its name and preferredness.

<!ELEMENT term (#PCDATA)>
  <!ATTLIST term
    preferred (true | false) "true">

Examples:

<term>rivers</term>

<term preferred="false">riverbanks</term>

<term-description>

More fully describes a term. In addition to the attributes described under <term> above, the description includes the term's immediate relationships to other terms in the thesaurus and zero or more human-readable descriptive notes. The <note> element's type attribute can be used to indicate the type of note (scope note, historical note, etc.). The <use-instead> element must be employed if and only if the term is nonpreferred.

<!ELEMENT term-description (term, note*,
  ((broader, narrower, used-for, related) |
   use-instead))>

<!ELEMENT note (#PCDATA)>
  <!ATTLIST note
    type CDATA #IMPLIED>

<!ELEMENT broader (term*)>
<!ELEMENT narrower (term*)>
<!ELEMENT used-for (term*)>
<!ELEMENT related (term*)>

<!ELEMENT use-instead (term+)>
  <!ATTLIST use-instead
    conjunction (true | false) "false">

Example description of a preferred term:

<term-description>
  <term>rivers</term>
  <note type="scope note">Flowing water...</note>
  <broader>
    <term>streams</term>
  </broader>
  <narrower>
    <term>bends (river)</term>
    <term>rapids</term>
    <term>waterfalls</term>
  </narrower>
  <used-for>
    <term preferred="false">rios</term>
  </used-for>
  <related>
    <term>channels</term>
    <term>guts</term>
  </related>
</term-description>
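A client might read a term description into a small record as sketched below. The sketch uses Python's standard xml.etree.ElementTree module and, like this section, elides the namespace. The TermDescription class and its field names are assumptions made for illustration. Because the parser does not read the DTD, the attribute defaults declared there are applied by hand.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

RELATIONS = ("broader", "narrower", "used-for", "related", "use-instead")


@dataclass
class TermDescription:
    name: str
    preferred: bool
    notes: list = field(default_factory=list)      # (type, text) pairs
    relations: dict = field(default_factory=dict)  # relation -> term names
    conjunction: bool = False


def parse_term_description(xml_text):
    root = ET.fromstring(xml_text)
    term = root.find("term")
    desc = TermDescription(
        name=term.text or "",
        # The DTD defaults `preferred` to "true" when the attribute is absent.
        preferred=term.get("preferred", "true") == "true",
        notes=[(n.get("type"), n.text or "") for n in root.findall("note")],
    )
    for rel in RELATIONS:
        elem = root.find(rel)
        if elem is not None:
            desc.relations[rel] = [t.text or "" for t in elem.findall("term")]
            if rel == "use-instead":
                # The DTD defaults `conjunction` to "false".
                desc.conjunction = elem.get("conjunction", "false") == "true"
    return desc


EXAMPLE = """<term-description>
  <term>rivers</term>
  <note type="scope note">Flowing water...</note>
  <broader><term>streams</term></broader>
  <narrower><term>rapids</term><term>waterfalls</term></narrower>
  <used-for><term preferred="false">rios</term></used-for>
  <related><term>channels</term></related>
</term-description>"""

desc = parse_term_description(EXAMPLE)
print(desc.name, desc.preferred, desc.relations["narrower"])
```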

Example description of a nonpreferred term:

<term-description>
  <term preferred="false">rios</term>
  <note type="scope note">Agua...</note>
  <use-instead>
    <term>rivers</term>
  </use-instead>
</term-description>

Example description of a nonpreferred term that is equivalent to a conjunction of preferred terms:

<term-description>
  <term preferred="false">dry stream beds</term>
  <use-instead conjunction="true">
    <term>streams</term>
    <term>historical sites</term>
  </use-instead>
</term-description>

<extended>

An optional, thesaurus-specific format that describes a single term. The format is undefined by the protocol; the only requirements are that the structure of the extended description be described by an XML schema, and that the URL of that schema be returned by the thesaurus's get-properties service.

<!ELEMENT extended ANY>

<list>

A list of zero or more terms.

<!ELEMENT list (term* | term-description* |
  extended*)>

For example:

<list>
  <term>rivers</term>
  <term preferred="false">river bends</term>
</list>

<hierarchy>

Describes the hierarchy of terms above (broader than) or below (narrower than) a starting preferred term, including the starting term itself. The hierarchy is indicated by the nesting of XML elements. Specifically, each <node> element N describes a term, and the <node> elements nested immediately within N indicate the term's immediate broader terms or immediate narrower terms, and so on recursively. If a term appears multiple times in the hierarchy, each subsequent appearance must be indicated by a <noderef> element that refers back to the first occurrence's node.

The direction attribute indicates the direction of the hierarchy. The max-levels attribute, an integer, is an upper bound on the number of levels in the hierarchy (using the convention that zero levels corresponds to just the starting term). A negative value indicates that the hierarchy is unbounded.

<!ELEMENT hierarchy (node)>
  <!ATTLIST hierarchy
    direction (broader | narrower) #REQUIRED
    max-levels CDATA #REQUIRED>

<!ELEMENT node ((term | term-description | extended),
  (node | noderef)*)>
  <!ATTLIST node
    id ID #IMPLIED>

<!ELEMENT noderef EMPTY>
  <!ATTLIST noderef
    ref IDREF #REQUIRED>

For example:

<hierarchy direction="narrower" max-levels="-1">
  <node>
    <term>rivers</term>
    <node>
      <term>bends (river)</term>
    </node>
    <node>
      <term>rapids</term>
      <node>
        <term>roaring rapids</term>
      </node>
    </node>
    <node>
      <term>waterfalls</term>
    </node>
  </node>
</hierarchy>

The same example, but with an upper bound placed on the hierarchy depth:

<hierarchy direction="narrower" max-levels="1">
  <node>
    <term>rivers</term>
    <node>
      <term>bends (river)</term>
    </node>
    <node>
      <term>rapids</term>
    </node>
    <node>
      <term>waterfalls</term>
    </node>
  </node>
</hierarchy>
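A client might walk a hierarchy recursively, as in the sketch below, which flattens the nesting into (depth, term name) pairs and resolves any <noderef> elements through the id attribute of the node they refer to. The sketch assumes the brief "term" format and an elided namespace; the function name and the sample input are illustrative only. Since the hierarchy must be acyclic, following node references cannot loop.

```python
import xml.etree.ElementTree as ET


def flatten(hierarchy_xml):
    """Return (depth, term name) pairs in document order, following noderefs."""
    root = ET.fromstring(hierarchy_xml)
    # Index nodes that carry an id so that <noderef ref="..."> can be resolved.
    by_id = {n.get("id"): n for n in root.iter("node") if n.get("id")}
    rows = []

    def visit(node, depth):
        rows.append((depth, node.findtext("term") or ""))
        for child in node:
            if child.tag == "node":
                visit(child, depth + 1)
            elif child.tag == "noderef":
                visit(by_id[child.get("ref")], depth + 1)

    visit(root.find("node"), 0)
    return rows


EXAMPLE = """<hierarchy direction="narrower" max-levels="-1">
  <node>
    <term>images</term>
    <node>
      <term>photographs</term>
      <node id="n1"><term>aerial photographs</term></node>
    </node>
    <node>
      <term>remote-sensing images</term>
      <noderef ref="n1"/>
    </node>
  </node>
</hierarchy>"""

for depth, name in flatten(EXAMPLE):
    print("  " * depth + name)
```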

The example below demonstrates the use of node references. Term "images" has two narrower terms, "photographs" and "remote-sensing images", both of which have "aerial photographs" as a narrower term.

<hierarchy direction="narrower" max-levels="-1">
  <node>
    <term>images</term>
    <node>
      <term>photographs</term>
      <node id="n1">
        <term>aerial photographs</term>
      </node>
    </node>
    <node>
      <term>remote-sensing images</term>
      <noderef ref="n1"/>
    </node>
  </node>
</hierarchy>

<error>

Describes an invocation or processing error by a code and/or a human-readable description.

<!ELEMENT error (code?, description?)>

<!ELEMENT code (#PCDATA)>
<!ELEMENT description (#PCDATA)>

For example:

<error>
  <code>914</code>
  <description>Bad input...</description>
</error>

<response>

Contains the response from a thesaurus service. The version attribute indicates the version of the protocol employed by the thesaurus, and must be "1.0".

<!ELEMENT response (properties | list | hierarchy |
  error)>
  <!ATTLIST response
    version CDATA #REQUIRED>

For example:

<response version="1.0">
  <list>
    <term>rivers</term>
    <term preferred="false">river bends</term>
  </list>
</response>

Services

The protocol provides five independent, stateless services. Each service follows the classical model of function invocation: zero or more arguments are passed to the service, the service executes synchronously, and a result is returned. In this section we describe the services abstractly. In the next section, HTTP binding, we describe the specific means by which the services are invoked over the HTTP protocol.

For clarity, in the descriptions below we depict the services as returning certain nominal results. In actuality, the response from each service is a <response> element containing either the nominal result or an error.

properties <- get-properties()

Returns the thesaurus's properties.
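Since a thesaurus need not support every query operator, a client may wish to inspect the returned properties before issuing a query. A minimal sketch, assuming the <properties> element has already been extracted from the response and that the namespace is elided; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

OPERATORS = (
    "equals",
    "contains-all-words",
    "contains-any-words",
    "matches-regexp",
)


def supported_operators(properties_xml):
    """Return the set of query operators a <properties> element marks as true."""
    ops = ET.fromstring(properties_xml).find("query-operators")
    return {name for name in OPERATORS if ops.get(name) == "true"}


PROPERTIES = """<properties>
  <name>ADL Feature Type Thesaurus</name>
  <query-operators
    equals="true"
    contains-all-words="true"
    contains-any-words="true"
    matches-regexp="false"/>
</properties>"""

print(sorted(supported_operators(PROPERTIES)))
```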

list <- download(include-nonpreferred, format)

Returns a list of all terms in the thesaurus. include-nonpreferred, a boolean, indicates if nonpreferred terms should be included; if false, only preferred terms are returned. format is the requested return format, and must be either "term", "term-description", or "extended".

list <- query(operator, text, fuzzy, format)

Queries the thesaurus by term name and returns a list of the matching terms. operator is the matching operator to employ, and must be one of:

  • equals
  • contains-all-words
  • contains-any-words
  • matches-regexp

text is the text to match, and is interpreted either as a string (under the "equals" operator), as one or more words separated by whitespace (under the "contains-all-words" and "contains-any-words" operators), or as a Perl-like regular expression (under the "matches-regexp" operator). fuzzy, a boolean, indicates if the matching should be performed in a forgiving manner, e.g., by employing word stemming or spelling correction. format is the requested return format, and must be either "term", "term-description", or "extended".

A query that produces zero matching terms must not be treated by the thesaurus as an error. A query specifying a non-fuzzy "equals" operator must be treated by the thesaurus as a simple term lookup, i.e., the thesaurus must return either zero terms or the one matching term.

The exact semantics of the query operators (exactly what sequence of characters constitutes a word, if and how fuzziness is implemented, etc.) are not defined by the protocol; the descriptions above are intended to be a guideline. A thesaurus should document its interpretation of the operators in the <description> element of its properties.
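To make the guideline concrete, the sketch below shows one possible non-fuzzy interpretation of the four operators over an in-memory list of term names. Everything here is an assumption rather than a requirement of the protocol: matching is case-insensitive for the word and equality operators, a word is taken to be a run of letters, digits, or underscores, and Python's re module stands in for Perl-like regular expressions.

```python
import re


def matches(operator, text, term):
    """One possible non-fuzzy reading of the four query operators."""
    if operator == "equals":
        return term.lower() == text.lower()
    if operator == "matches-regexp":
        return re.search(text, term) is not None
    wanted = set(text.lower().split())
    # Assumption: a word is a maximal run of letters, digits, or underscores.
    present = set(re.findall(r"\w+", term.lower()))
    if operator == "contains-all-words":
        return wanted <= present
    if operator == "contains-any-words":
        return bool(wanted & present)
    raise ValueError(f"unknown operator: {operator}")


def query(terms, operator, text):
    """Return the terms that match, preserving their original order."""
    return [t for t in terms if matches(operator, text, t)]


TERMS = ["bends (river)", "river bends", "rivers", "canal bends"]

print(query(TERMS, "contains-all-words", "river bends"))
print(query(TERMS, "contains-any-words", "river bends"))
```

Under this interpretation "rivers" does not match the word "river"; a fuzzy implementation that applies stemming presumably would match it.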

hierarchy <- get-broader(starting-term, max-levels, format)

Returns the hierarchy of terms above (broader than) a given starting term. starting-term is the name of the starting term, which must be a preferred term. max-levels, an integer, is the maximum number of levels to include in the hierarchy. A value of zero corresponds to just the starting term; a negative value places no upper bound. format is the requested return format, and must be either "term", "term-description", or "extended".

hierarchy <- get-narrower([starting-term,] max-levels, format)

Returns the hierarchy of terms below (narrower than) a given starting term. If specified and not the empty string, starting-term is the name of the starting term, which must be a preferred term. If absent or the empty string, the starting term is the fictitious "root" term that is broader than all of the thesaurus's top (broadest) preferred terms. max-levels, an integer, is the maximum number of levels to include in the hierarchy. A value of zero corresponds to just the starting term; a negative value places no upper bound. format is the requested return format, and must be either "term", "term-description", or "extended".
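The effect of max-levels, including the zero and negative cases, can be illustrated with a short recursive sketch. The in-memory mapping from a term to its immediate narrower terms, the use of the empty string as the key for the fictitious root, and the function name are all assumptions. For brevity the sketch repeats a term that is reachable along several paths instead of emitting a node reference.

```python
def build_hierarchy(narrower, start, max_levels):
    """Return (term, children) nested tuples, at most max_levels deep.

    A negative max_levels places no bound on the depth, because the
    counter then never reaches zero.
    """
    if max_levels == 0:
        return (start, [])
    children = [
        build_hierarchy(narrower, child, max_levels - 1)
        for child in narrower.get(start, ())
    ]
    return (start, children)


# Hypothetical data; "" stands for the fictitious root term.
NARROWER = {
    "": ["rivers"],
    "rivers": ["bends (river)", "rapids", "waterfalls"],
    "rapids": ["roaring rapids"],
}

print(build_hierarchy(NARROWER, "rivers", 1))
```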

HTTP binding

A thesaurus service is invoked over the HTTP protocol by submitting an HTTP GET request to a base URL that represents the thesaurus's common access point for all services. The name of the service is appended to the base URL as the final path component, and arguments to the service are encoded and appended as URL query parameters. The signatures of the five services are as follows:

/get-properties

/download?
  include-nonpreferred={true|false}&
  format=format

/query?
  operator=operator&
  text=text&
  fuzzy={true|false}&
  format=format

/get-broader?
  starting-term=name&
  max-levels=n&
  format=format

/get-narrower?
  [starting-term=name&]
  max-levels=n&
  format=format

For example, to invoke the get-properties service of the thesaurus located at base URL "http://host.com/mythes/", a client would issue an HTTP GET request to the URL:

http://host.com/mythes/get-properties
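A client could assemble such URLs with a small helper like the one sketched below, which uses urlencode and quote from Python's standard urllib.parse module. The helper's name and its keyword-argument convention (underscores standing in for the hyphens in parameter names) are assumptions made for illustration.

```python
from urllib.parse import quote, urlencode


def service_url(base_url, service, **params):
    """Append the service name and URL-encoded arguments to the base URL."""
    url = base_url.rstrip("/") + "/" + service
    if params:
        pairs = {}
        for key, value in params.items():
            # The protocol spells booleans in lowercase.
            if isinstance(value, bool):
                value = "true" if value else "false"
            # Python keyword names cannot contain hyphens, so underscores
            # are mapped to the hyphens used in the protocol's parameters.
            pairs[key.replace("_", "-")] = value
        url += "?" + urlencode(pairs, quote_via=quote)
    return url


BASE = "http://host.com/mythes/"

print(service_url(BASE, "get-properties"))
print(service_url(BASE, "get-broader",
                  starting_term="bends (river)", max_levels=-1, format="term"))
```

With quote_via=quote a space is encoded as %20; the default quote_plus would encode it as +, which is an equally valid form in a query string.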

Complete examples can be found under Examples, below.

The HTTP response from a thesaurus service has MIME type text/xml and consists of an XML document containing a single <response> element.

Thesaurus services should generally return HTTP status code 200 (OK), and should use HTTP error codes only for low-level errors such as connectivity and authentication problems. Higher-level errors should be returned using the <error> element described under XML formats, above.
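Because higher-level errors arrive with status code 200, a client has to inspect the body of every response. The sketch below separates the nominal result from an <error> element; it assumes the response body has already been read into a string, and the exception class and function name are hypothetical. Element names are qualified with the protocol's namespace, as they would be in a complete response.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.alexandria.ucsb.edu/thesaurus}"


class ThesaurusError(Exception):
    """Raised when a <response> carries an <error> instead of a result."""


def unwrap(response_xml):
    """Return the single result element inside <response>, or raise."""
    root = ET.fromstring(response_xml)
    if root.tag != NS + "response":
        raise ValueError(f"unexpected root element: {root.tag}")
    result = root[0]
    if result.tag == NS + "error":
        # Both children of <error> are optional, so either may be None.
        code = result.findtext(NS + "code")
        description = result.findtext(NS + "description")
        raise ThesaurusError(f"{code}: {description}")
    return result


OK = """<response xmlns="http://www.alexandria.ucsb.edu/thesaurus" version="1.0">
  <list><term>rivers</term></list>
</response>"""

ERROR = """<response xmlns="http://www.alexandria.ucsb.edu/thesaurus" version="1.0">
  <error><code>914</code><description>Bad input...</description></error>
</response>"""

print([t.text for t in unwrap(OK).findall(NS + "term")])
```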

Examples

In this section we present three complete examples of protocol service requests and responses.

Suppose first that a client would like brief term records for the broadest terms in the thesaurus located at base URL "http://host.com/mythes/". The client would issue the following HTTP GET request:

http://host.com/mythes/get-narrower?max-levels=1&
format=term

The thesaurus might respond with the following:

HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 683

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
  xmlns="http://www.alexandria.ucsb.edu/thesaurus"
  version="1.0">
  <hierarchy direction="narrower" max-levels="1">
    <node>
      <term></term>
      <node>
        <term>administrative areas</term>
      </node>
      <node>
        <term>hydrographic features</term>
      </node>
      <node>
        <term>land parcels</term>
      </node>
      <node>
        <term>manmade features</term>
      </node>
      <node>
        <term>physiographic features</term>
      </node>
      <node>
        <term>regions</term>
      </node>
    </node>
  </hierarchy>
</response>

Suppose next that the client would like to find all terms containing the words "river" and/or "bends" in the thesaurus. The client would like the word matching to be forgiving, and would like brief term records returned. The client would issue the following HTTP GET request:

http://host.com/mythes/query?operator=contains-any-words&
text=river+bends&fuzzy=true&format=term

The thesaurus might respond with the following:

HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 572

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
  xmlns="http://www.alexandria.ucsb.edu/thesaurus"
  version="1.0">
  <list>
    <term preferred="true">bends (river)</term>
    <term preferred="false">canal bends</term>
    <term preferred="false">lost rivers</term>
    <term preferred="false">road bends</term>
    <term preferred="false">river bends</term>
    <term preferred="true">rivers</term>
    <term preferred="false">stream bends</term>
    <term preferred="false">wadi bends</term>
  </list>
</response>

Finally, suppose that the client would like brief term records for the entire upward (broader) hierarchy of terms starting from term "bends (river)". The client would issue the following HTTP GET request:

http://host.com/mythes/get-broader?starting-term=
bends%20%28river%29&max-levels=-1&format=term

The thesaurus might respond with the following:

HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 421

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
  xmlns="http://www.alexandria.ucsb.edu/thesaurus"
  version="1.0">
  <hierarchy direction="broader" max-levels="-1">
    <node>
      <term>bends (river)</term>
      <node>
        <term>rivers</term>
        <node>
          <term>streams</term>
        </node>
      </node>
    </node>
  </hierarchy>
</response>

Revision history

1.0
Original version.

created 2002-05-01; last modified 2009-01-14 00:24