The ADL Thesaurus Protocol
Version 1.0
This document describes an XML- and HTTP-based protocol for
accessing thesauri: structured, controlled vocabularies of
words and phrases that represent conceptual categories.
The protocol is intended to give programmatic clients simple access
to existing thesauri, so the services it offers are oriented around
querying thesauri and navigating within them. The protocol does not
support creation, maintenance, or sharing of thesauri, or mapping
between thesauri.
The protocol's model of a thesaurus closely follows that of ANSI/NISO
Z39.19-1993: Guidelines for the Construction, Format, and Management
of Monolingual Thesauri.
A thesaurus is a set of terms and a set of standardized,
reciprocal relations on those terms.
A term is a word or phrase that represents a conceptual
category. A term may have an associated human-readable description,
or scope note, that defines the concept represented by the
term and indicates the term's intended usage. Other, arbitrary
information may also be associated with a term, but such information
is outside the protocol's scope.
There are two varieties of terms, preferred (or valid) and
nonpreferred (or invalid or lead-in). Preferred terms
participate in all the relations described below; nonpreferred terms
participate in the equivalence relations only.
A pair of reciprocal hierarchical relations is the primary means by
which thesauri are structured. The narrower (NT) relation
relates a preferred term P to another preferred term
C that is in some sense a subset of P: as suggested
by Z39.19, the concept represented by C may be more specific
than that of P, or C may be a component of the whole
represented by P, or C may be an instance of the
general class represented by P. The narrower relation must
be non-reflexive (a term must not be narrower than itself),
non-symmetric (two terms must not be mutually narrower than each
other), and non-transitive (the narrower relation is logically
transitive, that is, if G is narrower than C and
C is in turn narrower than P then G is
logically narrower than P, but transitive closures must not
be reflected in the protocol; rather, they must be left to the client
to deduce from first-order relations). The broader (BT)
relation is the reciprocal of the narrower relation. A preferred term
may be related to any number of broader and narrower terms. The
directed graph induced by the narrower relation (equivalently, the
broader relation) must be acyclic.
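Because transitive closures are left to the client, a client that
wants every term logically narrower than a given term must compute
them itself. The following is a minimal Python sketch of one way to
do so; the function name and the dictionary representation of the
first-order relations are assumptions made for illustration and are
not part of the protocol.

```python
def narrower_closure(narrower):
    """Return, for each term, the set of all terms logically narrower than it.

    `narrower` maps a preferred term to its immediate (first-order)
    narrower terms. Raises ValueError if the relation contains a cycle,
    which the thesaurus model forbids.
    """
    closure = {}
    in_progress = set()

    def visit(term):
        if term in closure:
            return closure[term]
        if term in in_progress:
            raise ValueError(f"cycle through term {term!r}")
        in_progress.add(term)
        result = set()
        for child in narrower.get(term, ()):
            result.add(child)
            result |= visit(child)
        in_progress.discard(term)
        closure[term] = result
        return result

    for term in list(narrower):
        visit(term)
    return closure


# First-order relations only, as a thesaurus would report them.
first_order = {
    "streams": ["rivers"],
    "rivers": ["bends (river)", "rapids", "waterfalls"],
}
closure = narrower_closure(first_order)
```

Here closure["streams"] contains "rivers" together with the three
terms narrower than "rivers", even though only first-order relations
were supplied.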
The related (RT) relation relates a preferred term
P to another preferred term Q that in some sense
intersects P: the concepts represented by P and
Q may overlap, or P and Q may be suggestive
of each other. The relation must be non-reflexive (a term must not be
related to itself) and symmetric (if P is related to Q
then Q must be related to P). It is not transitive: if
P is related to Q and Q is in turn related
to R, P is not thereby related to R. A
preferred term may be related to any number of other preferred terms.
A pair of reciprocal equivalence relations ties equivalent terms
together. The use-instead (USE) relation maps a nonpreferred
term N to a preferred term P that is equivalent to
N and that has been designated by the thesaurus as the
preferred or canonical term to use in place of N. The
used-for (UF) relation is the reciprocal relation that maps
P to N. Every nonpreferred term N must be
related to at least one preferred term; if more than one, the entire
set of N's relations can optionally be designated as a
conjunction if N is equivalent to the logical conjunction of
the preferred terms.
The protocol uses eight XML formats. The XML elements
defined below all reside in namespace
"http://www.alexandria.ucsb.edu/thesaurus", but for
brevity we elide namespace declarations in this section. Complete
examples that include namespace declarations are given under
Examples, below. An XML DTD that defines the XML formats can be
found in thesaurus-protocol.dtd; thesaurus-protocol.xsd is an
equivalent XML schema.
<properties>
-
Describes overall properties of the thesaurus: the thesaurus's name
and version; a human-readable description (which should include the
thesaurus's scope and purpose as well as details on how the thesaurus
implements the protocol); an indication of which query operators the
thesaurus supports; and the URL of the thesaurus's XML schema for
extended term descriptions. All properties but the supported query
operators are optional.
<!ELEMENT properties (name?, version?, description?,
query-operators, extended-schema?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT extended-schema (#PCDATA)>
<!ELEMENT query-operators EMPTY>
<!ATTLIST query-operators
equals (true | false) #REQUIRED
contains-all-words (true | false) #REQUIRED
contains-any-words (true | false) #REQUIRED
matches-regexp (true | false) #REQUIRED>
For example:
<properties>
<name>ADL Feature Type Thesaurus</name>
<version>1.4</version>
<description>Thesaurus for...</description>
<query-operators
equals="true"
contains-all-words="true"
contains-any-words="true"
matches-regexp="false"/>
<extended-schema>http://...</extended-schema>
</properties>
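A client will typically inspect the properties before issuing
queries, so that it does not request an unsupported operator. The
sketch below, in Python with the standard xml.etree.ElementTree
module, shows one way to do that. The function name is hypothetical,
and namespaces are elided as elsewhere in this section.

```python
import xml.etree.ElementTree as ET

PROPERTIES_XML = """
<properties>
  <name>ADL Feature Type Thesaurus</name>
  <version>1.4</version>
  <description>Thesaurus for...</description>
  <query-operators
      equals="true"
      contains-all-words="true"
      contains-any-words="true"
      matches-regexp="false"/>
  <extended-schema>http://...</extended-schema>
</properties>
"""


def supported_operators(properties):
    """Return the names of the query operators flagged "true", sorted."""
    ops = properties.find("query-operators")
    return sorted(name for name, value in ops.attrib.items() if value == "true")


properties = ET.fromstring(PROPERTIES_XML)
operators = supported_operators(properties)
```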
<term>
-
Briefly describes a term by its name and preferredness.
<!ELEMENT term (#PCDATA)>
<!ATTLIST term
preferred (true | false) "true">
Examples:
<term>rivers</term>
<term preferred="false">riverbanks</term>
<term-description>
-
More fully describes a term. In addition to the attributes
described under <term> above, the description
includes the term's immediate relationships to other terms in the
thesaurus and zero or more human-readable descriptive notes. The
<note> element's type attribute can be
used to indicate the type of note (scope note, historical note, etc.).
The <use-instead> element must be employed if and
only if the term is nonpreferred.
<!ELEMENT term-description (term, note*,
((broader, narrower, used-for, related) |
use-instead))>
<!ELEMENT note (#PCDATA)>
<!ATTLIST note
type CDATA #IMPLIED>
<!ELEMENT broader (term*)>
<!ELEMENT narrower (term*)>
<!ELEMENT used-for (term*)>
<!ELEMENT related (term*)>
<!ELEMENT use-instead (term+)>
<!ATTLIST use-instead
conjunction (true | false) "false">
Example description of a preferred term:
<term-description>
<term>rivers</term>
<note type="scope note">Flowing water...</note>
<broader>
<term>streams</term>
</broader>
<narrower>
<term>bends (river)</term>
<term>rapids</term>
<term>waterfalls</term>
</narrower>
<used-for>
<term preferred="false">rios</term>
</used-for>
<related>
<term>channels</term>
<term>guts</term>
</related>
</term-description>
Example description of a nonpreferred term:
<term-description>
<term preferred="false">rios</term>
<note type="scope note">Agua...</note>
<use-instead>
<term>rivers</term>
</use-instead>
</term-description>
Example description of a nonpreferred term that is equivalent to a
conjunction of preferred terms:
<term-description>
<term preferred="false">dry stream beds</term>
<use-instead conjunction="true">
<term>streams</term>
<term>historical sites</term>
</use-instead>
</term-description>
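A client that encounters a nonpreferred term usually wants the
preferred terms to use in its place. The following Python sketch
extracts them from a term description; the function name and return
shape are assumptions made for illustration. Because
xml.etree.ElementTree does not apply DTD attribute defaults, the
sketch supplies the defaults for preferred and conjunction itself.

```python
import xml.etree.ElementTree as ET


def preferred_equivalents(term_description):
    """Return (names, is_conjunction) for a nonpreferred term.

    Returns None if the described term is itself preferred.
    """
    term = term_description.find("term")
    # The DTD default for "preferred" is "true".
    if term.get("preferred", "true") == "true":
        return None
    use = term_description.find("use-instead")
    names = [t.text for t in use.findall("term")]
    # The DTD default for "conjunction" is "false".
    return names, use.get("conjunction", "false") == "true"


description = ET.fromstring("""
<term-description>
  <term preferred="false">dry stream beds</term>
  <use-instead conjunction="true">
    <term>streams</term>
    <term>historical sites</term>
  </use-instead>
</term-description>
""")
result = preferred_equivalents(description)
```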
<extended>
-
An optional, thesaurus-specific format that describes a single
term. The format is undefined by the protocol; the only requirement
is that the format's structure be described by an XML schema, and that
the URL of that schema be returned by the thesaurus's
get-properties service.
<!ELEMENT extended ANY>
<list>
-
A list of zero or more terms.
<!ELEMENT list (term* | term-description* |
extended*)>
For example:
<list>
<term>rivers</term>
<term preferred="false">river bends</term>
</list>
<hierarchy>
-
Describes the hierarchy of terms above (broader than) or below
(narrower than) a starting preferred term, including the starting term
itself. The hierarchy is indicated by the nesting of XML elements.
Specifically, each <node> element N
describes a term, and the <node> elements nested
immediately within N indicate the term's immediate broader
terms or immediate narrower terms, and so on recursively. If a term
appears multiple times in the hierarchy, each subsequent appearance
must be indicated by a <noderef> element that
refers back to the first occurrence's node.
The direction attribute indicates the direction of the
hierarchy. The max-levels attribute, an integer, is an
upper bound on the number of levels in the hierarchy (using the
convention that zero levels corresponds to just the starting term). A
negative value indicates that the hierarchy is unbounded.
<!ELEMENT hierarchy (node)>
<!ATTLIST hierarchy
direction (broader | narrower) #REQUIRED
max-levels CDATA #REQUIRED>
<!ELEMENT node ((term | term-description | extended),
(node | noderef)*)>
<!ATTLIST node
id ID #IMPLIED>
<!ELEMENT noderef EMPTY>
<!ATTLIST noderef
ref IDREF #REQUIRED>
For example:
<hierarchy direction="narrower" max-levels="-1">
<node>
<term>rivers</term>
<node>
<term>bends (river)</term>
</node>
<node>
<term>rapids</term>
<node>
<term>roaring rapids</term>
</node>
</node>
<node>
<term>waterfalls</term>
</node>
</node>
</hierarchy>
The same example, but with an upper bound placed on the hierarchy
depth:
<hierarchy direction="narrower" max-levels="1">
<node>
<term>rivers</term>
<node>
<term>bends (river)</term>
</node>
<node>
<term>rapids</term>
</node>
<node>
<term>waterfalls</term>
</node>
</node>
</hierarchy>
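The level-counting convention can be illustrated with a short Python
sketch that measures a returned hierarchy and compares it with the
max-levels attribute. The function names are hypothetical, and for
brevity the sketch counts only nested <node> elements and ignores
<noderef> elements.

```python
import xml.etree.ElementTree as ET


def levels_below(node):
    """Number of levels beneath `node`; zero for a node with no child nodes."""
    children = node.findall("node")
    if not children:
        return 0
    return 1 + max(levels_below(child) for child in children)


def respects_bound(hierarchy):
    """True if the hierarchy is no deeper than its max-levels attribute allows."""
    max_levels = int(hierarchy.get("max-levels"))
    if max_levels < 0:  # a negative value means unbounded
        return True
    return levels_below(hierarchy.find("node")) <= max_levels


bounded = ET.fromstring("""
<hierarchy direction="narrower" max-levels="1">
  <node>
    <term>rivers</term>
    <node><term>bends (river)</term></node>
    <node><term>rapids</term></node>
    <node><term>waterfalls</term></node>
  </node>
</hierarchy>
""")
ok = respects_bound(bounded)
```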
The example below demonstrates the use of node references. Term
"images" has two narrower terms, "photographs" and "remote-sensing
images", both of which have "aerial photographs" as a narrower
term.
<hierarchy direction="narrower" max-levels="-1">
<node>
<term>images</term>
<node>
<term>photographs</term>
<node id="n1">
<term>aerial photographs</term>
</node>
</node>
<node>
<term>remote-sensing images</term>
<noderef ref="n1"/>
</node>
</node>
</hierarchy>
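A client reading such a hierarchy must follow each <noderef> back to
the node it names. The Python sketch below does so while building a
map from each term to its immediate children. The function name is an
assumption made for illustration, and the sketch assumes the "term"
format, so each node carries a <term> element.

```python
import xml.etree.ElementTree as ET

HIERARCHY_XML = """
<hierarchy direction="narrower" max-levels="-1">
  <node>
    <term>images</term>
    <node>
      <term>photographs</term>
      <node id="n1">
        <term>aerial photographs</term>
      </node>
    </node>
    <node>
      <term>remote-sensing images</term>
      <noderef ref="n1"/>
    </node>
  </node>
</hierarchy>
"""


def children_by_term(hierarchy):
    """Map each term name to the names of its immediate child terms."""
    # Index the nodes that carry an id so that references can be resolved.
    by_id = {n.get("id"): n for n in hierarchy.iter("node") if n.get("id")}
    result = {}
    for node in hierarchy.iter("node"):
        children = []
        for child in node:
            if child.tag == "node":
                children.append(child.find("term").text)
            elif child.tag == "noderef":
                target = by_id[child.get("ref")]
                children.append(target.find("term").text)
        result[node.find("term").text] = children
    return result


children = children_by_term(ET.fromstring(HIERARCHY_XML))
```

Both "photographs" and "remote-sensing images" map to "aerial
photographs", although the latter is given only by reference.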
<error>
-
Describes an invocation or processing error by a code and/or a
human-readable description.
<!ELEMENT error (code?, description?)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT description (#PCDATA)>
For example:
<error>
<code>914</code>
<description>Bad input...</description>
</error>
<response>
-
Contains the response from a thesaurus service. The
version attribute indicates the version of the protocol
employed by the thesaurus, and must be "1.0".
<!ELEMENT response (properties | list | hierarchy |
error)>
<!ATTLIST response
version CDATA #REQUIRED>
For example:
<response version="1.0">
<list>
<term>rivers</term>
<term preferred="false">river bends</term>
</list>
</response>
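Since every response wraps either a result or an error, a client can
unwrap responses in one place. The following Python sketch is one
possible approach; the names unwrap and ThesaurusError are
hypothetical. Unlike the abbreviated examples in this section, the
sketch expects the namespace declaration to be present, as it is in
real responses.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.alexandria.ucsb.edu/thesaurus}"


class ThesaurusError(Exception):
    """Raised when a response carries an <error> element (hypothetical name)."""

    def __init__(self, code, description):
        super().__init__(f"{code}: {description}")
        self.code = code
        self.description = description


def unwrap(xml_text):
    """Return the element inside <response>, or raise ThesaurusError."""
    response = ET.fromstring(xml_text)
    if response.tag != NS + "response":
        raise ValueError(f"unexpected root element {response.tag!r}")
    if response.get("version") != "1.0":
        raise ValueError("unsupported protocol version")
    payload = response[0]
    if payload.tag == NS + "error":
        # Both <code> and <description> are optional, so either may be None.
        raise ThesaurusError(
            payload.findtext(NS + "code"),
            payload.findtext(NS + "description"),
        )
    return payload


payload = unwrap(
    '<response xmlns="http://www.alexandria.ucsb.edu/thesaurus" version="1.0">'
    '<list><term>rivers</term></list></response>'
)
```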
The protocol provides five independent, stateless services. Each
service follows the classical model of function invocation: zero or
more arguments are passed to the service, the service executes
synchronously, and a result is returned. In this section we describe
the services abstractly. In the next section, HTTP binding, we describe the specific means
by which the services are invoked over the HTTP protocol.
For clarity, in the descriptions below we depict the services as
returning certain nominal results. In actuality, the response from
each service is a <response> element containing
either the nominal result or an error.
properties <- get-properties()
-
Returns the thesaurus's properties.
list <- download(include-nonpreferred, format)
-
Returns a list of all terms in the thesaurus.
include-nonpreferred, a boolean, indicates if nonpreferred
terms should be included; if false, only preferred terms are returned.
format is the requested return format, and must be either
"term", "term-description", or
"extended".
list <- query(operator, text, fuzzy, format)
-
Queries the thesaurus by term name and returns a list of the
matching terms. operator is the matching operator to employ,
and must be one of:
equals
contains-all-words
contains-any-words
matches-regexp
text is the text to match, and is interpreted either as a
string (under the "equals" operator), as one or more
words separated by whitespace (under the
"contains-all-words" and
"contains-any-words" operators), or as a Perl-like
regular expression (under the "matches-regexp" operator).
fuzzy, a boolean, indicates if the matching should be
performed in a forgiving manner, e.g., by employing word stemming or
spelling correction. format is the requested return format,
and must be either "term",
"term-description", or "extended".
A query that produces zero matching terms must not be treated by
the thesaurus as an error. A query specifying a non-fuzzy
"equals" operator must be treated by the thesaurus as a
simple term lookup, i.e., the thesaurus must return either zero terms
or the one matching term.
The exact semantics of the query operators (exactly what sequence
of characters constitutes a word, if and how fuzziness is implemented,
etc.) are not defined by the protocol; the descriptions above
are intended to be a guideline. A thesaurus should document its
interpretation of the operators in the
<description> element of its properties.
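To make the operators concrete, the Python sketch below shows one
possible non-fuzzy interpretation. It is an illustration only, not
the semantics of any particular thesaurus: the function name is
hypothetical, words are assumed to be whitespace-delimited and
case-sensitive, and Python's re module stands in for a Perl-like
regular expression engine.

```python
import re


def term_matches(operator, text, term):
    """Return True if `term` matches `text` under `operator` (one interpretation)."""
    if operator == "equals":
        return term == text
    if operator == "matches-regexp":
        return re.search(text, term) is not None
    wanted = text.split()          # assumed: words are whitespace-delimited
    present = set(term.split())
    if operator == "contains-all-words":
        return all(word in present for word in wanted)
    if operator == "contains-any-words":
        return any(word in present for word in wanted)
    raise ValueError(f"unknown operator: {operator!r}")


TERMS = ["rivers", "river bends", "road bends", "lost rivers"]
hits = [t for t in TERMS if term_matches("contains-any-words", "river bends", t)]
```

Under this strict interpretation "rivers" does not match the word
"river"; a thesaurus that honors the fuzzy argument with word
stemming might well return it.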
hierarchy <- get-broader(starting-term, max-levels, format)
-
Returns the hierarchy of terms above (broader than) a given
starting term. starting-term is the name of the starting
term, which must be a preferred term. max-levels, an
integer, is the maximum number of levels to include in the hierarchy.
A value of zero corresponds to just the starting term; a negative
value places no upper bound. format is the requested return
format, and must be either "term",
"term-description", or "extended".
hierarchy <- get-narrower([starting-term,] max-levels, format)
-
Returns the hierarchy of terms below (narrower than) a given
starting term. If specified and not the empty string,
starting-term is the name of the starting term, which must be
a preferred term. If absent or the empty string, the starting term is
the fictitious "root" term that is broader than all of the thesaurus's
top (broadest) preferred terms. max-levels, an integer, is
the maximum number of levels to include in the hierarchy. A value of
zero corresponds to just the starting term; a negative value places no
upper bound. format is the requested return format, and must
be either "term", "term-description", or
"extended".
A thesaurus service is invoked over the HTTP protocol by
submitting an HTTP GET request to a base URL that represents the
thesaurus's common access point for all services. The name of the
service is appended to the base URL as the final path component, and
arguments to the service are encoded and appended as URL query
parameters. The signatures of the five services are as follows:
/get-properties
/download?
include-nonpreferred={true|false}&
format=format
/query?
operator=operator&
text=text&
fuzzy={true|false}&
format=format
/get-broader?
starting-term=name&
max-levels=n&
format=format
/get-narrower?
[starting-term=name&]
max-levels=n&
format=format
For example, to invoke the get-properties service of
the thesaurus located at base URL
"http://host.com/mythes/", a client would issue an HTTP
GET request to the URL:
http://host.com/mythes/get-properties
Complete examples can be found under Examples, below.
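Request URLs of this form can be assembled with a small helper. The
Python sketch below is one possible approach: the function name and
its keyword-argument convention (underscores standing in for the
hyphens in parameter names) are assumptions made for illustration.
Spaces are encoded as %20; the "+" form is an equally valid encoding
of a space in a query string.

```python
from urllib.parse import quote, urlencode


def service_url(base_url, service, **arguments):
    """Build the GET URL for a thesaurus service.

    Keyword names use underscores in place of hyphens, booleans become
    "true"/"false", and arguments whose value is None are omitted.
    """
    params = {}
    for name, value in arguments.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name.replace("_", "-")] = str(value)
    url = base_url.rstrip("/") + "/" + service
    if params:
        url += "?" + urlencode(params, quote_via=quote)
    return url


url = service_url(
    "http://host.com/mythes/",
    "get-broader",
    starting_term="bends (river)",
    max_levels=-1,
    format="term",
)
```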
The HTTP response from a thesaurus service has MIME type
text/xml and consists of an XML document containing a
single <response> element.
Thesaurus services should generally return HTTP status code 200
(OK), and should use HTTP error codes only for low-level errors such
as connectivity and authentication problems. Higher-level errors
should be returned using the <error> element
described under XML formats, above.
In this section we present three complete examples of protocol
service requests and responses.
Suppose first that a client would like brief term records for the
broadest terms in the thesaurus located at base URL
"http://host.com/mythes/". The client would issue the
following HTTP GET request:
http://host.com/mythes/get-narrower?max-levels=1&
format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 683
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<hierarchy direction="narrower" max-levels="1">
<node>
<term></term>
<node>
<term>administrative areas</term>
</node>
<node>
<term>hydrographic features</term>
</node>
<node>
<term>land parcels</term>
</node>
<node>
<term>manmade features</term>
</node>
<node>
<term>physiographic features</term>
</node>
<node>
<term>regions</term>
</node>
</node>
</hierarchy>
</response>
Suppose next that the client would like to find all terms
containing the words "river" and/or "bends" in the thesaurus. The
client would like the word matching to be forgiving, and would like
brief term records returned. The client would issue the following
HTTP GET request:
http://host.com/mythes/query?operator=contains-any-words&
text=river+bends&fuzzy=true&format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 572
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<list>
<term preferred="true">bends (river)</term>
<term preferred="false">canal bends</term>
<term preferred="false">lost rivers</term>
<term preferred="false">road bends</term>
<term preferred="false">river bends</term>
<term preferred="true">rivers</term>
<term preferred="false">stream bends</term>
<term preferred="false">wadi bends</term>
</list>
</response>
Finally, suppose that the client would like brief term records for
the entire upward (broader) hierarchy of terms starting from term
"bends (river)". The client would issue the following HTTP GET
request:
http://host.com/mythes/get-broader?starting-term=
bends%20%28river%29&max-levels=-1&format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 421
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<hierarchy direction="broader" max-levels="-1">
<node>
<term>bends (river)</term>
<node>
<term>rivers</term>
<node>
<term>streams</term>
</node>
</node>
</node>
</hierarchy>
</response>
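A client might flatten such a response into the chains of terms that
lead from the starting term to the broadest terms. The Python sketch
below is one way to do so for the response above (XML declaration and
DOCTYPE omitted). The function name is hypothetical, and for brevity
the sketch follows only nested <node> elements, not <noderef>
elements.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.alexandria.ucsb.edu/thesaurus}"

RESPONSE_XML = """
<response
    xmlns="http://www.alexandria.ucsb.edu/thesaurus"
    version="1.0">
  <hierarchy direction="broader" max-levels="-1">
    <node>
      <term>bends (river)</term>
      <node>
        <term>rivers</term>
        <node>
          <term>streams</term>
        </node>
      </node>
    </node>
  </hierarchy>
</response>
"""


def chains(node):
    """Return every chain of term names from `node` to a childless node."""
    name = node.findtext(NS + "term")
    children = node.findall(NS + "node")
    if not children:
        return [[name]]
    return [[name] + rest for child in children for rest in chains(child)]


response = ET.fromstring(RESPONSE_XML)
hierarchy = response.find(NS + "hierarchy")
broader_chains = chains(hierarchy.find(NS + "node"))
```

A term with several broader terms would yield one chain per route to
the top of the thesaurus.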
- 1.0
- Original version.
Greg Janée
Created: 2002-05-01
Last modified: 2008-02-28 11:32