This document describes an XML- and HTTP-based protocol for accessing thesauri: structured, controlled vocabularies of words and phrases that represent conceptual categories.
The protocol is intended to let programmatic clients access and use existing thesauri, so its services are oriented around querying thesauri and navigating within them. The protocol does not support creation, maintenance, or sharing of thesauri, or mapping between thesauri.
The protocol's model of a thesaurus closely follows that of ANSI/NISO Z39.19-1993: Guidelines for the Construction, Format, and Management of Monolingual Thesauri.
A thesaurus is a set of terms and a set of standardized, reciprocal relations on those terms.
A term is a word or phrase that represents a conceptual category. A term may have an associated human-readable description, or scope note, that defines the concept represented by the term and indicates the term's intended usage. Other, arbitrary information may also be associated with a term, but such information is outside the protocol's scope.
There are two varieties of terms, preferred (or valid) and nonpreferred (or invalid or lead-in). Preferred terms participate in all the relations described below; nonpreferred terms participate in the equivalence relations only.
A pair of reciprocal hierarchical relations is the primary means by which thesauri are structured. The narrower (NT) relation relates a preferred term P to another preferred term C that is in some sense a subset of P: as suggested by Z39.19, the concept represented by C may be more specific than that of P, or C may be a component of the whole represented by P, or C may be an instance of the general class represented by P. The narrower relation must be non-reflexive (a term must not be narrower than itself), non-symmetric (two terms must not be mutually narrower than each other), and non-transitive (the narrower relation is logically transitive, that is, if G is narrower than C and C is in turn narrower than P then G is logically narrower than P, but transitive closures must not be reflected in the protocol; rather, they must be left to the client to deduce from first-order relations). The broader (BT) relation is the reciprocal of the narrower relation. A preferred term may be related to any number of broader and narrower terms. The directed graph induced by the narrower relation (equivalently, the broader relation) must be acyclic.
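Because the protocol reports first-order relations only, a client that needs every term logically narrower than a given term has to compute the closure itself. The following Python sketch shows one way to do that, along with a check of the acyclicity requirement. The function names and the dictionary representation of the narrower relation are assumptions made for illustration; they are not part of the protocol.

```python
from collections import deque

def all_narrower(narrower, start):
    """Return every term reachable from `start` by following first-order NT links."""
    seen = set()
    queue = deque(narrower.get(start, ()))
    while queue:
        term = queue.popleft()
        if term in seen:
            continue
        seen.add(term)
        queue.extend(narrower.get(term, ()))
    return seen

def is_acyclic(narrower):
    """True if no term can be reached from itself through NT links."""
    return all(term not in all_narrower(narrower, term) for term in narrower)

# First-order narrower relations only (hypothetical in-memory representation).
NT = {
    "streams": ["rivers"],
    "rivers": ["bends (river)", "rapids", "waterfalls"],
    "rapids": ["roaring rapids"],
}

print(sorted(all_narrower(NT, "streams")))
print(is_acyclic(NT))
```

The same functions apply to the broader relation if the dictionary maps each term to its immediate broader terms instead.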
The related (RT) relation relates a preferred term P to another preferred term Q that in some sense intersects P: the concepts represented by P and Q may overlap, or P and Q may be suggestive of each other. The relation must be non-reflexive (a term must not be related to itself) and symmetric (if P is related to Q then Q must be related to P). It is not transitive: if P is related to Q and Q is in turn related to R, P is not thereby related to R (symmetry and transitivity together would force every related term to be related to itself). A preferred term may be related by the related relation to any number of other preferred terms.
A pair of reciprocal equivalence relations ties equivalent terms together. The use-instead (USE) relation maps a nonpreferred term N to a preferred term P that is equivalent to N and that has been designated by the thesaurus as the preferred or canonical term to use in place of N. The used-for (UF) relation is the reciprocal relation that maps P to N. Every nonpreferred term N must be related to at least one preferred term; if more than one, the entire set of N's relations can optionally be designated as a conjunction if N is equivalent to the logical conjunction of the preferred terms.
Eight XML formats are utilized by the protocol. The XML elements
defined below all reside in namespace
"http://www.alexandria.ucsb.edu/thesaurus", but for
brevity we elide namespace declarations in this section. Complete
examples that include namespace declarations are given under Examples, below. An XML DTD that defines the XML
formats can be found in thesaurus-protocol.dtd;
thesaurus-protocol.xsd
is an equivalent XML schema.
<properties>
Describes overall properties of the thesaurus: the thesaurus's name and version; a human-readable description (which should include the thesaurus's scope and purpose as well as details on how the thesaurus implements the protocol); an indication of which query operators the thesaurus supports; and the URL of the thesaurus's XML schema for extended term descriptions. All properties but the supported query operators are optional.
<!ELEMENT properties (name?, version?, description?,
query-operators, extended-schema?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT extended-schema (#PCDATA)>
<!ELEMENT query-operators EMPTY>
<!ATTLIST query-operators
equals (true | false) #REQUIRED
contains-all-words (true | false) #REQUIRED
contains-any-words (true | false) #REQUIRED
matches-regexp (true | false) #REQUIRED>
For example:
<properties>
<name>ADL Feature Type Thesaurus</name>
<version>1.4</version>
<description>Thesaurus for...</description>
<query-operators
equals="true"
contains-all-words="true"
contains-any-words="true"
matches-regexp="false"/>
<extended-schema>http://...</extended-schema>
</properties>
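A client will typically inspect the properties before issuing queries, since a thesaurus need not support every operator. The Python sketch below reads the supported operators from the example above using the standard library's ElementTree. The function name is hypothetical, and the namespace is elided as it is elsewhere in this section.

```python
import xml.etree.ElementTree as ET

PROPERTIES_XML = """
<properties>
  <name>ADL Feature Type Thesaurus</name>
  <version>1.4</version>
  <description>Thesaurus for...</description>
  <query-operators
      equals="true"
      contains-all-words="true"
      contains-any-words="true"
      matches-regexp="false"/>
  <extended-schema>http://...</extended-schema>
</properties>
"""

def supported_operators(properties_xml):
    """Return the sorted names of the query operators flagged "true"."""
    root = ET.fromstring(properties_xml)
    ops = root.find("query-operators")  # the only required child element
    return sorted(name for name, value in ops.attrib.items() if value == "true")

print(supported_operators(PROPERTIES_XML))
```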
<term>
Briefly describes a term by its name and preferredness.
<!ELEMENT term (#PCDATA)>
<!ATTLIST term
preferred (true | false) "true">
Examples:
<term>rivers</term>
<term preferred="false">riverbanks</term>
<term-description>
More fully describes a term. In addition to the attributes
described under <term> above, the description
includes the term's immediate relationships to other terms in the
thesaurus and zero or more human-readable descriptive notes. The
<note> element's type attribute can be
used to indicate the type of note (scope note, historical note, etc.).
The <use-instead> element must be employed if and
only if the term is nonpreferred.
<!ELEMENT term-description (term, note*,
((broader, narrower, used-for, related) |
use-instead))>
<!ELEMENT note (#PCDATA)>
<!ATTLIST note
type CDATA #IMPLIED>
<!ELEMENT broader (term*)>
<!ELEMENT narrower (term*)>
<!ELEMENT used-for (term*)>
<!ELEMENT related (term*)>
<!ELEMENT use-instead (term+)>
<!ATTLIST use-instead
conjunction (true | false) "false">
Example description of a preferred term:
<term-description>
<term>rivers</term>
<note type="scope note">Flowing water...</note>
<broader>
<term>streams</term>
</broader>
<narrower>
<term>bends (river)</term>
<term>rapids</term>
<term>waterfalls</term>
</narrower>
<used-for>
<term preferred="false">rios</term>
</used-for>
<related>
<term>channels</term>
<term>guts</term>
</related>
</term-description>
Example description of a nonpreferred term:
<term-description>
<term preferred="false">rios</term>
<note type="scope note">Agua...</note>
<use-instead>
<term>rivers</term>
</use-instead>
</term-description>
Example description of a nonpreferred term that is equivalent to a conjunction of preferred terms:
<term-description>
<term preferred="false">dry stream beds</term>
<use-instead conjunction="true">
<term>streams</term>
<term>historical sites</term>
</use-instead>
</term-description>
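The three descriptions above can be reduced to a single client-side question: which preferred term or terms stand for this term? The Python sketch below answers it. The function name is an assumption, and the attribute defaults are applied by hand because ElementTree does not read defaults from a DTD.

```python
import xml.etree.ElementTree as ET

def preferred_equivalents(term_description_xml):
    """Return (preferred term names, is_conjunction) for a <term-description>."""
    root = ET.fromstring(term_description_xml)
    term = root.find("term")
    # The DTD defaults `preferred` to "true" when the attribute is omitted.
    if term.get("preferred", "true") == "true":
        return [term.text], False
    use = root.find("use-instead")  # present if and only if the term is nonpreferred
    targets = [t.text for t in use.findall("term")]
    # The DTD defaults `conjunction` to "false".
    return targets, use.get("conjunction", "false") == "true"

CONJUNCTION_XML = """
<term-description>
  <term preferred="false">dry stream beds</term>
  <use-instead conjunction="true">
    <term>streams</term>
    <term>historical sites</term>
  </use-instead>
</term-description>
"""

print(preferred_equivalents(CONJUNCTION_XML))
```

When the second element of the result is true, the nonpreferred term is equivalent to all of the returned terms taken together rather than to any one of them.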
<extended>
An optional, thesaurus-specific format that describes a single
term. The format is undefined by the protocol; the only requirements
are that the format's structure be described by an XML schema, and that
the URL of that schema be returned by the thesaurus's
get-properties service.
<!ELEMENT extended ANY>
<list>
A list of zero or more terms.
<!ELEMENT list (term* | term-description* |
extended*)>
For example:
<list>
<term>rivers</term>
<term preferred="false">river bends</term>
</list>
<hierarchy>
Describes the hierarchy of terms above (broader than) or below
(narrower than) a starting preferred term, including the starting term
itself. The hierarchy is indicated by the nesting of XML elements.
Specifically, each <node> element N
describes a term, and the <node> elements nested
immediately within N indicate the term's immediate broader
terms or immediate narrower terms, and so on recursively. If a term
appears multiple times in the hierarchy, each subsequent appearance
must be indicated by a <noderef> element that
refers back to the first occurrence's node.
The direction attribute indicates the direction of the
hierarchy. The max-levels attribute, an integer, is an
upper bound on the number of levels in the hierarchy (using the
convention that zero levels corresponds to just the starting term). A
negative value indicates that the hierarchy is unbounded.
<!ELEMENT hierarchy (node)>
<!ATTLIST hierarchy
direction (broader | narrower) #REQUIRED
max-levels CDATA #REQUIRED>
<!ELEMENT node ((term | term-description | extended),
(node | noderef)*)>
<!ATTLIST node
id ID #IMPLIED>
<!ELEMENT noderef EMPTY>
<!ATTLIST noderef
ref IDREF #REQUIRED>
For example:
<hierarchy direction="narrower" max-levels="-1">
<node>
<term>rivers</term>
<node>
<term>bends (river)</term>
</node>
<node>
<term>rapids</term>
<node>
<term>roaring rapids</term>
</node>
</node>
<node>
<term>waterfalls</term>
</node>
</node>
</hierarchy>
The same example, but with an upper bound placed on the hierarchy depth:
<hierarchy direction="narrower" max-levels="1">
<node>
<term>rivers</term>
<node>
<term>bends (river)</term>
</node>
<node>
<term>rapids</term>
</node>
<node>
<term>waterfalls</term>
</node>
</node>
</hierarchy>
The example below demonstrates the use of node references. Term "images" has two narrower terms, "photographs" and "remote-sensing images", both of which have "aerial photographs" as a narrower term.
<hierarchy direction="narrower" max-levels="-1">
<node>
<term>images</term>
<node>
<term>photographs</term>
<node id="n1">
<term>aerial photographs</term>
</node>
</node>
<node>
<term>remote-sensing images</term>
<noderef ref="n1"/>
</node>
</node>
</hierarchy>
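A client reading such a hierarchy has to resolve each node reference back to the node that carries the matching id. The Python sketch below extracts the set of parent/child pairs from the example above. It assumes that the nodes contain brief term records (format "term") and that the namespace is elided; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

HIERARCHY_XML = """
<hierarchy direction="narrower" max-levels="-1">
  <node>
    <term>images</term>
    <node>
      <term>photographs</term>
      <node id="n1"><term>aerial photographs</term></node>
    </node>
    <node>
      <term>remote-sensing images</term>
      <noderef ref="n1"/>
    </node>
  </node>
</hierarchy>
"""

def hierarchy_edges(hierarchy_xml):
    """Return the set of (parent term, child term) pairs in a <hierarchy>."""
    root = ET.fromstring(hierarchy_xml)
    # Index nodes that carry an id so that <noderef ref="..."> can be resolved.
    by_id = {n.get("id"): n for n in root.iter("node") if n.get("id")}
    edges = set()

    def name(node):
        return node.find("term").text

    def visit(node):
        for child in node:
            if child.tag == "noderef":
                # The referenced subtree was described at its first occurrence.
                edges.add((name(node), name(by_id[child.get("ref")])))
            elif child.tag == "node":
                edges.add((name(node), name(child)))
                visit(child)

    visit(root.find("node"))
    return edges

for parent, child in sorted(hierarchy_edges(HIERARCHY_XML)):
    print(parent, "->", child)
```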
<error>
Describes an invocation or processing error by a code and/or a human-readable description.
<!ELEMENT error (code?, description?)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT description (#PCDATA)>
For example:
<error>
<code>914</code>
<description>Bad input...</description>
</error>
<response>
Contains the response from a thesaurus service. The
version attribute indicates the version of the protocol
employed by the thesaurus, and must be "1.0".
<!ELEMENT response (properties | list | hierarchy |
error)>
<!ATTLIST response
version CDATA #REQUIRED>
For example:
<response version="1.0">
<list>
<term>rivers</term>
<term preferred="false">river bends</term>
</list>
</response>
The protocol provides five independent, stateless services. Each service follows the classical model of function invocation: zero or more arguments are passed to the service, the service executes synchronously, and a result is returned. In this section we describe the services abstractly. In the next section, HTTP binding, we describe the specific means by which the services are invoked over the HTTP protocol.
For clarity, in the descriptions below we depict the services as
returning certain nominal results. In actuality, the response from
each service is a <response> element containing
either the nominal result or an error.
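In client code this usually means unwrapping the response before looking at the result. The Python sketch below returns the nominal result element or raises an exception when the response carries an error. The exception class and function name are assumptions introduced for this example; the namespace is the one given under XML formats.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.alexandria.ucsb.edu/thesaurus}"

class ThesaurusError(Exception):
    """Hypothetical exception carrying the contents of an <error> element."""
    def __init__(self, code, description):
        super().__init__(f"{code}: {description}")
        self.code = code
        self.description = description

def unwrap(response_xml):
    """Return the nominal result element, or raise ThesaurusError."""
    root = ET.fromstring(response_xml)
    if root.tag != NS + "response":
        raise ValueError("not a thesaurus <response> document")
    payload = root[0]  # the DTD allows exactly one child element
    if payload.tag == NS + "error":
        # Both <code> and <description> are optional, so either may be None.
        raise ThesaurusError(payload.findtext(NS + "code"),
                             payload.findtext(NS + "description"))
    return payload

OK_XML = """
<response xmlns="http://www.alexandria.ucsb.edu/thesaurus" version="1.0">
  <list>
    <term>rivers</term>
    <term preferred="false">river bends</term>
  </list>
</response>
"""

result = unwrap(OK_XML)
print([t.text for t in result])
```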
<properties> <- get-properties()
Returns the thesaurus's properties.
<list> <- download(include-nonpreferred, format)
Returns a list of all terms in the thesaurus.
include-nonpreferred, a boolean, indicates if nonpreferred
terms should be included; if false, only preferred terms are returned.
format is the requested return format, and must be either
"term", "term-description", or
"extended".
<list> <- query(operator, text, fuzzy, format)
Queries the thesaurus by term name and returns a list of the matching terms. operator is the matching operator to employ, and must be one of:
equals
contains-all-words
contains-any-words
matches-regexp
text is the text to match, and is interpreted either as a
string (under the "equals" operator), as one or more
words separated by whitespace (under the
"contains-all-words" and
"contains-any-words" operators), or as a Perl-like
regular expression (under the "matches-regexp" operator).
fuzzy, a boolean, indicates if the matching should be
performed in a forgiving manner, e.g., by employing word stemming or
spelling correction. format is the requested return format,
and must be either "term",
"term-description", or "extended".
A query that produces zero matching terms must not be treated by
the thesaurus as an error. A query specifying a non-fuzzy
"equals" operator must be treated by the thesaurus as a
simple term lookup, i.e., the thesaurus must return either zero terms
or the one matching term.
The exact semantics of the query operators (exactly what sequence
of characters constitutes a word, if and how fuzziness is implemented,
etc.) are not defined by the protocol; the descriptions above
are intended to be a guideline. A thesaurus should document its
interpretation of the operators in the
<description> element of its properties.
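Since the operator semantics are left to each thesaurus, the following Python sketch shows only one possible non-fuzzy interpretation. Case-insensitive comparison, treating a word as a run of alphanumeric characters, and using Python's regular expression syntax in place of a Perl-like one are all assumptions of this sketch, not requirements of the protocol.

```python
import re

def words(text):
    """Split text into lowercase words (assumed here to be alphanumeric runs)."""
    return re.findall(r"\w+", text.lower())

def matches(operator, text, term_name):
    """Return True if term_name satisfies the non-fuzzy query (operator, text)."""
    if operator == "equals":
        return term_name.lower() == text.lower()
    if operator == "contains-all-words":
        return all(w in words(term_name) for w in words(text))
    if operator == "contains-any-words":
        return any(w in words(term_name) for w in words(text))
    if operator == "matches-regexp":
        return re.search(text, term_name) is not None
    raise ValueError("unknown operator: " + operator)

terms = ["bends (river)", "canal bends", "lost rivers", "rivers"]
print([t for t in terms if matches("contains-any-words", "river bends", t)])
print([t for t in terms if matches("contains-all-words", "river bends", t)])
```

Under this interpretation "lost rivers" does not match the word "river"; a fuzzy query that applies word stemming could reasonably include it.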
<hierarchy> <- get-broader(starting-term, max-levels, format)
Returns the hierarchy of terms above (broader than) a given
starting term. starting-term is the name of the starting
term, which must be a preferred term. max-levels, an
integer, is the maximum number of levels to include in the hierarchy.
A value of zero corresponds to just the starting term; a negative
value places no upper bound. format is the requested return
format, and must be either "term",
"term-description", or "extended".
<hierarchy> <- get-narrower([starting-term,] max-levels, format)
Returns the hierarchy of terms below (narrower than) a given
starting term. If specified and not the empty string,
starting-term is the name of the starting term, which must be
a preferred term. If absent or the empty string, the starting term is
the fictitious "root" term that is broader than all of the thesaurus's
top (broadest) preferred terms. max-levels, an integer, is
the maximum number of levels to include in the hierarchy. A value of
zero corresponds to just the starting term; a negative value places no
upper bound. format is the requested return format, and must
be either "term", "term-description", or
"extended".
A thesaurus service is invoked over the HTTP protocol by submitting an HTTP GET request to a base URL that represents the thesaurus's common access point for all services. The name of the service is appended to the base URL as the final path component, and arguments to the service are encoded and appended as URL query parameters. The signatures of the five services are as follows:
/get-properties
/download?
include-nonpreferred={true|false}&
format=format
/query?
operator=operator&
text=text&
fuzzy={true|false}&
format=format
/get-broader?
starting-term=name&
max-levels=n&
format=format
/get-narrower?
[starting-term=name&]
max-levels=n&
format=format
For example, to invoke the get-properties service of
the thesaurus located at base URL
"http://host.com/mythes/", a client would issue an HTTP
GET request to the URL:
http://host.com/mythes/get-properties
Complete examples can be found under Examples, below.
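The URL construction described above can be sketched in Python with the standard library's urllib.parse. The helper name and its convention of writing hyphenated parameter names with underscores are assumptions made for this example.

```python
from urllib.parse import quote, urlencode

def _encode(value):
    """Render booleans as the protocol's lowercase true/false."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def service_url(base_url, service, **args):
    """Build the GET URL for a service; keyword names use _ in place of -."""
    params = {k.replace("_", "-"): _encode(v) for k, v in args.items()}
    url = base_url.rstrip("/") + "/" + service
    if params:
        # quote_via=quote encodes spaces as %20 rather than +.
        url += "?" + urlencode(params, quote_via=quote)
    return url

BASE = "http://host.com/mythes/"
print(service_url(BASE, "get-properties"))
print(service_url(BASE, "get-broader",
                  starting_term="bends (river)", max_levels=-1, format="term"))
print(service_url(BASE, "query", operator="contains-any-words",
                  text="river bends", fuzzy=True, format="term"))
```

Omitting starting_term from a get-narrower call yields a request for the hierarchy below the fictitious root term.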
The HTTP response from a thesaurus service has MIME type
text/xml and consists of an XML document containing a
single <response> element.
Thesaurus services should generally return HTTP status code 200
(OK), and should use HTTP error codes only for low-level errors such
as connectivity and authentication problems. Higher-level errors
should be returned using the <error> element
described under XML formats, above.
In this section we present three complete examples of protocol service requests and responses.
Suppose first that a client would like brief term records for the
broadest terms in the thesaurus located at base URL
"http://host.com/mythes/". The client would issue the
following HTTP GET request:
http://host.com/mythes/get-narrower?max-levels=1&
format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 683
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<hierarchy direction="narrower" max-levels="1">
<node>
<term></term>
<node>
<term>administrative areas</term>
</node>
<node>
<term>hydrographic features</term>
</node>
<node>
<term>land parcels</term>
</node>
<node>
<term>manmade features</term>
</node>
<node>
<term>physiographic features</term>
</node>
<node>
<term>regions</term>
</node>
</node>
</hierarchy>
</response>
Suppose next that the client would like to find all terms containing the words "river" and/or "bends" in the thesaurus. The client would like the word matching to be forgiving, and would like brief term records returned. The client would issue the following HTTP GET request:
http://host.com/mythes/query?operator=contains-any-words&
text=river+bends&fuzzy=true&format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 572
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<list>
<term preferred="true">bends (river)</term>
<term preferred="false">canal bends</term>
<term preferred="false">lost rivers</term>
<term preferred="false">road bends</term>
<term preferred="false">river bends</term>
<term preferred="true">rivers</term>
<term preferred="false">stream bends</term>
<term preferred="false">wadi bends</term>
</list>
</response>
Finally, suppose that the client would like brief term records for the entire upward (broader) hierarchy of terms starting from term "bends (river)". The client would issue the following HTTP GET request:
http://host.com/mythes/get-broader?starting-term=
bends%20%28river%29&max-levels=-1&format=term
The thesaurus might respond with the following:
HTTP/1.0 200 OK
Content-Type: text/xml; charset=UTF-8
Content-Length: 421
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE response SYSTEM "http://www.alexandria...">
<response
xmlns="http://www.alexandria.ucsb.edu/thesaurus"
version="1.0">
<hierarchy direction="broader" max-levels="-1">
<node>
<term>bends (river)</term>
<node>
<term>rivers</term>
<node>
<term>streams</term>
</node>
</node>
</node>
</hierarchy>
</response>
created 2002-05-01; last modified 2009-01-14 00:24