Here's a simple SQL implementation of the text queries defined in ADL's query language and in the ADL Gazetteer Protocol's query language. No special indexes or database plug-ins are required.
Consider a table having the general structure:
CREATE TABLE mytable (
id INT NOT NULL,
text VARCHAR(...) NOT NULL
);
For each returnable item identified by id, multiple rows are inserted in mytable, one for each phrase and each suffix of each phrase associated with the item. Text queries are implemented by placing constraints against text, the structure and loading of which allow SQL LIKE operators to be used that avoid initial wildcards, thus allowing queries to benefit from a BTREE index on text.
The text associated with a collection item or gazetteer entry must
be preprocessed by splitting it into phrases and then splitting each
phrase into words. (Words may also need to be uniformly upper- or
lowercased, depending on the database implementation.) There's no
formal definition of phrase; informally, it's a semantic unit
of text on the order of a sentence. The functional importance of
splitting text into phrases is that it prevents
contains-phrase queries from matching text that crosses
phrase boundaries. For gazetteer entries, each placename should be
considered a separate phrase.
Consider a phrase consisting of the word sequence
w1, ..., wN. Before inserting
into text, the phrase must be converted to a string
value by concatenating the words, separating them by delimiter
characters, and adding a trailing delimiter character. If semicolons
are used as delimiters, then the string value is
"w1;...;wN;".
This value must be inserted in text along with
N-1 other values representing every possible nonempty partial
suffix of the phrase. Specifically, for
2 ≤ i ≤ N, a value
"wi;...;wN;"
must be inserted.
For example, an item with identifier 314 and the associated text "Porcelle, porcelle, me inire sine! Non per comam men-men-menti!"1 might be parsed into two phrases and inserted as follows:
| id | text |
|---|---|
314 |
Porcelle;porcelle;me;inire;sine; |
314 |
porcelle;me;inire;sine; |
314 |
me;inire;sine; |
314 |
inire;sine; |
314 |
sine; |
314 |
Non;per;comam;men;men;menti; |
314 |
per;comam;men;men;menti; |
314 |
comam;men;men;menti; |
314 |
men;men;menti; |
314 |
men;menti; |
314 |
menti; |
To save storage, only the first M words of each phrase or
suffix need be stored; this has the effect of establishing M
as the size of the contains-phrase matching window. As
an extreme case, setting M = 1 effectively disables
contains-phrase queries. If equals queries
are to be supported, whole phrases must be stored.
Query text must be processed as described above. In the following, let the query in question be the word sequence q1, ..., qN.
contains-any-wordsSELECT idq1
FROM
mytable
WHERE
text LIKE ';%' OR ... OR text LIKE 'qN;%'
contains-all-wordsThere are two query forms. If matching words may come from any phrase, as is the case with ADL queries:
SELECT t1.idt1
FROM
mytable , ..., mytable tNt2
ON
.id = t1.id AND ... AND tN.id = t1.idt1
WHERE
.text LIKE 'q1;%' AND ... AND tN.text LIKE 'qN;%'
But if matching words must come from the same phrase (e.g., the same placename, as is the case with ADL Gazetteer Protocol queries), then phrases and their partial suffixes must be grouped. One way to do this is to add a column identifying the phrase within the item, as in:
| id | phrase | text |
|---|---|---|
314 |
1 |
Porcelle;porcelle;me;inire;sine; |
314 |
1 |
porcelle;me;inire;sine; |
314 |
1 |
me;inire;sine; |
314 |
1 |
inire;sine; |
314 |
1 |
sine; |
314 |
2 |
Non;per;comam;men;men;menti; |
314 |
2 |
per;comam;men;men;menti; |
314 |
2 |
comam;men;men;menti; |
314 |
2 |
men;men;menti; |
314 |
2 |
men;menti; |
314 |
2 |
menti; |
The query then becomes:
SELECT t1.idt1
FROM
mytable , ..., mytable tNt2
ON
(.id = t1.id AND t2.phrase = t1.phrase) AND ... ANDtN
(.id = t1.id AND tN.phrase = t1.phrase)t1
WHERE
.text LIKE 'q1;%' AND ... AND tN.text LIKE 'qN;%'
contains-phraseSELECT idq1
FROM
mytable
WHERE
text LIKE ';...;qN;%'
equalsTo perform equality matching, whole phrases need to be somehow distinguished from partial suffixes, and the query needs to match against whole phrases only. This can be done by creating an additional column, as in:
SELECT idq1
FROM
mytable
WHERE
whole_phrase = 1 AND text = ';...;qN;'
Another approach is to distinguish the whole phrases themselves. For example, if an extra delimiter is added to the end of whole phrases (only), the query becomes:
SELECT idq1
FROM
mytable
WHERE
text = ';...;qN;;'
matches-patternPattern matching can be supported by translating query language
wildcards ("*" and "?" in the case of the
ADL Gazetteer Protocol) to corresponding SQL LIKE wildcards
("%" and "_"). If there is no leading
wildcard, the query can generally take advantage of a BTREE index.
Additional processing must be done to translate word breaks to
delimiters. Specifically, the query string must be processed just as
text values are processed, except that wildcards
should be considered to be word characters, not punctuation. In
addition, the query must match against whole phrases only, as
described under the equals operator above. For example,
a pattern "san* ?a*a" could be
implemented by a query:
SELECT id
FROM
mytable
WHERE
text LIKE 'san%;_a%a;;'
The first four of the following seven text values would match the pattern:
Santa;Barbara;;
Santa;Paula;;
San;Paglio;di;Rosa;;
Santo;de;la;Tortuga;;
Lake;Santa;Barbara;;
Santa;Barbara;Island;;
Santa;Clarita;;
1 Hint: porcelle = little pig
created 2003-11-18; last modified 2009-01-20 09:24