Simple SQL-Based Text Search

Here's a simple SQL implementation of the text queries defined in ADL's query language and in the ADL Gazetteer Protocol's query language. No special indexes or database plug-ins are required.

Overview

Consider a table having the general structure:

CREATE TABLE mytable (
    id   INT          NOT NULL,
    text VARCHAR(...) NOT NULL
);

For each returnable item identified by id, multiple rows are inserted in mytable, one for each phrase and each suffix of each phrase associated with the item. Text queries are implemented by placing constraints against text, the structure and loading of which allow SQL LIKE operators to be used that avoid initial wildcards, thus allowing queries to benefit from a BTREE index on text.

Text preprocessing and loading

The text associated with a collection item or gazetteer entry must be preprocessed by splitting it into phrases and then splitting each phrase into words. (Words may also need to be uniformly upper- or lowercased, depending on the database implementation.) There's no formal definition of phrase; informally, it's a semantic unit of text on the order of a sentence. The functional importance of splitting text into phrases is that it prevents contains-phrase queries from matching text that crosses phrase boundaries. For gazetteer entries, each placename should be considered a separate phrase.

Consider a phrase consisting of the word sequence w1, ..., wN. Before inserting into text, the phrase must be converted to a string value by concatenating the words, separating them by delimiter characters, and adding a trailing delimiter character. If semicolons are used as delimiters, then the string value is "w1;...;wN;". This value must be inserted in text along with N-1 other values representing every possible nonempty partial suffix of the phrase. Specifically, for 2 ≤ i ≤ N, a value "wi;...;wN;" must be inserted.

For example, an item with identifier 314 and the associated text "Porcelle, porcelle, me inire sine! Non per comam men-men-menti!"1 might be parsed into two phrases and inserted as follows:

id text
314 Porcelle;porcelle;me;inire;sine;
314 porcelle;me;inire;sine;
314 me;inire;sine;
314 inire;sine;
314 sine;
314 Non;per;comam;men;men;menti;
314 per;comam;men;men;menti;
314 comam;men;men;menti;
314 men;men;menti;
314 men;menti;
314 menti;

To save storage, only the first M words of each phrase or suffix need be stored; this has the effect of establishing M as the size of the contains-phrase matching window. As an extreme case, setting M = 1 effectively disables contains-phrase queries. If equals queries are to be supported, whole phrases must be stored.

Queries

Query text must be processed as described above. In the following, let the query in question be the word sequence q1, ..., qN.

contains-any-words

SELECT id
  FROM
    mytable
  WHERE
    text LIKE '
q1;%' OR ... OR text LIKE 'qN;%'

contains-all-words

There are two query forms. If matching words may come from any phrase, as is the case with ADL queries:

SELECT t1.id
  FROM
    mytable
t1, ..., mytable tN
  ON
    
t2.id = t1.id AND ... AND tN.id = t1.id
  WHERE
    
t1.text LIKE 'q1;%' AND ... AND tN.text LIKE 'qN;%'

But if matching words must come from the same phrase (e.g., the same placename, as is the case with ADL Gazetteer Protocol queries), then phrases and their partial suffixes must be grouped. One way to do this is to add a column identifying the phrase within the item, as in:

id phrase text
314 1 Porcelle;porcelle;me;inire;sine;
314 1 porcelle;me;inire;sine;
314 1 me;inire;sine;
314 1 inire;sine;
314 1 sine;
314 2 Non;per;comam;men;men;menti;
314 2 per;comam;men;men;menti;
314 2 comam;men;men;menti;
314 2 men;men;menti;
314 2 men;menti;
314 2 menti;

The query then becomes:

SELECT t1.id
  FROM
    mytable
t1, ..., mytable tN
  ON
    (
t2.id = t1.id AND t2.phrase = t1.phrase) AND ... AND
    (
tN.id = t1.id AND tN.phrase = t1.phrase)
  WHERE
    
t1.text LIKE 'q1;%' AND ... AND tN.text LIKE 'qN;%'

contains-phrase

SELECT id
  FROM
    mytable
  WHERE
    text LIKE '
q1;...;qN;%'

equals

To perform equality matching, whole phrases need to be somehow distinguished from partial suffixes, and the query needs to match against whole phrases only. This can be done by creating an additional column, as in:

SELECT id
  FROM
    mytable
  WHERE
    whole_phrase = 1 AND text = '
q1;...;qN;'

Another approach is to distinguish the whole phrases themselves. For example, if an extra delimiter is added to the end of whole phrases (only), the query becomes:

SELECT id
  FROM
    mytable
  WHERE
    text = '
q1;...;qN;;'

matches-pattern

Pattern matching can be supported by translating query language wildcards ("*" and "?" in the case of the ADL Gazetteer Protocol) to corresponding SQL LIKE wildcards ("%" and "_"). If there is no leading wildcard, the query can generally take advantage of a BTREE index.

Additional processing must be done to translate word breaks to delimiters. Specifically, the query string must be processed just as text values are processed, except that wildcards should be considered to be word characters, not punctuation. In addition, the query must match against whole phrases only, as described under the equals operator above. For example, a pattern "san* ?a*a" could be implemented by a query:

SELECT id
  FROM
    mytable
  WHERE
    text LIKE 'san%;_a%a;;'

The first four of the following seven text values would match the pattern:

Santa;Barbara;;
Santa;Paula;;
San;Paglio;di;Rosa;;
Santo;de;la;Tortuga;;
Lake;Santa;Barbara;;
Santa;Barbara;Island;;
Santa;Clarita;;

1 Hint: porcelle = little pig

created 2003-11-18; last modified 2009-01-20 09:24