modules/paradigms/Textual_MySQLFulltext.py
SYNOPSIS
Textual_MySQLFulltext (table, idColumn, textColumns, cardinality,
regexpPhraseFilter=1,
mapping=TextUtils.mappings.nonAlphanumericToWhitespace,
deleteList=TextUtils.deleteLists.keepAll)
table
A table to query, e.g., "holding".
idColumn
The table's identifier column (i.e., the column to be
selected), e.g., "holding_id".
textColumns
A single table column (e.g., "subject_text") or a list
of one or more table columns (e.g., ["subject_text",
"assigned_terms"]) containing the text to search over.
cardinality
A Cardinality object representing the cardinality of
'table' with respect to the column or columns listed in
'textColumns'.
regexpPhraseFilter
A boolean that indicates whether "contains-phrase"
constraints are to be translated as "contains-all-words"
constraints conjoined with REGEXP conditions. Defaults
to true; see DESCRIPTION.
mapping
A Python character mapping table (i.e., a string of
length 256, indexed by ASCII character code) to process
constraint text with. Defaults to
'nonAlphanumericToWhitespace', which maps
non-alphanumeric characters to whitespace (i.e., to word
separators).
deleteList
A string of zero or more characters to delete from
constraint text. Defaults to 'keepAll' (the empty string),
which keeps all characters.
DESCRIPTION
Translates a textual constraint to a MySQL fulltext index
search. The returned query has the general form
    SELECT idColumn FROM table
    WHERE MATCH (textColumns, ...)
    AGAINST ('expression' IN BOOLEAN MODE)
where 'expression' is a string expression whose form depends on
the constraint operator. In the following, let W1, W2, W3, ...,
Wn be the words formed from the constraint text T by 1) deleting
from T any characters that appear in 'deleteList'; 2) mapping
the remaining characters using 'mapping'; and 3) treating
sequences of whitespace characters as word separators. Then the
query expression is:
    contains-any-words
        W1 W2 W3 ... Wn
    contains-all-words
        +W1 +W2 +W3 ... +Wn
    contains-phrase
        "W1 W2 W3 ... Wn"
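The word-formation steps and the three expression forms above can be sketched as follows. This is a minimal illustration in modern Python 3, not the module's actual code: the function names are hypothetical, the translation table is an assumed stand-in for 'TextUtils.mappings.nonAlphanumericToWhitespace', and the use of ValueError for the "no query words specified" condition is likewise an assumption.

```python
import string

_ALNUM = string.ascii_letters + string.digits


def make_non_alnum_to_space_table():
    # Assumed stand-in for TextUtils.mappings.nonAlphanumericToWhitespace:
    # every character code 0..255 that is not an ASCII letter or digit
    # is mapped to a space (i.e., to a word separator).
    return {code: " " for code in range(256) if chr(code) not in _ALNUM}


def form_words(text, mapping, delete_list=""):
    # 1) Delete from the text any characters that appear in delete_list.
    kept = "".join(ch for ch in text if ch not in delete_list)
    # 2) Map the remaining characters using the mapping table.
    mapped = kept.translate(mapping)
    # 3) Treat sequences of whitespace characters as word separators.
    words = mapped.split()
    if not words:
        # Assumption: the real module's exception type is not documented.
        raise ValueError("no query words specified")
    return words


def form_any_expression(words):
    # contains-any-words: W1 W2 W3 ... Wn
    return " ".join(words)


def form_all_expression(words):
    # contains-all-words: +W1 +W2 +W3 ... +Wn
    return " ".join("+" + word for word in words)


def form_phrase_expression(words):
    # contains-phrase: "W1 W2 W3 ... Wn"
    return '"' + " ".join(words) + '"'
```

Under these assumptions, the constraint text "Santa-Barbara, CA" yields the words Santa, Barbara, and CA, since the hyphen and comma are mapped to word separators before splitting.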
Note that MySQL's phrase matching (as of version 4.1.0alpha) is
essentially simple substring matching, and thus will have poor
recall unless the text in the table has been appropriately
processed beforehand (namely, adjacent words within a phrase
must be separated by exactly one space). However, if the
'regexpPhraseFilter' argument is true and the query phrase
contains more than one word, then the returned query has the
alternate form
    SELECT idColumn FROM table
    WHERE MATCH (textColumns, ...)
    AGAINST ('+W1 +W2 +W3 ... +Wn' IN BOOLEAN MODE) AND
    (textColumn1 REGEXP
    '[[:<:]]W1[[:space:]]+W2[[:space:]]+...Wn[[:>:]]'
    OR textColumn2 REGEXP
    '[[:<:]]W1[[:space:]]+W2[[:space:]]+...Wn[[:>:]]'
    OR ...)
That is, the REGEXP filter is more forgiving: it allows adjacent
words within a phrase to be separated by one or more whitespace
characters.
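How the phrase query might be assembled, with and without the REGEXP filter, is sketched below. The function names and the escaping rule are hypothetical illustrations and are not taken from the module's source (the behavior of the real '_formRegexp' and '_protectRegexpSpecials' is not documented here). The sketch also assumes the words contain no single quotes or other characters that need SQL string escaping, which holds under the default mapping.

```python
def protect_regexp_specials(word):
    # Assumed behavior: backslash-escape regular expression metacharacters.
    specials = r"\.^$*+?()[]{}|"
    return "".join("\\" + ch if ch in specials else ch for ch in word)


def form_regexp(words):
    # [[:<:]] and [[:>:]] are word-boundary markers; adjacent words may be
    # separated by one or more whitespace characters.
    body = "[[:space:]]+".join(protect_regexp_specials(w) for w in words)
    return "[[:<:]]" + body + "[[:>:]]"


def form_phrase_query(table, id_column, text_columns, words,
                      regexp_phrase_filter=True):
    # Accept either a single column name or a list of column names.
    if isinstance(text_columns, str):
        text_columns = [text_columns]
    columns = ", ".join(text_columns)
    select = "SELECT %s FROM %s WHERE " % (id_column, table)

    if regexp_phrase_filter and len(words) > 1:
        # Alternate form: all words required, then one REGEXP filter
        # per text column, joined by OR.
        expression = " ".join("+" + w for w in words)
        regexp = form_regexp(words)
        filters = " OR ".join(
            "%s REGEXP '%s'" % (column, regexp) for column in text_columns
        )
        return "%sMATCH (%s) AGAINST ('%s' IN BOOLEAN MODE) AND (%s)" % (
            select, columns, expression, filters)

    # General form: a quoted phrase in boolean mode.
    expression = '"' + " ".join(words) + '"'
    return "%sMATCH (%s) AGAINST ('%s' IN BOOLEAN MODE)" % (
        select, columns, expression)
```

For example, form_phrase_query("holding", "holding_id", "subject_text", ["santa", "barbara"]) produces the alternate form with a single REGEXP condition on subject_text. The [[:<:]] and [[:>:]] markers belong to the regular expression library used by the MySQL versions this paradigm targets; MySQL 8.0.4 and later use a different library in which \b plays that role.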
The semantics of the "contains-all-words" operator will
generally be correct only if the cardinality is "1" or "1?". If
the cardinality is "0+" or "1+", wrap this paradigm in an
Adaptor_IndivisibleConcatenation paradigm.
This paradigm assumes that the text processing specified by
'mapping' and 'deleteList' is compatible with MySQL's notion of
words, which it is, by default.
Exceptions thrown:
no query words specified
AUTHOR
Greg Janee
gjanee@alexandria.ucsb.edu
HISTORY
$Log: Textual_MySQLFulltext.py,v $
Revision 1.2 2003/12/15 23:54:18 peter
Modified source code documentation so that it formats properly when
creating HTML documents with happydoc.
Revision 1.1 2003/12/08 23:32:56 valentin
update to oct2003
Revision 1.1 2003/11/06 04:41:41 gjanee
Initial revision
Functions
_formAnyExpression ( wordList )
_formAllExpression ( wordList )
_formRegexp ( wordList )
_formPhraseExpression ( wordList )
_protectRegexpSpecials ( word )