I've been enlisted to help on a project and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into using tsvector fields before and now find myself facing a bug based on them. I read the documentation on the field type and it's purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery()
Example
> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column?
----------
f
(1 row)
> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column?
----------
t
(1 row)
I would think they should both return true, but obviously the first does not - why?
The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type.
@@ operator checks if tsquery matches tsvector. For instance, if the word to be queried is “fox” for the above-mentioned example, then: SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
tsvector. A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details).
The FTS utilizes dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.
So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
Let us look at some output from ts_debug
and see what's going on with the dictionaries:
=> select * from ts_debug('english', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+-----------
asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}
=> select * from ts_debug('simple', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | mortgage | {simple} | simple | {mortgage}
Notice that simple
uses the simple
dictionary whereas english
uses the english_stem
dictionary.
The simple
dictionary:
operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
The simple
dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:
=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
to_tsquery | to_tsquery
------------+-------------
'mortgage' | 'mortgages'
The simple
dictionary is too simple to even handle simple plurals.
So what is this english_stem
dictionary all about? The "stem" suffix is a give away: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:
12.6.6. Snowball Dictionary
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
And just below that we see the english_stem
dictionary:
CREATE TEXT SEARCH DICTIONARY english_stem ( TEMPLATE = snowball, Language = english, StopWords = english );
So the english_stem
dictionary stems words and we can see that happen:
=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
to_tsquery | to_tsquery
------------+------------
'mortgag' | 'mortgag'
Executive Summary: 'simple'
implies simple minded literal matching, 'english'
applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag and that gives you your match.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With