
"to_tsquery" on tsvector yields different results when using "simple" and "english"?

I've been enlisted to help on a project, and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into tsvector fields before, and now I find myself facing a bug based on them. I've read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().

Example

> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column? 
----------
 f
(1 row)

> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column? 
----------
 t
(1 row)

I would think they should both return true, but obviously the first does not - why?

asked May 25 '12 by phatskat


1 Answer

PostgreSQL's full text search uses dictionaries to normalize the text:

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.

So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
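We can check that claim directly with the manual's city/cities example (this assumes a stock english configuration):

=> select to_tsvector('english', 'city') @@ to_tsquery('english', 'cities');
 ?column? 
----------
 t
(1 row)

Both words normalize to the same lexeme, so either form matches the other.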

Let us look at some output from ts_debug and see what's going on with the dictionaries:

=> select * from ts_debug('english', 'mortgage');
   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes  
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}

=> select * from ts_debug('simple', 'mortgage');
   alias   |   description   |  token   | dictionaries | dictionary |  lexemes   
-----------+-----------------+----------+--------------+------------+------------
 asciiword | Word, all ASCII | mortgage | {simple}     | simple     | {mortgage}

Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.

The simple dictionary:

operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.

The simple dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:

=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
 to_tsquery | to_tsquery  
------------+-------------
 'mortgage' | 'mortgages'

The simple dictionary is too simple to even handle simple plurals.

So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
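You can ask a single dictionary to normalize a single token with ts_lexize and watch the stemmer at work (the dictionary names here are the stock ones shipped with PostgreSQL):

=> select ts_lexize('english_stem', 'cities'), ts_lexize('simple', 'cities');
 ts_lexize | ts_lexize 
-----------+-----------
 {citi}    | {cities}
(1 row)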

And just below that we see the english_stem dictionary:

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

So the english_stem dictionary stems words and we can see that happen:

=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
 to_tsquery | to_tsquery 
------------+------------
 'mortgag'  | 'mortgag'

Executive Summary: 'simple' implies simple-minded literal matching; 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag, and that gives you your match.
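The practical upshot for the bug is to use the same configuration on both sides of @@, passing it explicitly to to_tsvector rather than relying on the default configuration. For example (again assuming the stock english configuration):

=> select to_tsvector('english', 'mortgage') @@ to_tsquery('english', 'mortgages');
 ?column? 
----------
 t
(1 row)

The original mismatch came from stemming one side of the comparison while the other side held an unstemmed word.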

answered Nov 15 '22 by mu is too short