
"to_tsquery" on tsvector yields different results when using "simple" and "english"?

I've been enlisted to help on a project, and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into tsvector fields before, and now I find myself facing a bug based on them. I've read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().

Example

> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column? 
----------
 f
(1 row)

> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column? 
----------
 t
(1 row)

I would think they should both return true, but obviously the first does not - why?

asked May 25 '12 by phatskat


1 Answer

PostgreSQL's full text search uses dictionaries to normalize the text:

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.

So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
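We can check that claim directly with the manual's city/cities example (this assumes a stock english configuration):

=> select to_tsvector('english', 'city') @@ to_tsquery('english', 'cities');
 ?column? 
----------
 t
(1 row)

Both words normalize to the same lexeme, so either form matches the other.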

Let us look at some output from ts_debug and see what's going on with the dictionaries:

=> select * from ts_debug('english', 'mortgage');
   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes  
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}

=> select * from ts_debug('simple', 'mortgage');
   alias   |   description   |  token   | dictionaries | dictionary |  lexemes   
-----------+-----------------+----------+--------------+------------+------------
 asciiword | Word, all ASCII | mortgage | {simple}     | simple     | {mortgage}

Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.

The simple dictionary:

operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.

The simple dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:

=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
 to_tsquery | to_tsquery  
------------+-------------
 'mortgage' | 'mortgages'

The simple dictionary is too simple to even handle simple plurals.

So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
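You can ask a single dictionary to normalize a single token with ts_lexize and watch the stemmer at work (the dictionary names here are the stock ones shipped with PostgreSQL):

=> select ts_lexize('english_stem', 'cities'), ts_lexize('simple', 'cities');
 ts_lexize | ts_lexize 
-----------+-----------
 {citi}    | {cities}
(1 row)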

And just below that we see the english_stem dictionary:

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

So the english_stem dictionary stems words and we can see that happen:

=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
 to_tsquery | to_tsquery 
------------+------------
 'mortgag'  | 'mortgag'

Executive Summary: 'simple' implies simple-minded literal matching; 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag, and that gives you your match.
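The practical upshot for the bug is to use the same configuration on both sides of @@, passing it explicitly to to_tsvector rather than relying on the default configuration. For example (again assuming the stock english configuration):

=> select to_tsvector('english', 'mortgage') @@ to_tsquery('english', 'mortgages');
 ?column? 
----------
 t
(1 row)

The original mismatch came from stemming one side of the comparison while the other side held an unstemmed word.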

answered Nov 15 '22 by mu is too short