Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Problems at postgresql full text search with words, containing single quotes

Tags:

php

postgresql

I need to use a full text search to get the rows where the column contains the word O'henry

My code:

$word = "O'henry";
$sql = "SELECT * FROM mytable WHERE to_tsvector('english', col) @@ to_tsquery(:word) ";

$sth = $db->prepare( $sql );
$sth->execute(
    array(":word"=>$word)
);

I have two problems:

1) This query matches columns which contains the word O'henry, but it also matches columns containing, for example: "O yes, thierry henry is good sportsman."

2) If $word begins with quotes, for example 'henry, I got an error: syntax error in tsquery: "'henry", in spite of the fact that the search string was already escaped.

How can I solve this problem?

like image 425
Oto Shavadze Avatar asked May 13 '13 18:05

Oto Shavadze


2 Answers

The default english dictionary is going to tokenize your data treating the ' as a it would a space. You can use ts_debug to inspect how PostgreSQL/tsearch will treat your text:

psql=# SELECT * FROM ts_debug('english','o''henry');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | o     | {english_stem} | english_stem | {o}
 blank     | Space symbols   | '     | {}             |              | 
 asciiword | Word, all ASCII | henry | {english_stem} | english_stem | {henri}
(3 rows)

It's clear from this output that if you want to have postgres treat o'henry as a single word, you are going to have to either:

  • create a custom dictionary to handle words with apostrophes
  • or strip apostrophes from both your tsvector and tsquery before use.

I'd say the second option is by far the easiest:

$sql = "SELECT * FROM mytable WHERE plainto_tsvector('english', replace(col, '''','')) @@ to_tsquery(replace(:word,'''',''));"
like image 159
Chris Farmiloe Avatar answered Oct 02 '22 16:10

Chris Farmiloe


After reading through the documentation, I believe that this is a limitation within the types of strings accepted by to_tsquery. According to this page:

the input to to_tsquery must already follow the general rules for tsquery input

The tsquery input rules are defined here. Based on the examples given, the engine processes each input word into single-quoted strings, which will break on your input. As a workaround, I found a suggestion in this forum post (where the user was trying to properly escape the input 'ksan:

select *
from items
where to_tsvector(name) @@ to_tsquery(E'[\']ksan')

To apply this to your situation, you would need to pre-format your input using something like this:

$word = 'E' . str_replace("'", "[\']", $word);
like image 33
George Cummins Avatar answered Oct 02 '22 15:10

George Cummins