
Why is PostgreSQL stripping HTML entities in ts_headline()?

I'm writing a prototype of a full-text search feature which will return found documents' "headlines" in the search result. Here's a slightly modified example from the Postgres docs:

SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms <b>and</b> return them in <order> of their similarity to the query.',
  to_tsquery('query & similarity'),
  'StartSel = XXX, StopSel = YYY');

What I would expect would be something like

"documents containing given XXXqueryYYY terms <b>and</b> return them in <order> of their XXXsimilarityYYY to the XXXqueryYYY."

What I get instead is

"documents containing given XXXqueryYYY terms  and  return them in   of their XXXsimilarityYYY to the XXXqueryYYY."

It looks like everything that looks remotely like an HTML tag is stripped and replaced with a single space character (note the double spaces around "and").

I couldn't find anything in the docs stating that Postgres assumes the input text is HTML and that the user would want the tags stripped. The API allows overriding StartSel and StopSel from the defaults of <b> and </b>, so I'd think it was meant to serve a more general use case.

Is there some setting or comment in the docs that I'm missing?

nietaki asked Oct 26 '16 13:10


1 Answer

<b> and </b> are recognized as tag tokens. By default they are ignored, because the english configuration has no dictionary mapped to the tag token type. You need to modify the existing configuration or create a new one:

=# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
=# alter text search configuration english_tag
   add mapping for tag with simple;

Then tags aren't skipped:

=# select * from ts_debug('english_tag', 'query <b>test</b>');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
 blank     | Space symbols   |       | {}             | (null)       | (null)
 tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
 asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
 tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}

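But even in this case ts_headline will still replace the tags with a space, because that behavior is hardcoded in the default parser's source (src/backend/tsearch/wparser_def.c):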

#define HLIDREPLACE(x)  ( (x)==TAG_T )

There is a workaround, of course: create your own text search parser extension (there is an example on GitHub) and change

#define HLIDREPLACE(x)  ( (x)==TAG_T )

to

#define HLIDREPLACE(x)  ( false )
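
Before building a custom parser, it may help to confirm how the built-in parser classifies your text. The queries below are a minimal sketch, assuming the built-in parser (named default) and the stock english configuration; the sample sentence is just a fragment of the one from the question, and neither query modifies anything:

```sql
-- List the token type that TAG_T corresponds to in the built-in parser.
SELECT tokid, alias, description
FROM ts_token_type('default')
WHERE alias = 'tag';

-- Show which parts of the input are classified as tag tokens,
-- and which dictionaries (if any) the configuration maps to them.
SELECT alias, token, dictionaries
FROM ts_debug('english',
              'terms <b>and</b> return them in <order> of their similarity')
WHERE alias = 'tag';
```

Any token reported with the alias tag by the second query is one that the HLIDREPLACE macro above marks for replacement in the ts_headline output, regardless of the dictionary mapping.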
Artur answered Nov 16 '22 06:11