Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL Full Text Search Spanish character Ñ

I am facing an issue when doing full text search with PostgreSQL on text that contains de Spanish character 'Ñ'

When I try to tokenize the Spanish word 'AÑO' (year) I get the following results depending on if input is upper or lower case:

SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año')
"to_tsvector"   "to_tsquery"
"'aÑo':1"   "'año'"

As you can see result is not the same and it is case sensitive, so it makes my application full text search queries case sensitive if they contain this character.

Is there any way to overcome this issue? I have been searching on PostgreSQL documentation about full text search, and I don't know how to change this behaviour on installed dictionaries.

Thank you so much. Martí

like image 609
Marti Pàmies Solà Avatar asked Aug 08 '17 12:08

Marti Pàmies Solà


People also ask

How do I use full text search in PostgreSQL?

When performing a full-text searches on a PostgreSQL database, you must implement the to_tsvector() and to_tsquery() functions together with a match operator @@. The match operator returns a boolean value (either t or f) when you run a keyword search (tsquery) against a tsvector document.

Is PostgreSQL good for full text search?

Yes, You Can Keep Full-Text Search in Postgres You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions.

What is FTS in PostgreSQL?

To summarize, we learnt how to perform full-text search operation in PostgreSQL. If you liked our article, check out the book Mastering PostgreSQL 10 to understand how to perform operations such as indexing, query optimization, concurrent transactions, table partitioning, server tuning, and more.

What is to Tsvector in PostgreSQL?

1. tsvector. A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details).


1 Answers

The ability for to_tsvector to convert Ñ into ñ depends on the locale, and specifically on lc_ctype. Presumably your database is using an LC_CTYPE such as C whose knowledge is limited to US-ASCII.

Example with an LC_CTYPE compatible with Unicode:

test=> show lc_ctype;
  lc_ctype   
-------------
 fr_FR.UTF-8
(1 row)

test=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año');
 to_tsvector | to_tsquery 
-------------+------------
 'año':1     | 'año'
(1 row)

Note that the downcasing is what you expect.

Opposite example with C:

creation:

CREATE DATABASE cc lc_ctype 'C' template template0;

Note the lack of downcasing, as in the question:

cc=> show lc_ctype ;
 lc_ctype 
----------
 C
(1 row)

cc=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año');
 to_tsvector | to_tsquery 
-------------+------------
 'aÑo':1     | 'año'
(1 row)
like image 113
Daniel Vérité Avatar answered Jan 02 '23 12:01

Daniel Vérité