I am facing an issue when doing full text search with PostgreSQL on text that contains de Spanish character 'Ñ'
When I try to tokenize the Spanish word 'AÑO' (year) I get the following results depending on if input is upper or lower case:
SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año')
"to_tsvector" "to_tsquery"
"'aÑo':1" "'año'"
As you can see result is not the same and it is case sensitive, so it makes my application full text search queries case sensitive if they contain this character.
Is there any way to overcome this issue? I have been searching on PostgreSQL documentation about full text search, and I don't know how to change this behaviour on installed dictionaries.
Thank you so much. Martí
When performing a full-text searches on a PostgreSQL database, you must implement the to_tsvector() and to_tsquery() functions together with a match operator @@. The match operator returns a boolean value (either t or f) when you run a keyword search (tsquery) against a tsvector document.
Yes, You Can Keep Full-Text Search in Postgres You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions.
To summarize, we learnt how to perform full-text search operation in PostgreSQL. If you liked our article, check out the book Mastering PostgreSQL 10 to understand how to perform operations such as indexing, query optimization, concurrent transactions, table partitioning, server tuning, and more.
1. tsvector. A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details).
The ability for to_tsvector
to convert Ñ
into ñ
depends on the locale, and specifically on lc_ctype
. Presumably your database is using an LC_CTYPE
such as C
whose knowledge is limited to US-ASCII
.
Example with an LC_CTYPE
compatible with Unicode:
test=> show lc_ctype; lc_ctype ------------- fr_FR.UTF-8 (1 row) test=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año'); to_tsvector | to_tsquery -------------+------------ 'año':1 | 'año' (1 row)
Note that the downcasing is what you expect.
Opposite example with C
:
creation:
CREATE DATABASE cc lc_ctype 'C' template template0;
Note the lack of downcasing, as in the question:
cc=> show lc_ctype ; lc_ctype ---------- C (1 row) cc=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año'); to_tsvector | to_tsquery -------------+------------ 'aÑo':1 | 'año' (1 row)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With