I just ran into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as my custom one) does not handle letter case. Example:
SELECT * from to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
> 'На':1 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2
'На' is a stop word and should be removed, but it is not even lowercased in the resulting vector. If I pass a lowercased string, everything works properly:
SELECT * from to_tsvector('test_russian', 'на рынке появились новые рублевые облигации');
> 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2
Sure, I can pass pre-lowercased strings, but the manual says:
The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.
The test_russian config looks like this:
create text search CONFIGURATION test_russian (COPY = 'russian');
CREATE TEXT SEARCH DICTIONARY russian_simple (
TEMPLATE = pg_catalog.simple,
STOPWORDS = russian
);
CREATE TEXT SEARCH DICTIONARY russian_snowball (
TEMPLATE = snowball,
Language = russian,
StopWords = russian
);
alter text search configuration test_russian
alter mapping for word
with russian_simple,russian_snowball;
But I actually get exactly the same results with the built-in russian config.
I tried ts_debug, and the tokens are treated as word, as I expected.
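For reference, a ts_debug call along these lines shows the token type and the dictionary that handles each word. This is only a sketch: the sample text is shortened from the example above, and the selected columns are the ones ts_debug normally returns.

```sql
-- Show how each token is classified and which dictionary produces its lexeme.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('test_russian', 'На рынке');
```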
Any ideas?
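Problem solved. The cause was that the database had been initialized with the default ("C") CType and Collate. With a "C" CType, lowercasing does not affect non-ASCII letters, so 'На' was never converted to 'на' and never matched the stop word.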
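A quick way to check whether a database is affected might look like the following. It is a minimal sketch: it assumes you are connected to the database in question, and the second query relies on the usual behavior of lower() under a "C" CType, which may differ with other locale providers or newer versions.

```sql
-- Locale settings of the current database.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Under a "C" CType, Cyrillic letters are typically returned unchanged.
SELECT lower('На');
```

If datctype shows C and the second query returns the input with the capital letter intact, the simple dictionary will most likely not lowercase Russian tokens either.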
We used
initdb --locale=UTF-8 --lc-collate=UTF-8 --encoding=UTF-8 -U pgsql *PGSQL DATA DIR*
to recreate the instance, and
CREATE DATABASE "scratch"
WITH OWNER "postgres"
ENCODING 'UTF8'
LC_COLLATE = 'ru_RU.UTF-8'
LC_CTYPE = 'ru_RU.UTF-8';
to recreate the database, and the simple dictionary now works.
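To confirm the fix, something like the following can be run in the recreated database. It is a sketch that assumes the russian_simple dictionary and test_russian configuration from the question have been created again there.

```sql
-- An empty array ({}) means the token was recognized as a stop word;
-- NULL means the dictionary did not recognize the token at all.
SELECT ts_lexize('russian_simple', 'На');

-- The capitalized input should now give the same vector as the lowercased one.
SELECT to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
```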