
PostgreSQL full text search tokenizer

I just ran into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as my custom one) does not handle letter case. Example:

SELECT * from to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
> 'На':1 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

'На' is a stop word and should be removed, but it is not even lowercased in the resulting vector. If I pass a lowercased string, everything works properly:

SELECT * from to_tsvector('test_russian', 'на рынке появились новые рублевые облигации');
> 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

Sure, I can pass pre-lowercased strings, but the manual says:

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.

The test_russian config looks like this:

CREATE TEXT SEARCH CONFIGURATION test_russian (COPY = 'russian');

CREATE TEXT SEARCH DICTIONARY russian_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = russian
);

CREATE TEXT SEARCH DICTIONARY russian_snowball (
    TEMPLATE = snowball,
    Language = russian,
    StopWords = russian
);

ALTER TEXT SEARCH CONFIGURATION test_russian
    ALTER MAPPING FOR word
    WITH russian_simple, russian_snowball;

But I actually get exactly the same results with the built-in russian config.

I tried ts_debug, and the tokens are treated as word, as I expected.
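
For reference, a check along these lines shows how each token is classified and which dictionary ends up handling it. It is only a sketch, and it assumes the test_russian configuration and the russian_simple dictionary defined above already exist in the current database:

```sql
-- Show the token type, the dictionaries consulted, and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('test_russian', 'На рынке');

-- Ask the simple dictionary directly. For a recognized stop word
-- ts_lexize returns an empty array; otherwise (with the default
-- Accept = true) it returns the lowercased token.
SELECT ts_lexize('russian_simple', 'На');
SELECT ts_lexize('russian_simple', 'на');
```

If the two ts_lexize calls give different results, the dictionary is most likely not lowercasing the capitalized token before the stop word lookup.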

Any ideas?

Tommi asked Aug 08 '13 08:08


1 Answer

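Problem solved. The cause was that the database cluster had been initialized with the default ("C") LC_CTYPE and LC_COLLATE, so non-ASCII characters such as Cyrillic were not lowercased and 'На' never matched the stop word 'на'. We used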

initdb --locale=UTF-8 --lc-collate=UTF-8 --encoding=UTF-8 -U pgsql *PGSQL DATA DIR* 

to recreate the instance, and

CREATE DATABASE "scratch"
  WITH OWNER "postgres"
  ENCODING 'UTF8'
  LC_COLLATE = 'ru_RU.UTF-8'
  LC_CTYPE = 'ru_RU.UTF-8';

to recreate the database, and the simple dictionary now works.
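
To confirm which locale a database is actually using, something like the following can be run while connected to it. This is a minimal sketch, not part of the original fix; the column alias lowercases_cyrillic is just an illustrative name:

```sql
-- Encoding and locale settings of the current database.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Quick check that lower() handles Cyrillic under this LC_CTYPE.
-- With a "C" ctype, as in the situation described above, this is
-- expected to come back false, because only ASCII letters are lowercased.
SELECT lower('На') = 'на' AS lowercases_cyrillic;
```

If datctype still shows C, the text search dictionaries will keep seeing the capitalized token as-is.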

Tommi answered Oct 18 '22 07:10