 

to_tsvector in simple mode throwing away non-English text in some setups

On some Postgres installs I am noticing the following happens:

sam=# select '你好 世界'::tsvector;
   tsvector    
---------------
 '世界' '你好'
(1 row)

sam=# select to_tsvector('simple', '你好 世界');
 to_tsvector 
-------------

(1 row)

Even though my db is configured like so:

MBA:bin sam$ ./psql -l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 sam       | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 template0 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
 template1 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
(4 rows)
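
For diagnosis, the same settings can also be queried from inside psql (a quick check, equivalent to the psql -l output above; pg_encoding_to_char turns the stored encoding number back into a name):

SELECT datname, datcollate, datctype, pg_encoding_to_char(encoding)
FROM pg_database;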

On other similar setups I am seeing select to_tsvector('simple', '你好 世界'); correctly return the tokens.

How do I diagnose the simple tokeniser to figure out why it is chucking out these characters?

The simplest repro seems to be installing Postgres via Postgres.app. It does not happen when installing Postgres on Ubuntu with a locale set.

asked Jun 27 '14 by Sam Saffron




1 Answer

Unfortunately, the default parser used by text search depends heavily on how the database was initialized, in particular on lc_collate and the encoding of the current database.

This is due to the inner workings of the default text parser, which are only vaguely documented:

Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them.

The important part is these comments in the PostgreSQL source code:

/* [...]
 * Notes:
 *  - with multibyte encoding and C-locale isw* function may fail
 *    or give wrong result.
 *  - multibyte encoding and C-locale often are used for
 *    Asian languages.
 *  - if locale is C then we use pgwstr instead of wstr.
 */

and below:

/*
 * any non-ascii symbol with multibyte encoding with C-locale is
 * an alpha character
 */

Consequently, if you want to use the default parser with Chinese, make sure your database is initialized with the C locale and a multibyte encoding, so that all characters above U+007F are treated as alpha (including space characters such as IDEOGRAPHIC SPACE, U+3000!). Typically, the following initdb call will do what you expect:

initdb --locale=C -E UTF-8

Otherwise, Chinese characters will be skipped and treated as blank.
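
If re-running initdb on an existing cluster is impractical, an individual database with the required settings can also be created from template0 (a sketch; the database name is only an example):

CREATE DATABASE chinese_fts
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';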

You can check this with the debug function ts_debug. With a database initialized with lc_collate=en_US.UTF-8, or any other configuration where tokenization fails, you will get:

SELECT * FROM ts_debug('simple', '你好 世界');
 alias |  description  |   token   | dictionaries | dictionary | lexemes 
-------+---------------+-----------+--------------+------------+---------
 blank | Space symbols | 你好 世界 | {}            |            | 

Conversely, with lc_collate=C and a UTF-8 database (initialized as above), you will get the proper result:

SELECT * FROM ts_debug('simple', '你好 世界');
 alias |    description    | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+--------------+------------+---------
 word  | Word, all letters | 你好  | {simple}     | simple     | {你好}
 blank | Space symbols     |       | {}           |            | 
 word  | Word, all letters | 世界  | {simple}     | simple     | {世界}
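
With such a database, the query from the question also returns the expected tokens. One caveat follows from the C-locale rule quoted above (the second example sketches the implied behavior, using chr(12288) to produce U+3000):

SELECT to_tsvector('simple', '你好 世界');
-- expected: '世界':2 '你好':1

-- U+3000 IDEOGRAPHIC SPACE is above U+007F, so under the C locale it is
-- classified as alpha and does NOT split tokens:
SELECT to_tsvector('simple', '你好' || chr(12288) || '世界');
-- expected: a single lexeme containing the ideographic space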

It seems, however, that you mean to tokenize Chinese text where the words are already separated by regular spaces, i.e. the tokenization/segmentation does not happen within PostgreSQL. For this use case, I strongly suggest using a custom parser. This is especially true if you do not use other features of PostgreSQL's default parser, such as tokenizing URLs.
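
For completeness, if installing a custom parser is not an option, space-separated text can also be split in plain SQL and converted with array_to_tsvector (assuming PostgreSQL 9.6 or later; note that the resulting tsvector carries no lexeme positions, just like the ::tsvector cast in the question):

SELECT array_to_tsvector(regexp_split_to_array('你好 世界', '\s+'));
-- returns '世界' '你好', independent of lc_ctype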

A parser that tokenizes on space characters is very easy to implement. In fact, contrib/test_parser contains sample code doing exactly that. This parser works regardless of the locale. There was a buffer overrun bug in this parser that was fixed in 2012, so make sure you use a recent version.
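
For illustration, wiring test_parser into a configuration might look like this (a sketch following the module's documentation; the configuration name is only an example):

CREATE EXTENSION test_parser;

-- testparser is the parser registered by the extension; map its
-- 'word' token type to the simple dictionary:
CREATE TEXT SEARCH CONFIGURATION spacecfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION spacecfg
    ADD MAPPING FOR word WITH simple;

SELECT to_tsvector('spacecfg', '你好 世界');
-- expected: '世界':2 '你好':1, regardless of lc_ctype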

answered Sep 18 '22 by Paul Guyot