 

to_tsvector in simple mode throwing away non-English text in some setups

On some Postgres installs I am noticing the following happens:

sam=# select '你好 世界'::tsvector;
   tsvector    
---------------
 '世界' '你好'
(1 row)

sam=# select to_tsvector('simple', '你好 世界');
 to_tsvector 
-------------

(1 row)

Even though my db is configured like so:

MBA:bin sam$ ./psql -l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 sam       | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 template0 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
 template1 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
(4 rows)
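
For diagnosis, the same settings can also be queried from inside psql (a quick check, equivalent to the psql -l output above; pg_encoding_to_char turns the stored encoding number back into a name):

SELECT datname, datcollate, datctype, pg_encoding_to_char(encoding)
FROM pg_database;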

On other similar setups I am seeing select to_tsvector('simple', '你好 世界'); correctly return the tokens.

How do I diagnose the simple tokeniser to figure out why it is chucking out these characters?

The simplest repro seems to be installing Postgres via Postgres.app. It does not happen when installing Postgres on Ubuntu with a locale set.

asked Jun 27 '14 by Sam Saffron




1 Answer

Unfortunately, the default parser used by text search depends heavily on how the database was initialized, in particular on lc_collate and the encoding of the current database.

This is due to the inner workings of the default text parser, which are only vaguely documented:

Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them.

The important part is these comments in the PostgreSQL source code:

/* [...]
 * Notes:
 *  - with multibyte encoding and C-locale isw* function may fail
 *    or give wrong result.
 *  - multibyte encoding and C-locale often are used for
 *    Asian languages.
 *  - if locale is C then we use pgwstr instead of wstr.
 */

and below:

/*
 * any non-ascii symbol with multibyte encoding with C-locale is
 * an alpha character
 */

Consequently, if you want to use the default parser with Chinese, make sure your database is initialized with the C locale and a multibyte encoding, so that all characters above U+007F are treated as alpha (including space characters such as IDEOGRAPHIC SPACE, U+3000!). Typically, the following initdb call will do what you expect:

initdb --locale=C -E UTF-8

Otherwise, Chinese characters will be skipped and treated as blank.
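
If re-running initdb on an existing cluster is impractical, an individual database with the required settings can also be created from template0 (a sketch; the database name is only an example):

CREATE DATABASE chinese_fts
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';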

You can check this with the debug function ts_debug. With a database initialized with lc_collate=en_US.UTF-8, or any other configuration where tokenization fails, you will get:

SELECT * FROM ts_debug('simple', '你好 世界');
 alias |  description  |   token   | dictionaries | dictionary | lexemes 
-------+---------------+-----------+--------------+------------+---------
 blank | Space symbols | 你好 世界 | {}            |            | 

Conversely, with lc_collate=C and a UTF-8 database (initialized as above), you will get the proper result:

SELECT * FROM ts_debug('simple', '你好 世界');
 alias |    description    | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+--------------+------------+---------
 word  | Word, all letters | 你好  | {simple}     | simple     | {你好}
 blank | Space symbols     |       | {}           |            | 
 word  | Word, all letters | 世界  | {simple}     | simple     | {世界}
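
With such a database, the query from the question also returns the expected tokens. One caveat follows from the C-locale rule quoted above (the second example sketches the implied behavior, using chr(12288) to produce U+3000):

SELECT to_tsvector('simple', '你好 世界');
-- expected: '世界':2 '你好':1

-- U+3000 IDEOGRAPHIC SPACE is above U+007F, so under the C locale it is
-- classified as alpha and does NOT split tokens:
SELECT to_tsvector('simple', '你好' || chr(12288) || '世界');
-- expected: a single lexeme containing the ideographic space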

It seems, however, that you mean to tokenize Chinese text where the words are already separated by regular spaces, i.e. the tokenization/segmentation does not happen within PostgreSQL. For this use case, I strongly suggest using a custom parser. This is especially true if you do not use other features of PostgreSQL's default parser, such as tokenizing URLs.
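
For completeness, if installing a custom parser is not an option, space-separated text can also be split in plain SQL and converted with array_to_tsvector (assuming PostgreSQL 9.6 or later; note that the resulting tsvector carries no lexeme positions, just like the ::tsvector cast in the question):

SELECT array_to_tsvector(regexp_split_to_array('你好 世界', '\s+'));
-- returns '世界' '你好', independent of lc_ctype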

A parser that tokenizes on space characters is very easy to implement. In fact, contrib/test_parser contains sample code doing exactly that. This parser works regardless of the locale. There was a buffer overrun bug in this parser that was fixed in 2012, so make sure you use a recent version.
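
For illustration, wiring test_parser into a configuration might look like this (a sketch following the module's documentation; the configuration name is only an example):

CREATE EXTENSION test_parser;

-- testparser is the parser registered by the extension; map its
-- 'word' token type to the simple dictionary:
CREATE TEXT SEARCH CONFIGURATION spacecfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION spacecfg
    ADD MAPPING FOR word WITH simple;

SELECT to_tsvector('spacecfg', '你好 世界');
-- expected: '世界':2 '你好':1, regardless of lc_ctype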

answered Sep 18 '22 by Paul Guyot