Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pg_trgm behaves differently on Ubuntu and Mac OS X

I am using pg_trgm to perform fuzzy string match where characters can be Chinese. Strangely, on my Ubuntu server, everything is fine, as following:

SELECT show_trgm('原作者');
> {0xa09182,0xcdfdbb,0x183afe,leD}

However, on my Mac, it does not work:

SELECT show_trgm('原作者');
> {}

I guess it is due to some strange encoding staff, but I examined all settings that I can possibly imagine, including:

SHOW SERVER_VERSION;
SHOW SERVER_ENCODING;
SHOW LC_COLLATE;
SHOW LC_CTYPE;

Where on Ubuntu it shows:

9.5.1
UTF8
en_US.UTF-8
en_US.UTF-8

and on Mac it shows:

9.5.3
UTF8
en_US.UTF-8
en_US.UTF-8

Also, the pg_trgm versions are both 1.1, according to SELECT * FROM pg_extension.

Could anyone help me to find why the pg_trgm not works on Unicode on my Mac?

like image 732
HanXu Avatar asked Oct 30 '22 00:10

HanXu


2 Answers

Reason for this is that pg_trgm depends on libc (system library shipped with OS) routines for classifying which characters are alphabetic and which aren't and this is (unfortunately) different between OSes. Apple Mac OS X is known for interpreting UTF-8 different way than other Unix/Unix-like systems. Character classification is different per-locale and is driven by category LC_CTYPE (and envvar of same name).

Check output of postgres=# \l and you should see Ctype column which tells you how characters are classified in your database.

  • If this is C (seen that on Apple MacOS X before) try to create database again specifying CREATE DATABASE foo ... LC_CTYPE="en_US.UTF-8"
  • If it is already en_US.UTF-8 it is very likely MacOS X doesn't classify UTF-8 Chinese characters as alphabetic in this locale (not surprising). Try LC_CTYPE="zh_CN.UTF-8" instead and that should work.
like image 102
blami Avatar answered Nov 01 '22 14:11

blami


In macOS, this is the issue of character encoding. Based on the language you have to explicitly flag the encoding type. default en_US.UTF-8 will definitely don't work. so:

Chinese : LC_CTYPE="zh_CN.UTF-8"

likewise, locale should be changed accordingly to the language. Although, there is no point of encode/decode Chinese in US English

You can create db:

CREATE DATABASE mydb WITH ENCODING='UTF8' LC_CTYPE='zh_CN.UTF-8' LC_COLLATE='zh_CN.UTF-8' OWNER=postgres TEMPLATE=template0 CONNECTION LIMIT=-1;
like image 37
Nilanka Manoj Avatar answered Nov 01 '22 14:11

Nilanka Manoj