Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

en_us.UTF8 collation for non-English languages

I am trying the PostgreSQL database for the first time, after having worked for some time with MySQL. My environment is a shared hosting with cPanel and phpPgAdmin. One thing that puzzles me is the database collation. My hosting's cPanel always creates databases with the Encoding, Collation, and Character Type set to UTF8, en_US.UTF-8, and en_US.UTF-8 respectively. I don't seem to have any way to change it, because the database is created via cPanel, there are no options there, and according to this answer, these parameters can only be changed by re-creating the database with the desired settings.

So I was wondering: does it actually matter? If the collation is set to en_us.UTF8, what will happen with non-English and even non-Latin strings, such as Russian or Hebrew? How will they be sorted?

Update: I am puzzled because in MySQL I used to just choosing the utf8mb4_unicode_ci collation and not care about a specific language. I wonder how it works in comparison to country-language-specific collations in PostgreSQL.

like image 400
texnic Avatar asked Oct 31 '25 20:10

texnic


1 Answers

The collation behaviour will depend on the underlying operating system's locale support. In general I think you should expect fairly "dumb" sorting, probably by Unicode codepoint, for characters outside the current locale. Probably the default collatoin for the unicode collation algorithm.

I'm having a hard time quickly finding sample data that collates differently in order to test that.

Note, though, that you can declare collations on a per-column or per-operation basis with the COLLATE term, e.g.

CREATE TABLE sometable(
   ...,
   companyname text COLLATE "ru_RU",
   ...
);

or

SELECT *
FROM sometable
ORDER BY companyname COLLATE "ru_RU"

or

SELECT *
FROM sometable
WHERE companyname < 'Компания' COLLATE "ru_RU"

so you aren't limited to the database's collation.

(Doing this is really the only way to properly handle data in assorted/mixed languages, since there's no one correct collation for all the data anyway).

Edit showing an example:

test=> SHOW lc_collate;
 lc_collate 
------------
 en_AU.utf8
(1 row)

test=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y;
 y  
----
 a
 aa
 z
(3 rows)

craig=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y COLLATE "da_DK";
 y  
----
 a
 z
 aa
(3 rows)

This also shows that the concept of a language-independent collation is total nonsense, and what people really usually mean is "English-like collation" or "collate by Unicode ordinal" (which is in mostly English-like order).

like image 186
Craig Ringer Avatar answered Nov 04 '25 02:11

Craig Ringer



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!