Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

postgres collation differences. osx v ubuntu

So, i've recently come to realize that collation is a huge deal on postgres, and that many comments refer to OSX / locale support as "broken", which hasn't enlightened me. for the purposes of this question, i'm ignoring the table/column default aspects of collation, and specifying it explicitly.

  • my laptop is osx with postgres 9.2.4
  • my server is ubuntu with postgres 9.1.9

common to both:

  # show lc_collate ;
   en_US.UTF-8
  # show lc_ctype ;
   en_US.UTF-8

on my laptop:

select ',' < '-' collate "en_US.UTF-8" as result;
  true

now, my server does not have collation "en_US.UTF-8", but it does have "en_US.utf8" (which i recognize is not the same thing, though i would expect it to behave the same)

select ',' < '-' collate "en_US.utf8" as result;
 false

so, here's where i'm freaking out. "C" order would always say (for both machines) that ',' is less than '-', which my brain would agree with.

which utf8 implementation is correct? and if someone could point me at the definition that would help, as mostly i've only been able to find accusations of "broken" leveled at osx. So i'd be worried that i've been wrong my entire life thinking that comma orders before hyphen, but enter a reasonably reliant arbiter of text and unicode etc. python. which on the ubuntu server yields:

>>> print u',' < u'-', ',' < '-'
True True

So, I'm feeling a lot like this collation concept is more broken on my ubuntu server than my osx server. but i don't have a "proper" collation to create my "en_US.UTF-8" collation from ala "create collation", so i'm lost as to how to create parity, or which answer (true/false) i should be using as the correct reference. (besides personally siding with ascii order for what are, after all, ascii characters).

so, in a nutshell, which is the proper answer for en_US.UTF-8 ?

like image 396
Woody Anderson Avatar asked Nov 14 '13 00:11

Woody Anderson


1 Answers

In the Default Unicode Collation Element Table you can see these two entries:

002C  ; [*0220.0020.0002] # COMMA
002D  ; [*020D.0020.0002] # HYPHEN-MINUS

Here, the primary weight of COMMA is greater than the primary weight of HYPHEN-MINUS, therefore HYPHEN-MINUS sorts before COMMA.

Note that this is the expected sort order according to the Unicode Collation Algorithm with the default weights. If you expect sort order by ASCII byte values, you get a different order. And there are other valid orders. But if the locale is named "en_US.UTF-8" (or "en_US.utf8", same thing), then you'd probably expect Unicode order. But that's between you and your operating system vendor.

like image 115
Peter Eisentraut Avatar answered Oct 22 '22 21:10

Peter Eisentraut