So, i've recently come to realize that collation is a huge deal on postgres, and that many comments refer to OSX / locale support as "broken", which hasn't enlightened me. for the purposes of this question, i'm ignoring the table/column default aspects of collation, and specifying it explicitly.
common to both:
# show lc_collate ;
en_US.UTF-8
# show lc_ctype ;
en_US.UTF-8
on my laptop:
select ',' < '-' collate "en_US.UTF-8" as result;
true
now, my server does not have collation "en_US.UTF-8", but it does have "en_US.utf8" (which i recognize is not the same thing, though i would expect it to behave the same)
select ',' < '-' collate "en_US.utf8" as result;
false
so, here's where i'm freaking out. "C" order would always say (for both machines) that ',' is less than '-', which my brain would agree with.
which utf8 implementation is correct? and if someone could point me at the definition that would help, as mostly i've only been able to find accusations of "broken" leveled at osx. So i'd be worried that i've been wrong my entire life thinking that comma orders before hyphen, but enter a reasonably reliant arbiter of text and unicode etc. python. which on the ubuntu server yields:
>>> print u',' < u'-', ',' < '-'
True True
So, I'm feeling a lot like this collation concept is more broken on my ubuntu server than my osx server. but i don't have a "proper" collation to create my "en_US.UTF-8" collation from ala "create collation", so i'm lost as to how to create parity, or which answer (true/false) i should be using as the correct reference. (besides personally siding with ascii order for what are, after all, ascii characters).
so, in a nutshell, which is the proper answer for en_US.UTF-8 ?
In the Default Unicode Collation Element Table you can see these two entries:
002C ; [*0220.0020.0002] # COMMA
002D ; [*020D.0020.0002] # HYPHEN-MINUS
Here, the primary weight of COMMA is greater than the primary weight of HYPHEN-MINUS, therefore HYPHEN-MINUS sorts before COMMA.
Note that this is the expected sort order according to the Unicode Collation Algorithm with the default weights. If you expect sort order by ASCII byte values, you get a different order. And there are other valid orders. But if the locale is named "en_US.UTF-8" (or "en_US.utf8", same thing), then you'd probably expect Unicode order. But that's between you and your operating system vendor.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With