
PostgreSQL UTF-8 binary collation

I would like a collation that orders the UTF-8 encoding of 0x1234 below that of 0x1235, regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently has BIN and BIN2 collations (http://msdn.microsoft.com/en-us/library/ms143350.aspx). While finding these was easy, I can't even find a list of the collations PostgreSQL supports, much less an answer to this specific question.

asked Oct 15 '11 at 15:10 by chx



3 Answers

The C locale will do. UTF-8 is designed so that byte ordering is also codepoint ordering. This is not trivial but consider how UTF-8 works:

Number range  Byte 1   Byte 2   Byte 3
0000-007F     0xxxxxxx
0080-07FF     110xxxxx 10xxxxxx
0800-FFFF     1110xxxx 10xxxxxx 10xxxxxx

When sorting binary data (which is what the C locale does), the first non-equal byte determines the ordering. What we need to show is that if two numbers encoded into UTF-8 differ, then the first non-equal byte is lower for the lower value. If the numbers are in different ranges, the lead byte is indeed lower for the lower number, because a longer encoding always starts with a larger lead byte. Within the same range, the fixed prefix bits are identical, so the order is determined by literally the same bits as without encoding.
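The argument above can be spot-checked empirically. The following is a minimal sketch in Python (not part of the original answer): it compares random pairs of code points with their UTF-8 encodings and checks that both comparisons agree. The helper names `random_code_point` and `byte_order_matches` are made up for this illustration, and surrogates are skipped because they cannot be encoded as UTF-8.

```python
import random

# Surrogate code points (U+D800..U+DFFF) are not valid in UTF-8, so skip them.
SURROGATES = range(0xD800, 0xE000)


def random_code_point(rng):
    """Return a random encodable Unicode code point (hypothetical helper)."""
    while True:
        cp = rng.randrange(0x110000)
        if cp not in SURROGATES:
            return cp


def byte_order_matches(a, b):
    """True if comparing code points a and b gives the same result
    as comparing their UTF-8 encodings byte by byte."""
    return (a < b) == (chr(a).encode("utf-8") < chr(b).encode("utf-8"))


rng = random.Random(0)
pairs = [(random_code_point(rng), random_code_point(rng)) for _ in range(100_000)]
all_match = all(byte_order_matches(a, b) for a, b in pairs)
print(all_match)

# The pair from the question: U+1234 versus U+1235.
question_pair_ordered = chr(0x1234).encode("utf-8") < chr(0x1235).encode("utf-8")
print(question_pair_ordered)
```

Python compares `bytes` objects lexicographically by unsigned byte value, which is the same rule a binary (C locale) comparison applies.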

answered Sep 28 '22 at 11:09 by chx


Sort order of text depends on lc_collate (not on the system locale!). The system locale only serves as a default when creating the db cluster if you don't provide another locale.

The behaviour you are expecting only works with locale C. Read all about it in the fine manual:

The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

Emphasis mine. PostgreSQL 9.1 adds several new collation features, which might be exactly what you are looking for.

answered Sep 28 '22 at 09:09 by Erwin Brandstetter


Postgres uses the collation defined by the system locale on cluster creation.

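You might try ORDER BY encode(convert_to(column, 'UTF8'), 'hex'). The convert_to() call is needed for a text column because encode() expects a bytea argument.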
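Why sorting on a hex string would give a binary order can be illustrated outside the database. This is a small Python sketch, not PostgreSQL code, and the sample strings are arbitrary: each byte becomes exactly two lowercase hex digits, and the digits 0-9 sort before a-f, so comparing the hex strings gives the same result as comparing the raw bytes.

```python
# Arbitrary sample strings, including the two code points from the question.
words = ["z", "\u00e9", "a", "\u1235", "\u1234", "Z"]

# Sort by the raw UTF-8 bytes (what a binary collation does).
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Sort by the hex rendering of those bytes (what ordering on a hex string does).
by_hex = sorted(words, key=lambda s: s.encode("utf-8").hex())

same_order = by_bytes == by_hex
print(same_order)
```

Note that inside PostgreSQL the hex string is itself text and is compared using the database collation, so this equivalence assumes that collation sorts digits before lowercase letters.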

answered Sep 28 '22 at 09:09 by Ramon Poca