I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.
Using PostgreSQL 9.4.4, I have a words table with the following schema:
lingvoj_dev=# \d words
                                     Table "public.words"
   Column    |            Type             |                     Modifiers
-------------+-----------------------------+----------------------------------------------------
 id          | integer                     | not null default nextval('words_id_seq'::regclass)
 translated  | character varying(255)      |
 meaning     | character varying(255)      |
 times_seen  | integer                     |
 inserted_at | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "words_pkey" PRIMARY KEY, btree (id)
But the following query gives some strange output:
lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10;
translated
------------
ĉu
ŝi
ĝi
ĉevaloj
ĉapelojn
ĉapeloj
ĉambro
vostojn
volas
viro
(10 rows)
The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?
The server encoding is UTF8, and the collation is en_AU.UTF-8.
edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?
I'd be okay with all of the words starting with special characters being at the end...
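Use COLLATE "C", which compares strings byte by byte instead of applying the locale's rules, so ĉ, ŝ and ĝ are treated as distinct characters and sort after the unaccented letters: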
SELECT w."translated"
FROM "words" AS w
ORDER BY w."translated" collate "C" desc limit 10;
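To see why this groups the ĉ words together, here is a rough Python sketch of what a byte-wise comparison does with the words from the question. It is only an approximation of the "C" collation in a UTF8 database, not PostgreSQL's actual code; the function name sort_like_collate_c is made up for illustration, and the words are assumed to be stored in precomposed form (ĉ as a single code point).

```python
def sort_like_collate_c(words, descending=False):
    """Sort strings by their raw UTF-8 bytes.

    Hypothetical helper: it approximates ORDER BY ... COLLATE "C" in a
    UTF8-encoded database, where comparison is done on byte values.
    """
    return sorted(words, key=lambda w: w.encode("utf-8"), reverse=descending)


# Words taken from the question's output, in arbitrary order.
words = ["viro", "ĉu", "volas", "ŝi", "ĉambro", "ĝi",
         "vostojn", "ĉapeloj", "ĉevaloj", "ĉapelojn"]

for word in sort_like_collate_c(words, descending=True):
    print(word)
```

In UTF-8, ĉ is encoded as C4 89, ĝ as C4 9D and ŝ as C5 9D, all of which compare higher than any ASCII letter. So in descending order ŝi comes first, then ĝi, then all of the ĉ words together, then the words starting with v.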
See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL
This query can be awkward to express through an ORM. One alternative is to recreate the database with the LC_COLLATE = C option, as the OP suggested in a comment. Another is to change the collation of a single column (note that the statement below also changes the column type from character varying(255) to text):
ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";