Can I get PostgreSQL to sort rows by a string column respecting the accents?
I found out that it's possible to define a custom collation having "ks" (colStrength) set to "level2", which would mean that it's accent-sensitive.
However, when I try to actually sort using that collation, the order seems to be accent-insensitive.
There is an extensive blog post about this by a PostgreSQL developer (let's use the same ICU locale), which I followed like so:
CREATE TABLE test (string text);
INSERT INTO test VALUES ('bar'), ('bat'), ('bär');
CREATE COLLATION "und1" (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
CREATE COLLATION "und2" (provider = icu, deterministic = false, locale = 'und-u-ks-level2');
CREATE COLLATION "und3" (provider = icu, deterministic = false, locale = 'und-u-ks-level3');
SELECT * FROM test ORDER BY string collate "und1";
SELECT * FROM test ORDER BY string collate "und2";
SELECT * FROM test ORDER BY string collate "und3";
All three collations give me the same order: bar < bär < bat, although I would expect an accent-sensitive order to be bar < bat < bär.
Do I misunderstand the collation capabilities? Is there a way to get an accent-sensitive order?
Also, is there a way to see what options are there for the default built-in collations? I don't see, for example, the used "ks" level in the pg_collation table data.
Yes, PostgreSQL can sort strings accent-sensitively using ICU collations, but there are a few important nuances to get it working correctly.
You're correctly using ICU collations with ks set to level2, which makes comparisons accent-sensitive. However, accent-sensitive does not mean that accented letters sort after every unaccented one. ICU compares strings level by level: differences in base letters (primary level) are decided first, and accents (secondary level) only break ties between strings whose base letters are identical. Since r sorts before t, both bar and bär come before bat, and the accent only decides that bar < bär. At level1, bar and bär compare as equal (their relative order is then arbitrary); at level2 and above they are distinct, but bat still comes last.
Why und doesn't give the order you expect: the und locale uses the root collation rules, in which ä is an accented variant of a. Switching to a language-specific locale only helps if that language treats ä as a letter of its own. English does not, so en-u-ks-level2 sorts these rows the same way as und. Swedish, for example, places ä after z, so a Swedish locale should give the order you're after:
CREATE COLLATION "sv_level2" (provider = icu, deterministic = false, locale = 'sv-u-ks-level2');
SELECT * FROM test ORDER BY string COLLATE "sv_level2";
This should result in the order bar < bat < bär.
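One way to see what the strength level actually changes is to compare the strings directly instead of sorting them. The sketch below assumes the "und1" and "und2" collations from the question already exist. The first comparison should return true (accents ignored at level1), the second false (accents distinguished at level2), and the third true (the r/t difference is settled before accents are considered).

```sql
-- Assumes the collations "und1" and "und2" from the question have been created.

-- Level 1 ignores accents, so the two strings compare as equal.
SELECT 'bar' = 'bär' COLLATE "und1" AS equal_at_level1;

-- Level 2 distinguishes accents, so the two strings differ.
SELECT 'bar' = 'bär' COLLATE "und2" AS equal_at_level2;

-- The base-letter difference (r before t) wins over the accent,
-- which is why bär sorts before bat at every strength level.
SELECT 'bär' < 'bat' COLLATE "und2" AS accented_before_bat;
```

So the three collations do behave differently for equality, GROUP BY and DISTINCT, even though ORDER BY returns the same sequence for these rows.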
You can list all available ICU collations with:
SELECT * FROM pg_collation WHERE collprovider = 'i';
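Unfortunately, the pg_collation catalog has no dedicated column for ICU options such as ks. They appear only as part of the locale string, which is stored in the colllocale column in PostgreSQL 17, colliculocale in 15 and 16, and collcollate in earlier versions. The predefined collations such as "de-x-icu" carry no ks option in their locale string, so they use ICU's default strength, which is tertiary (level3).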
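As a rough sketch, the locale strings can be inspected with a query like the following. It assumes PostgreSQL 17, where the column is named colllocale; on 15 and 16 substitute colliculocale, and on older versions collcollate. The LIKE pattern is only an illustrative filter for locale strings that carry an explicit ks option.

```sql
-- Assumes PostgreSQL 17 (column colllocale); adjust the column name for older versions.
SELECT collname,
       colllocale,            -- ICU locale string, including any -u- options such as ks
       collisdeterministic    -- false for collations created with deterministic = false
FROM pg_collation
WHERE collprovider = 'i'               -- 'i' marks ICU collations
  AND colllocale LIKE '%-u-%ks-%';     -- keep only locales with an explicit strength
```

Dropping the last condition lists every ICU collation, including the predefined ones.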