Does ICU handle the collation of a list of strings of varying languages?

Question

My application may have strings from different alphabets / languages in a single list. I can't seem to find any information on the correct way to sort these, or any indication that ICU supports this.

Example List:

Apple
яблоко
μήλο
Baby
βρέφος
ребенок

Zac Thompson · Accepted Answer

There is no sensible way to do this well. There is no universal sort for all languages, even within the same alphabet. Different languages (cultures, basically) have come up with different collation rules for how words should be sorted.

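The only way to do this consistently at all, I think, is to use plain old codepoint sorting (e.g. in Java, String.compareTo, which strictly speaking compares UTF-16 code units and so matches codepoint order except where supplementary characters are involved).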
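To make the codepoint idea concrete, here is a minimal sketch of a comparator that walks both strings one code point at a time. The class and method names (CodePointSort, BY_CODE_POINT, sortByCodePoint) are made up for illustration; only the standard JDK calls are real.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CodePointSort {
    // Compares two strings by Unicode code point, not by UTF-16 code unit.
    static final Comparator<String> BY_CODE_POINT = (a, b) -> {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortByCodePoint(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(BY_CODE_POINT);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words =
            List.of("Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок");
        // prints [Apple, Baby, βρέφος, μήλο, ребенок, яблоко]
        System.out.println(sortByCodePoint(words));
    }
}
```

For the example list this just groups by Unicode block (Latin, then Greek, then Cyrillic), which is predictable but has nothing to do with any language's alphabetical order.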

You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about the alphabet and language, and then use locale-specific sorting for each group. But you'd have to do this the hard way (code it yourself), I think, because you would guess differently depending on the terms (e.g. is 'mar' the English verb or the Spanish noun?). It's conceivable that you would end up with a worse result than the naive Unicode numerical sort, in terms of unpredictable "errors".
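A rough sketch of that heuristic, using only the JDK: guess the script from the first letter of each string, bucket by script, and sort each bucket with a collator for a guessed locale. The script-to-locale table is purely an assumption on my part (Latin text is not necessarily English, Cyrillic is not necessarily Russian), and ScriptGroupedSort, guessScript and sortGrouped are hypothetical names.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ScriptGroupedSort {
    // ASSUMPTION: one guessed locale per script. Real data needs a better guess.
    static final Map<Character.UnicodeScript, Locale> GUESSED_LOCALE = Map.of(
        Character.UnicodeScript.LATIN, Locale.ENGLISH,
        Character.UnicodeScript.GREEK, Locale.forLanguageTag("el"),
        Character.UnicodeScript.CYRILLIC, Locale.forLanguageTag("ru"));

    // Guesses the script of a string from its first letter.
    static Character.UnicodeScript guessScript(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .mapToObj(Character.UnicodeScript::of)
                .findFirst()
                .orElse(Character.UnicodeScript.COMMON);
    }

    static List<String> sortGrouped(List<String> input) {
        // TreeMap keeps the groups in enum declaration order,
        // which is an arbitrary but stable choice.
        Map<Character.UnicodeScript, List<String>> groups = new TreeMap<>();
        for (String s : input) {
            groups.computeIfAbsent(guessScript(s), k -> new ArrayList<>()).add(s);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<Character.UnicodeScript, List<String>> e : groups.entrySet()) {
            // Fall back to the root locale for scripts we have no guess for.
            Locale locale = GUESSED_LOCALE.getOrDefault(e.getKey(), Locale.ROOT);
            List<String> group = e.getValue();
            group.sort(Collator.getInstance(locale));
            result.addAll(group);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortGrouped(
            List.of("Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок")));
    }
}
```

Note that this only guesses the script, not the language, so it does nothing for the 'mar' problem: two Latin-script strings from different languages still land in the same bucket.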

As with anything else, it depends on how much you can afford to put into the solution, and what kind of performance you need.

This suggestion is not the answer you're looking for: if there's any way to identify the locale when initially storing the strings, you should do so, and record it as part of the string's metadata. Then you won't have this problem.
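If you can record the locale, the sort no longer needs any guessing. A small sketch, assuming a hypothetical LocalizedString holder (not from any library) and an arbitrary decision to order the groups by language tag:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder: the string plus the locale recorded when it was stored.
class LocalizedString {
    final String text;
    final Locale locale;

    LocalizedString(String text, Locale locale) {
        this.text = text;
        this.locale = locale;
    }
}

class LocaleTaggedSort {
    static List<LocalizedString> sort(List<LocalizedString> input) {
        // One collator per locale, created on first use.
        Map<Locale, Collator> collators = new HashMap<>();
        List<LocalizedString> copy = new ArrayList<>(input);
        copy.sort((a, b) -> {
            // Group by language tag first (arbitrary choice of group order).
            int byLocale = a.locale.toLanguageTag().compareTo(b.locale.toLanguageTag());
            if (byLocale != 0) {
                return byLocale;
            }
            // Same locale: use that locale's own collation rules.
            return collators.computeIfAbsent(a.locale, Collator::getInstance)
                            .compare(a.text, b.text);
        });
        return copy;
    }
}
```

How the groups should be ordered relative to each other is still your call; the metadata only settles the order within each language.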

Frédéric Grosshans · Answer

With all the caveats above, there is one "standard universal multilingual sort": the Unicode Collation Algorithm (UCA), which is NOT the codepoint order. From a cursory glance at this page, ICU seems to handle the mixture of UCA and locale-specific preferences.
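As an illustration of sorting a mixed list with a single language-neutral collator, here is a sketch that runs on the plain JDK. Big caveat: java.text.Collator is a rule-based collator and is not a full UCA implementation; I am using it only so the example has no dependencies. With ICU4J on the classpath, the equivalent would be com.ibm.icu.text.Collator.getInstance(ULocale.ROOT), which has a very similar API. RootCollatorSort and sortWithRootCollator are made-up names.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RootCollatorSort {
    // Sorts a mixed-language list with one locale-neutral collator.
    // Stand-in for ICU's root collator; the JDK rules are NOT the UCA.
    static List<String> sortWithRootCollator(List<String> input) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // Unlike codepoint order, a collator does not put every
        // uppercase Latin letter ahead of every lowercase one.
        System.out.println(sortWithRootCollator(
            List.of("Baby", "apple", "яблоко", "μήλο", "βρέφος", "ребенок")));
    }
}
```

The visible difference from codepoint sorting is that "apple" comes before "Baby" here, whereas String.compareTo puts "Baby" first because 'B' has a smaller code than 'a'.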

Tags: sorting, collation, internationalization, icu

TJ Seabrooks
