
Does ICU handle the collation of a list of strings of varying languages?

My application may have strings written in different alphabets / languages in a single list. I can't find any information on the correct way to sort these, or any indication that ICU supports this.

Example List:

  • Apple
  • яблоко
  • μήλο
  • Baby
  • βρέφος
  • ребенок

TJ Seabrooks asked Dec 14 '22 02:12


2 Answers

There is no sensible way to do this well. There is no universal sort for all languages, even within the same alphabet. Different languages (cultures, basically) have come up with different collation rules for how words should be sorted.

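The only way to do this consistently at all, I think, is to use plain old codepoint sorting (e.g. in Java, String.compareTo, which compares UTF-16 code units and so gives code point order as long as no supplementary characters are involved).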
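
As a rough illustration of codepoint sorting, here is a minimal sketch applied to the list from the question. The class name `CodePointSort` and the helper `sorted` are made up for this example. The comparator walks the strings code point by code point, so it also behaves predictably for characters outside the Basic Multilingual Plane, where String.compareTo (which compares UTF-16 code units) can give a different order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CodePointSort {
    // Compares two strings one Unicode code point at a time.
    static final Comparator<String> BY_CODE_POINT = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    // Hypothetical helper: returns a sorted copy, leaving the input untouched.
    static List<String> sorted(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(BY_CODE_POINT);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок");
        // prints [Apple, Baby, βρέφος, μήλο, ребенок, яблоко]
        System.out.println(sorted(words));
    }
}
```

The result is stable and language-independent, but the grouping (Latin, then Greek, then Cyrillic) is only a side effect of where each script sits in Unicode, not a linguistic ordering.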

You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about the alphabet and language, and then use locale-specific sorting for each group. But you'd have to do this the hard way (code it yourself), I think, because you would guess differently depending on the terms (e.g. is 'mar' the English verb or the Spanish noun?). It's conceivable that you would end up with a worse result than the naive Unicode numerical sort, in terms of unpredictable "errors".
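
One way the grouping heuristic might look, as a sketch only: guess the script from the first code point, assume a single locale per script, and sort each group with the JDK's java.text.Collator. The class name `ScriptGroupedSort`, the script-to-locale table, and the choice to look only at the first character are all assumptions for illustration; real data would likely need a better guess than this.

```java
import java.lang.Character.UnicodeScript;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ScriptGroupedSort {
    // Assumed mapping: one locale per script, in the order the groups are emitted.
    static final Map<UnicodeScript, Locale> LOCALE_FOR_SCRIPT = new LinkedHashMap<>();
    static {
        LOCALE_FOR_SCRIPT.put(UnicodeScript.LATIN, Locale.ENGLISH);
        LOCALE_FOR_SCRIPT.put(UnicodeScript.GREEK, Locale.forLanguageTag("el"));
        LOCALE_FOR_SCRIPT.put(UnicodeScript.CYRILLIC, Locale.forLanguageTag("ru"));
    }

    // Naive guess: the script of the first code point decides the group.
    static UnicodeScript guessScript(String s) {
        return s.isEmpty() ? UnicodeScript.UNKNOWN : UnicodeScript.of(s.codePointAt(0));
    }

    static List<String> sort(List<String> input) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<UnicodeScript, Locale> entry : LOCALE_FOR_SCRIPT.entrySet()) {
            List<String> group = new ArrayList<>();
            for (String s : input) {
                if (guessScript(s) == entry.getKey()) {
                    group.add(s);
                }
            }
            // Locale-specific ordering inside the group.
            group.sort(Collator.getInstance(entry.getValue()));
            result.addAll(group);
        }
        // Anything in an unmapped script goes last, in natural String order.
        List<String> rest = new ArrayList<>();
        for (String s : input) {
            if (!LOCALE_FOR_SCRIPT.containsKey(guessScript(s))) {
                rest.add(s);
            }
        }
        rest.sort(Comparator.naturalOrder());
        result.addAll(rest);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of("Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок")));
    }
}
```

Note that this only distinguishes scripts. It cannot tell English from Spanish inside the Latin group, which is exactly the 'mar' problem described above.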

As with anything else, it depends on how much you can afford to put into the solution, and what kind of performance you need.

This suggestion is not the answer you're looking for: if there's any way to identify the locale when initially storing the strings, you should do so, and record it as part of the string's metadata. Then you won't have this problem.
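
If the locale is recorded with each string, the sort no longer has to guess. A minimal sketch, assuming a hypothetical `Tagged` holder that keeps a BCP 47 language tag next to the text; the class names and the "group by tag, then collate" ordering are illustrative choices, not something the question prescribes.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class TaggedStringSort {
    // Hypothetical holder: the text plus the language tag recorded when it was stored.
    static final class Tagged {
        final String text;
        final String languageTag;

        Tagged(String text, String languageTag) {
            this.text = text;
            this.languageTag = languageTag;
        }
    }

    // Orders by language tag first, then by that locale's collation rules.
    // A real implementation would probably cache one Collator per tag.
    static final Comparator<Tagged> BY_LOCALE_THEN_TEXT = (a, b) -> {
        int byTag = a.languageTag.compareTo(b.languageTag);
        if (byTag != 0) {
            return byTag;
        }
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.languageTag));
        return collator.compare(a.text, b.text);
    };

    // Returns the texts in sorted order, leaving the input list untouched.
    static List<String> sortedTexts(List<Tagged> input) {
        List<Tagged> copy = new ArrayList<>(input);
        copy.sort(BY_LOCALE_THEN_TEXT);
        List<String> texts = new ArrayList<>();
        for (Tagged t : copy) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Tagged> items = List.of(
                new Tagged("mar", "es"),
                new Tagged("mar", "en"),
                new Tagged("apple", "en"),
                new Tagged("casa", "es"));
        // prints [apple, mar, casa, mar]
        System.out.println(sortedTexts(items));
    }
}
```

Here the two 'mar' entries land in different groups because their stored tags differ, with no guessing involved.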

Zac Thompson answered Apr 27 '23 07:04


With all the caveats above, here is one "standard universal multilingual sorting": the Unicode Collation Algorithm (UCA), which is NOT the codepoint order. From a cursory glance at this page, ICU seems to handle the mixture of UCA and locale-specific preferences.
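
To see that collation order and codepoint order really differ, here is a small sketch. It uses the JDK's built-in java.text.Collator as a stand-in so it runs without extra libraries; that class follows the JDK's own rules and is not a full UCA implementation. ICU4J provides its own Collator class in com.ibm.icu.text, which is what you would use for actual UCA behaviour. The class and method names below are made up for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class CollatorVsCodePoint {
    // Natural String order, i.e. String.compareTo (UTF-16 code units).
    static List<String> byCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Rule-based collation for the given locale (JDK rules, standing in for ICU).
    static List<String> byCollator(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "Baby", "cherry", "Zebra");
        // prints [Baby, Zebra, apple, cherry]: uppercase sorts before lowercase
        System.out.println(byCodeUnit(words));
        // prints [apple, Baby, cherry, Zebra]: letters compared first, case second
        System.out.println(byCollator(words, Locale.ROOT));
    }
}
```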

Frédéric Grosshans answered Apr 27 '23 05:04