
How does the handling of combining characters in the Unicode Collation Algorithm work?

Tags: unicode, uca

I maintain an open-source, pure-Python implementation of the Unicode Collation Algorithm called pyuca.

While it meets my needs in sorting Ancient Greek text (and seems to meet the needs of many other people), I'm looking to improve its coverage of rarer cases by getting it to the point where it passes the entire suite of official conformance tests.

However, 1,869 of the tests (just over 1%) fail. The first failure is for the sequence 0332 0334, which according to the test files should get the sort key | 004A 0021 | 0002 0002 |.

pyuca, however, forms the sort key | 0021 004A | 0002 0002 |.
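To make the difference between the two keys concrete, here is a minimal sketch of how a three-level sort key is assembled from collation elements. The weights in `CE` are not looked up in allkeys.txt; they are inferred from the two sort keys quoted in this question, and `CE`, `sort_key` and `format_key` are hypothetical names, not pyuca's API.

```python
# Collation elements as (primary, secondary, tertiary) tuples.
# ASSUMPTION: weights inferred from the sort keys quoted in the question.
CE = {
    "\u0332": (0x0000, 0x0021, 0x0002),
    "\u0334": (0x0000, 0x004A, 0x0002),
}

def sort_key(text):
    """Collect the non-zero weights of each level, in string order."""
    elements = [CE[ch] for ch in text]
    return [[ce[level] for ce in elements if ce[level] != 0]
            for level in range(3)]

def format_key(levels):
    """Render the levels in the '| ... | ... |' style used by the test files."""
    out = ""
    for weights in levels:
        out += "".join(f"{w:04X} " for w in weights) + "| "
    return out.strip()

key_as_given = format_key(sort_key("\u0332\u0334"))
key_swapped = format_key(sort_key("\u0334\u0332"))
print(key_as_given)  # | 0021 004A | 0002 0002 |
print(key_swapped)   # | 004A 0021 | 0002 0002 |
```

Under these assumed weights, the key expected by the test file is what you get when 0334 is processed before 0332.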

At first I thought this might be due to a lack of support for non-starter characters (S2.1.1 through S2.1.3 of the algorithm in the latest spec). However, implementing that part did nothing to change the sort key. Working through the algorithm manually on paper also fails to trigger that section, which has me wondering whether I'm just missing something.

The relevant steps in the algorithm are:

S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
S2.1.3 If there is a match, replace S by S + C, and remove C.

The key phrase is "If there is a match". In the failing test mentioned above, there is no match for 0332 0334 in the table, so this part of the algorithm cannot explain why the sort key should be in a different order from the one my implementation produces.
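For reference, steps S2.1.1 through S2.1.3 can be sketched roughly as follows. This is only an illustration under stated assumptions: `table` is a plain set of strings standing in for the collation element table, `extend_with_nonstarters` is a hypothetical helper rather than pyuca's code, and "blocked" is taken to mean that some character between S and C is a starter or has a combining class greater than or equal to that of C.

```python
import unicodedata

def extend_with_nonstarters(s, rest, table):
    """Try to extend the match S with unblocked non-starters from `rest`.

    Returns the (possibly longer) match and the characters left over.
    """
    rest = list(rest)
    max_skipped_ccc = 0  # highest combining class among skipped characters
    i = 0
    while i < len(rest):
        c = rest[i]
        ccc = unicodedata.combining(c)
        if ccc == 0:
            break  # a starter ends the run of non-starters
        blocked = max_skipped_ccc >= ccc
        if not blocked and s + c in table:
            s += c       # S2.1.3: replace S by S + C ...
            del rest[i]  # ... and remove C
        else:
            max_skipped_ccc = max(max_skipped_ccc, ccc)
            i += 1
    return s, "".join(rest)

# Hypothetical table with one contraction: "a" + U+0301
demo_table = {"a", "a\u0301"}
# U+0323 (class 220) is skipped, U+0301 (class 230) is not blocked by it
matched = extend_with_nonstarters("a", "\u0323\u0301", demo_table)
# U+0300 (class 230) blocks U+0301 (class 230)
blocked_case = extend_with_nonstarters("a", "\u0300\u0301", demo_table)
# The failing test: no entry for 0332 0334, so nothing changes
no_match = extend_with_nonstarters("\u0332", "\u0334", {"\u0332", "\u0334"})
```

In the last call the input comes back unchanged, which matches the observation that this section never fires for 0332 0334.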

Can anyone explain what part of the UCA would form a sort key like the test file suggests?

asked Feb 01 '14 04:02 by James Tauber




1 Answer

Does it work better if you shove the string into Normalization Form D first? (Step 1.)

This is an utter wild guess based on the fact that 0332 0334 is not in NFD. I haven't tried to work through the algorithm at all.
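A quick way to check this guess is Python's standard `unicodedata` module. The snippet below only shows what NFD does to the test sequence; whether pyuca normalizes its input is not something it assumes either way.

```python
import unicodedata

s = "\u0332\u0334"  # the failing sequence, in the order given by the test file

# Canonical combining classes of the two characters
classes = [unicodedata.combining(ch) for ch in s]

# NFD applies canonical ordering, sorting adjacent non-starters by class
nfd = unicodedata.normalize("NFD", s)
nfd_hex = " ".join(f"{ord(ch):04X}" for ch in nfd)

print(classes)  # [220, 1]
print(nfd_hex)  # 0334 0332
```

Because 0334 has a lower combining class than 0332, NFD swaps the two characters, and collation elements would then be looked up for 0334 first.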

answered Sep 22 '22 08:09 by Andrew Plotkin