I am working on a C project that needs to generate "case insensitive" normalized forms of pieces of Unicode text. I have chosen to define the normalized form as that achieved by first converting to normalization form NFD, then applying the Unicode case folding algorithm, and finally converting the result to Unicode normalization form NFC.
I am relying on ICU's C API for its Unicode representation and utility functions, and it was fairly straightforward to implement my scheme using ICU's unorm_normalize()
and u_strFoldCase()
functions. But one of my tests is failing, and I don't understand why. ICU seems to be generating a different NFC form than I expected.
The input sequence consists of these BMP code points:
U+0020, U+1EA5, U+0328, U+1EC4, U+031C
Via a debugger, I determined that ICU and I agree about the intermediate result after case folding:
U+0020 U+0061 U+0328 U+0302 U+0301 U+0065 U+031C U+0302 U+0303
Note in particular that the earlier conversion to form NFD moved character U+031C into the middle of the decomposition of U+1EC4, as appropriate based on relative CCC numbers for the characters involved. That's part of what I'm trying to test.
Now the good part: according to ICU, the NFC normalization of the folded character sequence is
U+0020 U+0105 U+0302 U+0301 U+1ec5 U+031C
whereas I think it should be
U+0020 U+0105 U+0302 U+0301 U+0065 U+031C U+0302 U+0303
because the three trailing combining characters are already in canonical order, and there is no canonical composition of U+0065 and U+031C.
So, two questions:
ICU is correct. To understand why, have a look at the Canonical Composition Algorithm which is defined in chapter 3 of the Unicode Standard:
D117 Canonical Composition Algorithm: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the final character, perform the following steps:
R1 Seek back (left) in the coded character sequence from the character C to find the last Starter L preceding C in the character sequence.
R2 If there is such an L, and C is not blocked from L, and there exists a Primary Composite P which is canonically equivalent to the sequence <L, C>, then replace L by P in the sequence and delete C from the sequence.
You also have to understand the preceding definitions, especially:
D115 Blocked: Let A and C be two characters in a coded character sequence <A, ... C>. C is blocked from A if and only if ccc(A)=0 and there exists some character B between A and C in the coded character sequence, i.e., <A, ... B, ... C>, and either ccc(B)=0 or ccc(B) >= ccc(C).
Now consider the following substring of your input sequence:
U+0065 U+031C U+0302 U+0303
We start with character U+031C
and seek back to the last starter which is U+0065
:
U+0065 U+031C U+0302 U+0303
L C
C is obviously not blocked from L but there's no primary composite equivalent to <L, C>
so we proceed to the next character:
U+0065 U+031C U+0302 U+0303
L C
Now C still isn't blocked from L (that's what you probably misunderstood) because ccc(U+031C) = 220 < 230 = ccc(U+0302)
and there exists a primary composite U+00EA
equivalent to U+0065 U+0302
. So we replace L and delete C:
U+00EA U+031C U+0303
L C
Again, C isn't blocked from L and the primary composite U+1EC5
is equivalent to U+00EA U+0303
so the final result of the composition is:
U+1EC5 U+031C
This matches the output from ICU.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With