Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is this Unicode NFC conversion correct?

I am working on a C project that needs to generate "case insensitive" normalized forms of pieces of Unicode text. I have chosen to define the normalized form as that achieved by first converting to normalization form NFD, then applying the Unicode case folding algorithm, and finally converting the result to Unicode normalization form NFC.

I am relying on ICU's C API for its Unicode representation and utility functions, and it was fairly straightforward to implement my scheme using ICU's unorm_normalize() and u_strFoldCase() functions. But one of my tests is failing, and I don't understand why. ICU seems to be generating a different NFC form than I expected.

The input sequence consists of these BMP code points:

U+0020, U+1EA5, U+0328, U+1EC4, U+031C

Via a debugger, I determined that ICU and I agree about the intermediate result after case folding:

U+0020 U+0061 U+0328 U+0302 U+0301 U+0065 U+031C U+0302 U+0303

Note in particular that the earlier conversion to form NFD moved character U+031C into the middle of the decomposition of U+1EC4, as appropriate based on relative CCC numbers for the characters involved. That's part of what I'm trying to test.

Now the good part: according to ICU, the NFC normalization of the folded character sequence is

U+0020 U+0105 U+0302 U+0301 U+1ec5 U+031C

whereas I think it should be

U+0020 U+0105 U+0302 U+0301 U+0065 U+031C U+0302 U+0303

because the three trailing combining characters are already in canonical order, and there is no canonical composition of U+0065 and U+031C.

So, two questions:

  1. Which is the correct NFC form?
  2. If ICU is correct then why?
like image 598
John Bollinger Avatar asked Mar 19 '23 06:03

John Bollinger


1 Answers

ICU is correct. To understand why, have a look at the Canonical Composition Algorithm which is defined in chapter 3 of the Unicode Standard:

D117 Canonical Composition Algorithm: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the final character, perform the following steps:

R1 Seek back (left) in the coded character sequence from the character C to find the last Starter L preceding C in the character sequence.

R2 If there is such an L, and C is not blocked from L, and there exists a Primary Composite P which is canonically equivalent to the sequence <L, C>, then replace L by P in the sequence and delete C from the sequence.

You also have to understand the preceding definitions, especially:

D115 Blocked: Let A and C be two characters in a coded character sequence <A, ... C>. C is blocked from A if and only if ccc(A)=0 and there exists some character B between A and C in the coded character sequence, i.e., <A, ... B, ... C>, and either ccc(B)=0 or ccc(B) >= ccc(C).

Now consider the following substring of your input sequence:

U+0065 U+031C U+0302 U+0303

We start with character U+031C and seek back to the last starter which is U+0065:

U+0065 U+031C U+0302 U+0303
L      C

C is obviously not blocked from L but there's no primary composite equivalent to <L, C> so we proceed to the next character:

U+0065 U+031C U+0302 U+0303
L             C

Now C still isn't blocked from L (that's what you probably misunderstood) because ccc(U+031C) = 220 < 230 = ccc(U+0302)and there exists a primary composite U+00EA equivalent to U+0065 U+0302. So we replace L and delete C:

U+00EA U+031C U+0303
L             C

Again, C isn't blocked from L and the primary composite U+1EC5 is equivalent to U+00EA U+0303 so the final result of the composition is:

U+1EC5 U+031C

This matches the output from ICU.

like image 154
nwellnhof Avatar answered Apr 01 '23 05:04

nwellnhof