
Normalization needed after case folding

Given an NFC-normalized string, if I apply full case folding to it, can I assume that the result is also NFC-normalized?

I don't understand what the Unicode standard is trying to tell me in this quote:

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then case folding the result, then putting the result into Normalization Form NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeatedly applying Q does not change the result; case folding is closed under canonical normalization for either Normalization Form NFC or NFD.

dalle asked Mar 18 '26 02:03



2 Answers

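An NFC-normalized string might no longer be in NFC after case folding. An example is U+00DF (LATIN SMALL LETTER SHARP S) followed by U+0301 (COMBINING ACUTE ACCENT): folding expands the sharp s to "ss", and the second "s" then composes with the accent.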

X = U+00DF U+0301
NFC(X) = U+00DF U+0301
toCasefold(NFC(X)) = U+0073 U+0073 U+0301
NFC(toCasefold(NFC(X))) = U+0073 U+015B
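
The same steps can be reproduced as a minimal Python sketch. It assumes that Python's built-in str.casefold() is an acceptable stand-in for toCasefold (it performs full case folding); the helper names is_nfc and code_points are made up for this illustration.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# Assumption: str.casefold() stands in for the standard's toCasefold.
x = "\u00DF\u0301"      # sharp s followed by combining acute accent
folded = x.casefold()   # full case folding expands the sharp s
renormalized = unicodedata.normalize("NFC", folded)

print(is_nfc(x))                  # is the input in NFC?
print(code_points(folded))        # code points after case folding
print(is_nfc(folded))             # is the folded string still in NFC?
print(code_points(renormalized))  # code points after renormalizing
```

If the folded string turns out not to be in NFC, normalization has to be applied again after folding whenever an NFC result is required.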
nwellnhof answered Mar 21 '26 23:03



You have asked two questions:

Question 1: Is toCasefold(NFC(X)) binary equal to NFC(toCasefold(NFC(X)))?

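The quoted passage doesn't explicitly answer this question. In general the answer is no: full case folding can expand a character into a sequence whose last character then composes with a following combining mark, as the U+00DF U+0301 example in the other answer shows.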

Question 2: What is the Unicode standard telling me in the quote?

The standard is saying only that case folding does not need to be repeated after canonical normalization. In other words, canonical normalization (to NFC or NFD) of a case-folded string does not reintroduce anything that case folding would change, which is why Q(Q(X)) = Q(X). This doesn't answer your first question.

It is not saying whether or not it is necessary to do canonical normalization again after case folding.
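
The idempotence claim from the quote can be spot-checked with a short Python sketch. It assumes str.casefold() as a stand-in for toCasefold; the function name q and the sample strings are chosen here for illustration, and a handful of samples is of course not a proof.

```python
import unicodedata


def q(x: str) -> str:
    """Q(X) = NFC(toCasefold(NFD(X))), using str.casefold for toCasefold."""
    decomposed = unicodedata.normalize("NFD", x)
    return unicodedata.normalize("NFC", decomposed.casefold())


# Illustrative samples, including characters whose full case folding
# expands to more than one code point.
samples = [
    "\u00DF\u0301",           # sharp s + combining acute accent
    "Stra\u00DFe",            # word containing sharp s
    "\u0130",                 # capital I with dot above
    "\u01F0",                 # small j with caron
    "\u00C5ngstr\u00F6m",     # precomposed letters with diacritics
]

for s in samples:
    once = q(s)
    twice = q(once)
    # Idempotence: applying Q a second time should change nothing.
    print(repr(s), once == twice)
```

Note that Q ends with an NFC step, so this check says nothing about whether toCasefold alone preserves NFC.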

Anthony Faull answered Mar 21 '26 23:03
