
Unicode case folding to upper case

I'm trying to implement a library for reading Microsoft CFB (Compound File Binary) Format files, according to the official specification of that format. The specification is available from this site.

In a nutshell: some of the file's structures are stored in a red-black tree, and I have a problem with the comparison predicate used to order them in that tree. The specification says that, if the names of these structures differ (the strings are stored as UTF-16, the standard in Windows APIs), it is necessary to iterate through every UTF-16 code point and:

(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.

The <2> reference says that:

For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.

However, when I looked up the referenced case folding file and read UTR #21 "Case Mapping", which it references, I realized that case folding is defined as an operation that resembles lower-casing much more than upper-casing.

By using CaseFolding-4.txt, we can obtain the case folding mappings of upper-case letters to lower-case ones. The mapping is always 1-to-1, since full foldings (those that expand to multiple characters) aren't needed here. However, the reverse mapping of lower-case letters to upper-case ones isn't straightforward anymore. For example,

0392; C; 03B2; # GREEK CAPITAL LETTER BETA
03D0; C; 03B2; # GREEK BETA SYMBOL

Thus, we have no way of knowing whether 03B2 should be converted to 0392 or 03D0. Does the standard define something like folding to upper-case? Maybe I should use case folding, and then convert to upper-case? Or have I understood the specification completely wrong?
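The ambiguity can be reproduced by inverting just the two folding entries quoted above. This is a minimal sketch in Python; `FOLDING_LINES` and `invert_folding` are illustrative names, not anything taken from the specification or from Unicode.

```python
from collections import defaultdict

# The two CaseFolding-4.txt entries quoted above: code; status; mapping; # name
FOLDING_LINES = [
    "0392; C; 03B2; # GREEK CAPITAL LETTER BETA",
    "03D0; C; 03B2; # GREEK BETA SYMBOL",
]

def invert_folding(lines):
    """Map each folded code point to the set of code points that fold to it."""
    reverse = defaultdict(set)
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        source, _status, target = [f.strip() for f in data.split(";")[:3]]
        reverse[int(target, 16)].add(int(source, 16))
    return dict(reverse)

reverse = invert_folding(FOLDING_LINES)
candidates = sorted(reverse[0x03B2])
print([f"{cp:04X}" for cp in candidates])  # ['0392', '03D0']
```

Two candidates for a single folded code point means the folding table alone cannot be turned into an upper-casing function.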

Daniel Kamil Kozar asked Nov 24 '13 21:11


2 Answers

Summary: The wording used by Microsoft is confusing, to say the least. It appears that simple upper case mapping should be done, though I can't be certain.


Background

Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).

Now there are two variants of case mapping and case folding, simple and full. Unlike the simple transformation, the full one can change string length and, as you rightly point out, is not needed here. The specification specifically mentions simple, which is probably the only clear thing in this answer. I do feel the need to mention for future reference that the current Unicode Standard (6.3.0) says that the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
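The length-changing behaviour of the full transformations is easy to observe in Python, whose `str.upper()` and `str.casefold()` implement the full variants. The sketch below only illustrates the simple/full distinction; it says nothing about what the compound file implementation does, and the variable names are illustrative.

```python
# str.upper() and str.casefold() apply the *full* transformations,
# so one character may expand to several.
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

full_upper = sharp_s.upper()    # expands to two characters
full_fold = sharp_s.casefold()  # expands to two characters

# An ordinary letter maps 1-to-1 under both transformations.
beta_upper = "\u03b2".upper()   # GREEK SMALL LETTER BETA -> U+0392

print(len(sharp_s), len(full_upper), len(full_fold))  # 1 2 2
print(f"{ord(beta_upper):04X}")                       # 0392
```

A simple transformation never produces such an expansion, which is what makes it usable for a per-code-point comparison.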

Spec Analysis

(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.

To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:

For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.

They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding but erroneously thought that it was based on upper-casing rather than lower-casing. Even this is a stretch, but it's the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.

I've found in section 2.6.1 the following which supports some form of upper-casing:

[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.

Note that they do in fact use the term mapping here.

The exception list

Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.

One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).

Another interpretation is that they do in fact erroneously believe that it's the reverse mapping that's the correct one. As you mentioned though, this runs into trouble, as the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.

A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.

As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!

Conclusion

If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.

  • Simple upper case mapping
  • Simple case folding, followed by simple upper case mapping
  • Simple case folding

To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.

I highly doubt it's the last option (simple case folding only), though. The other two options would give very similar results, with only a small number of code points possibly differing.

It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.
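To make the three candidates concrete, here is a hedged sketch of a comparison predicate with a pluggable per-character transformation. All function names are hypothetical. The two `approx_*` helpers are assumptions: they approximate the simple variants by keeping Python's full result only when it is a single character, which is not exactly the Unicode simple mapping for every code point. Only the per-code-unit step from the quoted passage is modelled; any other ordering rule in the specification is outside this sketch.

```python
def approx_simple_upper(ch):
    # Assumption: str.upper() is the full mapping; keep only 1-to-1 results.
    up = ch.upper()
    return up if len(up) == 1 else ch

def approx_simple_fold(ch):
    # Assumption: str.casefold() is the full folding; keep only 1-to-1 results.
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

def fold_then_upper(ch):
    return approx_simple_upper(approx_simple_fold(ch))

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def compare_names(a, b, transform):
    """Return -1, 0 or 1 by comparing transformed names code unit by code unit."""
    ka = utf16_units("".join(transform(c) for c in a))
    kb = utf16_units("".join(transform(c) for c in b))
    return (ka > kb) - (ka < kb)

by_upper = compare_names("a", "_", approx_simple_upper)
by_fold = compare_names("A", "_", approx_simple_fold)
by_both = compare_names("a", "_", fold_then_upper)
print(by_upper, by_fold, by_both)  # -1 1 -1
```

Even for ASCII names the choice matters: 'A' (0x41) sorts before '_' (0x5F) while 'a' (0x61) sorts after it, so folding alone would order the tree differently from either upper-casing option.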

DPenner1 answered Nov 08 '22 15:11


In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard Version 6.2 – Core Specification the text refers to UnicodeData.txt. This contains:

03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392

which indicates that both the Greek small letter beta (03B2) and the Greek beta symbol (03D0) have the simple upper-case mapping 0392, the Greek capital letter beta. As an aside, the <compat> entry shows that the beta symbol has a compatibility decomposition to the small letter. The file also contains the remainder of the bidirectional case conversion you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.
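Reading the simple upper-case field from the two records quoted above can be sketched as follows. In UnicodeData.txt the fields are separated by semicolons and the simple upper-case mapping is the field at index 12; `UNICODE_DATA_LINES` and `simple_uppercase_map` are illustrative names.

```python
# The two UnicodeData.txt records quoted above.
UNICODE_DATA_LINES = [
    "03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392",
    "03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;"
    "GREEK SMALL LETTER CURLED BETA;;0392;;0392",
]

UPPERCASE_FIELD = 12  # index of the simple upper-case mapping

def simple_uppercase_map(lines):
    """Build {code point: simple upper-case code point} from UnicodeData records."""
    mapping = {}
    for line in lines:
        fields = line.split(";")
        upper = fields[UPPERCASE_FIELD]
        if upper:  # an empty field means the character has no upper-case mapping
            mapping[int(fields[0], 16)] = int(upper, 16)
    return mapping

upper_map = simple_uppercase_map(UNICODE_DATA_LINES)
for code, upper in sorted(upper_map.items()):
    print(f"{code:04X} -> {upper:04X}")
# 03B2 -> 0392
# 03D0 -> 0392
```

Unlike the inverted folding table, this direction is unambiguous: each code point has at most one simple upper-case mapping.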

Pekka answered Nov 08 '22 15:11