
Font unicode glyph mapping to actual characters

I'm trying to display all glyphs in a font. I'm using GetFontUnicodeRanges to get the available characters, then I create a bitmap with all the available characters and their index next to each one.

I used the font "Wingdings 2" as a test case and compared the result with Windows' charmap.exe. All the characters appear, but some appear more than once (480 glyphs in total in that non-Unicode font), and the positions are not the same as in charmap. For instance, the medium-sized circle glyph is located at 0x97 in charmap, but in the font it is glyph 0xF097, and I think it is also the one at 0x2014.

I want to use the font in the "regular" way, meaning I want to see the same data as in charmap.exe. As a side note, I would also like to know whether a font is a Unicode font or an ASCII font, as charmap shows. Basically, you can say I am trying to write my own charmap from scratch.

How can I fill in that missing data? I was looking through the Windows' fonts and text APIs, but couldn't find anything to help me, so I must be missing some relevant APIs. What are they?

Itai Bar-Haim asked Feb 15 '12 09:02


3 Answers

After struggling a lot with GetFontData and its documentation (it is not exactly lacking, but it is poorly organized and some data is indeed missing), I found a way to write my own CharMap. Here's what I found during development:

  1. The documentation will tell you to use a "trick" that is possible because the glyph location data comes right after the arrays in the cmap table. That does not mean the data is IN the cmap table. It is actually in the loca table.

  2. You also need to read the head table for the location format flag (indexToLocFormat, at byte offset 50) and the maxp table for the number-of-glyphs field (offset 4).

  3. It seems that in symbol fonts the character codes have 0xF000 added to their actual index, so instead of the regular ASCII codes you get Unicode values near the far end of the Unicode table. (You can tell that a font is a symbol font if the cmap header encoding ID is 0, at least in TTF format 4, which is the Microsoft format.) I subtracted 0xF000 from each character code, tested on the Wingdings[2,3] and Webdings fonts, and it worked just fine.
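As a rough illustration of point 3, here is a minimal Python sketch (the original code was C#, and the function names here are hypothetical). It assumes `cmap` holds the raw bytes of the cmap table, for example as returned by GetFontData, and that the table starts with the usual header of a version and a record count followed by 8-byte encoding records. A Microsoft-platform record (platform ID 3) with encoding ID 0 is treated only as a hint that the font is a symbol font, not as proof.

```python
import struct

def cmap_encoding_records(cmap: bytes):
    """Return (platformID, encodingID, subtable_offset) for each encoding record."""
    _version, num_tables = struct.unpack_from(">HH", cmap, 0)
    # Encoding records are 8 bytes each and start right after the 4-byte header.
    return [struct.unpack_from(">HHI", cmap, 4 + 8 * i) for i in range(num_tables)]

def looks_like_symbol_font(cmap: bytes) -> bool:
    # Platform 3 (Microsoft) with encoding 0 is the "Symbol" encoding.
    # This is a hint only; it is not a guaranteed test.
    return any(p == 3 and e == 0 for p, e, _ in cmap_encoding_records(cmap))

def charmap_code(code_point: int) -> int:
    """Map U+F000..U+F0FF back to 0x00..0xFF; leave other values unchanged."""
    if 0xF000 <= code_point <= 0xF0FF:
        return code_point - 0xF000
    return code_point
```

For the glyph mentioned in the question, `charmap_code(0xF097)` gives 0x97, the position charmap shows.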

I used the official documentation a lot: www.microsoft.com/typography/tt/ttf_spec/ttch02.doc, and the reference code: http://support.microsoft.com/kb/241020.

The reference code is written in C, so in order to write it in C# I read all the data into byte[] buffers and "manually" read each element from them.

Itai Bar-Haim answered Nov 01 '22 12:11


I went through this nightmare years ago too and I know a lot about all this stuff now. I figured I should pitch in and provide some answers.

1) You cannot assume that 'loca' follows 'cmap'. The order can vary by font. The location of each table is defined by the OffsetTable, which generally begins at byte 0 of the font file. (http://www.microsoft.com/typography/otspec/otff.htm)
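A minimal Python sketch of that lookup, assuming `font` holds the bytes of a single-font TTF/OTF file whose offset table starts at byte 0 (a TrueType Collection starts with a different header and is not handled here). The function names are hypothetical.

```python
import struct

def table_directory(font: bytes) -> dict:
    """Map each table tag to (offset, length), read from the table directory."""
    _sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
    tables = {}
    for i in range(num_tables):
        # The header is 12 bytes; each record is 16 bytes:
        # tag, checksum, offset from the start of the file, length.
        tag, _checksum, offset, length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (offset, length)
    return tables

def table_bytes(font: bytes, tag: str) -> bytes:
    """Return the raw bytes of the table with the given tag (KeyError if absent)."""
    offset, length = table_directory(font)[tag]
    return font[offset:offset + length]
```

With this, `table_bytes(font, "cmap")` and `table_bytes(font, "loca")` find each table wherever the font happens to place it.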

2) You cannot assume that "cmap header encoding id is 0, at least in TTF format 4" means a symbol font. I know for a fact that certain old Arabic fonts also use that encoding. To this day, I still do not know how to tell them apart. Windows does it, but I do not know how. I do not know how to be sure that a font is a symbol font. Even checking the OS/2 table for the code page bit 32 isn't enough in many cases.

3) You cannot simply take the magic number 0xF000 and add it to your small 0-255 number to get the character that gives you the glyph mapping you are after. That is because the meaning of those small 0 to 255 "ASCII" codes varies with your system locale.

Symbol fonts are special in the way Windows processes them.

Unlike a normal font, where the mapping between glyphs and characters is static, a symbol font's mapping varies with the system default code page for non-Unicode applications, aka CP_ACP.

For example, suppose your symbol font has this glyph: '%'. If your system uses CP 1252 by default, then to render this glyph you might have to render the character value 0xC2.

If your system uses CP 1251 by default, then to render the same glyph you would have to render the character value 0x412, which is entirely different.

Put another way, the font's Unicode ranges vary with the default non-Unicode code page!

After investigation, we discovered that the valid character values for such a font are the values obtained by converting 0 through 255 to Unicode as if they were CP_ACP values.

What does this mean? It means that you want to use MultiByteToWideChar with CP_ACP to get the mapping from the values 0 to 255 to their localized Unicode values based on your system locale (CP_ACP).

Doing that gives you a map like this (shown here for CP 1251):

ASCII -> localized non-static UNICODE
0x00 -> 0x00
0x01 -> 0x01
0x02 -> 0x02
...
0xC2 -> 0x412 <----- This is correct: the value will be different in some cases.
...
0xE3 -> 0x433

The 0xF000 to 0xF0FF values are the static UNICODE values : they never change.

So to get the glyph ID for a "localized non-static UNICODE" value, first use the map above to find the corresponding ASCII value, then add 0xF000 to it, and look up the glyph ID for the result.
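The lookup described above can be sketched in Python. This is only an approximation under stated assumptions: Python's built-in code page codecs stand in for MultiByteToWideChar with CP_ACP, the code page name is passed explicitly instead of being read from the system, and bytes that the Python codec leaves undefined are skipped, which may differ from what Windows does. The function names are hypothetical.

```python
def acp_to_unicode(codepage: str) -> dict:
    """Map each byte 0..255 to the code point it decodes to in `codepage`."""
    mapping = {}
    for byte in range(256):
        try:
            mapping[byte] = ord(bytes([byte]).decode(codepage))
        except UnicodeDecodeError:
            pass  # undefined in Python's codec; Windows may map it differently
    return mapping

def symbol_code_point(localized: int, codepage: str):
    """Return the static 0xF0xx value for a localized code point, or None."""
    for byte, code_point in acp_to_unicode(codepage).items():
        if code_point == localized:
            return 0xF000 + byte
    return None
```

For example, `symbol_code_point(0x2014, "cp1252")` returns 0xF097, which matches the 0x97, 0x2014 and 0xF097 values seen in the question. The returned value is the static code to look up in the font's cmap.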

Of course, none of this nonsense is documented by MS... or at least I could never find it.

Claude Peloquin answered Nov 01 '22 10:11


I've never looked at "WingDings 2" in detail, but it's very common for glyphs to be reused for different characters. For example, uppercase Roman A and uppercase Greek alpha are frequently the same glyph.

However, I guess the equality of 0x97, 0xF097 and 0x2014 is some kind of hack to deal with windows-1252. In the windows-1252 codepage, 0x97 is an em-dash, which is 0x2014 in Unicode. 0xF097 is in the private use area; I guess it is providing a Unicode-compatible (and reversible) way of encoding the windows-1252 0x97.

In my experience, the most reliable way to get an unambiguous list of the unicode characters supported by a font is to parse the cmap table from the ttf file. This is a bit of a chore (cmap supports something like six different encodings) but it is documented online. You can use the GetFontData function to get the raw data, or parse the ttf directly.
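To give an idea of the work involved, here is a minimal Python sketch that decodes one cmap subtable in format 4, the format discussed in the other answers. It assumes `cmap` holds the raw cmap table bytes and that `offset` is the subtable offset taken from the cmap header's encoding records. The other subtable formats are not handled, and the function name is hypothetical.

```python
import struct

def format4_mapping(cmap: bytes, offset: int) -> dict:
    """Return {character code: glyph index} for the format 4 subtable at `offset`."""
    fmt, _length, _language, seg_count_x2 = struct.unpack_from(">HHHH", cmap, offset)
    if fmt != 4:
        raise ValueError(f"expected cmap format 4, got {fmt}")
    seg_count = seg_count_x2 // 2
    end_pos = offset + 14                      # endCode[] follows the 14-byte header
    start_pos = end_pos + seg_count_x2 + 2     # skip endCode[] and reservedPad
    delta_pos = start_pos + seg_count_x2
    range_pos = delta_pos + seg_count_x2
    array = f">{seg_count}H"
    ends = struct.unpack_from(array, cmap, end_pos)
    starts = struct.unpack_from(array, cmap, start_pos)
    deltas = struct.unpack_from(array, cmap, delta_pos)
    range_offsets = struct.unpack_from(array, cmap, range_pos)

    mapping = {}
    for i in range(seg_count):
        for code in range(starts[i], ends[i] + 1):
            if code == 0xFFFF:
                continue  # the final segment is only a terminator
            if range_offsets[i] == 0:
                glyph = (code + deltas[i]) & 0xFFFF
            else:
                # idRangeOffset is relative to its own position in the table.
                pos = range_pos + 2 * i + range_offsets[i] + 2 * (code - starts[i])
                (glyph,) = struct.unpack_from(">H", cmap, pos)
                if glyph != 0:
                    glyph = (glyph + deltas[i]) & 0xFFFF
            if glyph != 0:  # glyph 0 means "missing character"
                mapping[code] = glyph
    return mapping
```

The keys of the returned dictionary are the character codes the subtable supports, so `sorted(format4_mapping(cmap, offset))` gives the list to display.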

charmap uses the GetFontData function and the code includes the string "cmap", suggesting that charmap is also doing this.

The Windows SDK Debugging Tools include logger.exe, which records all the APIs used by an app. You can use this if you want to be really sure what charmap is doing.

arx answered Nov 01 '22 12:11