We are seeing an issue when we try to save a webpage containing CJK characters as a PDF using Chrome's Print option.
The character Chrome writes into the PDF looks the same visually, but its Unicode code point is different.
Below is a basic HTML file:
<HTML>
<HEAD>
<TITLE>Test Character</TITLE>
</HEAD>
<BODY>
子
</BODY>
</HTML>
When the HTML is opened in Chrome, the character is U+5B50:
https://graphemica.com/%E5%AD%90
But the corresponding character in the PDF is U+2F26:
https://graphemica.com/%E2%BC%A6
Link to the HTML and PDF files:
https://1drv.ms/f/s!Aq5YnvMOo4V8iVzdRyjmX3X5L0TD
First, I would like to understand why this is happening, and then what a workaround might be. Is there any utility that can convert my character into the one Chrome is going to write into the PDF?
OS version: macOS 10.13.6 (17G65)
Chrome version: 75.0.3770.100 (Official Build) (64-bit)
My understanding is that a PDF does not actually contain the string of characters you see when the document is rendered, but rather sequences of font glyphs and supporting lookup tables that map those glyphs back to character codes. In OP's test case, the font used for the CJK character on macOS is STSongti-SC-Regular, and its glyph id is hex 0436.
I can only reproduce OP's behavior on macOS. On both Linux and Windows, I see the glyph mapped to the character that was originally in the HTML file: U+5B50. I compared the mappings using output from the peepdf utility.
The operations to go from character-to-glyph and glyph-to-character are done in the onCharsToGlyphs() and populate_glyph_to_unicode() methods of skia's SkFontHost_mac.cpp, respectively. On macOS, both of these rely on calls to CTFontGetGlyphsForCharacters() from the Core Text lib, iterating through every possible character to build the mapping tables.
I boiled that approach down to the following test code, which prints out each glyph id and corresponding character code for a given font:
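// Requires <Foundation/Foundation.h> and <CoreText/CoreText.h>.
NSString *fontName = @"STSongti-SC-Regular";
CTFontRef fontRef = CTFontCreateWithName((CFStringRef)fontName, 10.0, NULL);
CFCharacterSetRef charSet = CTFontCopyCharacterSet(fontRef);
// One bit per character code; a set bit means the font covers that character.
CFDataRef bitmap = CFCharacterSetCreateBitmapRepresentation(kCFAllocatorDefault, charSet);
CFIndex length = CFDataGetLength(bitmap);
const UInt8 *bits = CFDataGetBytePtr(bitmap);
for (CFIndex i = 0; i < length; i++) {
    int mask = bits[i];
    if (!mask)
        continue;
    for (int j = 0; j < 8; j++) {
        CGGlyph glyph;
        UniChar unichar = (UniChar)((i << 3) + j);
        // Print "glyph id, character code" for every covered character.
        if ((mask & (1 << j)) && CTFontGetGlyphsForCharacters(fontRef, &unichar, &glyph, 1)) {
            NSLog(@"%04x %04x", glyph, unichar);
        }
    }
}
// The Copy/Create calls above return owned references.
CFRelease(bitmap);
CFRelease(charSet);
CFRelease(fontRef);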
Looking through the output, there are two character codes for our glyph id:
0436 2f26
0436 5b50
It encounters 2f26 first, which is significant because when building the lookup table, if a character code has already been determined for a glyph (and its value is >= 0x20), it does not get overwritten:
if (CTFontGetGlyphsForCharacters(ctFont, utf16, glyphs, count)) {
    // ...
    if (glyphToUnicode[glyphs[0]] < 0x20) {
        glyphToUnicode[glyphs[0]] = codepoint;
    }
}
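The first-write-wins effect of that check can be reproduced in isolation. The following is a minimal Python sketch, not Skia's code: `build_glyph_to_unicode` is a hypothetical helper, and its input is just the two glyph/character pairs from the test output, in the order they were encountered.

```python
def build_glyph_to_unicode(pairs):
    """Build a glyph -> code point table in which the first code point >= 0x20 wins."""
    table = {}
    for glyph, codepoint in pairs:
        # Mirror the quoted check: only write while the slot is unset (0)
        # or still holds a value below 0x20.
        if table.get(glyph, 0) < 0x20:
            table[glyph] = codepoint
    return table


# Both character codes share glyph 0436; 2f26 is seen before 5b50.
pairs = [(0x0436, 0x2F26), (0x0436, 0x5B50)]
table = build_glyph_to_unicode(pairs)
print(f"{table[0x0436]:04x}")
```

Feeding the same pairs in the opposite order would record 5b50 instead, which is why the iteration order matters here.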
So, ultimately what I believe is happening is:
1. Chrome looks up the STSongti-SC-Regular glyph id for 5B50 and finds it to be 0436. It uses this glyph for the CJK character in the PDF.
2. It then builds the glyph-to-character lookup table for STSongti-SC-Regular by iterating through all possible characters. Since 0436 maps to two codes and it encounters 2f26 first, that's what gets recorded, and is the value that is returned when copying and pasting from the document.