 

Chrome Save as PDF changing CJK characters

We are seeing an issue when we try to save a webpage containing CJK characters as a PDF using Chrome's Print option.

The character that Chrome renders in the PDF looks visually identical, but its Unicode code point is different.

Below is a minimal HTML file that reproduces the issue.

<HTML>

<HEAD>
  Test Character
</HEAD>

<BODY>
  子
</BODY>

</HTML>

When the HTML is opened in Chrome, the character is U+5B50:
https://graphemica.com/%E5%AD%90

But the corresponding character in the PDF is U+2F26 (KANGXI RADICAL CHILD):
https://graphemica.com/%E2%BC%A6
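
The two code points can be read directly off the percent-encoded URLs above. As a quick check, here is a minimal Python sketch (standard library only; the helper name `describe` is just for illustration) that decodes each URL's last path segment and reports the character's code point and Unicode name:

```python
import unicodedata
from urllib.parse import unquote

def describe(url):
    # The last path segment is the percent-encoded UTF-8 form of one character.
    ch = unquote(url.rsplit("/", 1)[-1])
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

html_char = describe("https://graphemica.com/%E5%AD%90")
pdf_char = describe("https://graphemica.com/%E2%BC%A6")

print(html_char)  # the character in the HTML file
print(pdf_char)   # the character copied out of the PDF
```

The first line reports U+5B50 (a CJK unified ideograph) and the second reports U+2F26 (a Kangxi radical), so the two characters are distinct code points even though they look the same.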

Link for the HTML and PDF
https://1drv.ms/f/s!Aq5YnvMOo4V8iVzdRyjmX3X5L0TD

First, I would like to understand why this is happening, and then what the workaround could be. Is there any utility that can convert my character into the one Chrome is going to write to the PDF?

OS Version : macOS 10.13.6 (17G65)

Chrome Version : 75.0.3770.100 (Official Build) (64-bit)

asked Jul 03 '19 18:07 by Abhishek Garg


1 Answer

My understanding is that a PDF does not actually contain the string of characters you see when the document is rendered. Instead, it contains sequences of font glyphs plus supporting lookup tables that map those glyphs back to character codes. In OP's test case, the font used for the CJK character on macOS is STSongti-SC-Regular, and the glyph id is hex 0436.

I can only reproduce OP's behavior on macOS. On both Linux and Windows, I see the glyph mapped to the character that was originally in the HTML file: U+5B50. An example comparison is shown below in output from the peepdf utility:

[Image: peepdf output comparison]

The character-to-glyph and glyph-to-character conversions are done in the onCharsToGlyphs() and populate_glyph_to_unicode() methods of Skia's SkFontHost_mac.cpp, respectively. On macOS, both of these rely on calls to CTFontGetGlyphsForCharacters() from the Core Text library, iterating through every possible character to build the mapping tables.

I boiled that approach down to the following test code, which prints each glyph id and the corresponding character code for a given font:

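// Create the font whose character-to-glyph mapping we want to inspect.
NSString *fontName = @"STSongti-SC-Regular";
CTFontRef fontRef = CTFontCreateWithName((__bridge CFStringRef)fontName, 10.0, NULL);

// Bitmap of the characters the font supports: one bit per code point.
CFCharacterSetRef charSet = CTFontCopyCharacterSet(fontRef);
CFDataRef bitmap = CFCharacterSetCreateBitmapRepresentation(kCFAllocatorDefault, charSet);
CFIndex length = CFDataGetLength(bitmap);
const UInt8 *bits = CFDataGetBytePtr(bitmap);

// The first 8192 bytes cover the BMP (U+0000..U+FFFF), which is all a
// single UniChar can address, so stop there.
for (CFIndex i = 0; i < length && i < 8192; i++) {
    int mask = bits[i];
    if (!mask)
        continue;
    for (int j = 0; j < 8; j++) {
        CGGlyph glyph;
        UniChar unichar = (UniChar)((i << 3) + j);
        // Print "glyph id, character code" for every supported character.
        if ((mask & (1 << j)) && CTFontGetGlyphsForCharacters(fontRef, &unichar, &glyph, 1)) {
            NSLog(@"%04x %04x", glyph, unichar);
        }
    }
}

// Release the Core Foundation objects created above.
CFRelease(bitmap);
CFRelease(charSet);
CFRelease(fontRef);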

Looking through the output, there are two character codes for our glyph code:

0436 2f26
0436 5b50

The loop encounters 2f26 first, which is significant: when the lookup table is built, a glyph that already has a character code recorded (with a value >= 0x20) does not get overwritten:

if (CTFontGetGlyphsForCharacters(ctFont, utf16, glyphs, count)) {
    // ...
    if (glyphToUnicode[glyphs[0]] < 0x20) {
        glyphToUnicode[glyphs[0]] = codepoint;
    }
}
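
To make the "first one wins" behavior concrete, here is a small Python simulation of that table-building logic. It is only a sketch of my reading of the snippet above, not Skia's actual code: the function name `build_glyph_to_unicode` is made up, a dict stands in for Skia's array, and the character-to-glyph data is limited to the two entries from my test output.

```python
def build_glyph_to_unicode(char_to_glyph):
    """Build a glyph -> code point table, visiting code points in ascending order.

    An entry is only written while the glyph has no usable value yet
    (missing, or below 0x20), so the lowest code point wins.
    """
    table = {}
    for codepoint in sorted(char_to_glyph):
        glyph = char_to_glyph[codepoint]
        if table.get(glyph, 0) < 0x20:
            table[glyph] = codepoint
    return table

# Both characters map to glyph 0x0436 in STSongti-SC-Regular (per the output above).
char_to_glyph = {0x5B50: 0x0436, 0x2F26: 0x0436}

glyph_to_unicode = build_glyph_to_unicode(char_to_glyph)
print(f"glyph {0x0436:04x} -> U+{glyph_to_unicode[0x0436]:04X}")
```

Because U+2F26 sorts before U+5B50, the table ends up mapping glyph 0436 to U+2F26, and the later U+5B50 entry is skipped.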

So, ultimately what I believe is happening is:

  1. Chrome correctly determines the STSongti-SC-Regular glyph id for 5B50 to be 0436. It uses this glyph for the CJK character in the PDF.
  2. Then, it builds the glyph-to-charcode lookup table for STSongti-SC-Regular by iterating through all possible characters. Since 0436 maps to two codes and it encounters 2f26 first, that is the code that gets recorded, and it is the value returned when copying and pasting from the document.
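
As for a workaround, one possible approach is to post-process text extracted from the PDF. In the Unicode Character Database, each Kangxi radical (U+2F00..U+2FD5) has a compatibility mapping to its unified ideograph, so NFKC normalization turns U+2F26 back into U+5B50. Below is a minimal Python sketch of that idea. The function names are hypothetical, and `predict_pdf_text` rests on the assumption that the font maps each radical and its ideograph to the same glyph. I have only confirmed that for U+5B50 in STSongti-SC-Regular, so treat it as a guess for anything else.

```python
import unicodedata

# Kangxi radicals block: the 214 assigned code points U+2F00..U+2FD5.
KANGXI = range(0x2F00, 0x2FD6)

# radical -> unified ideograph, taken from each radical's compatibility mapping
RADICAL_TO_IDEOGRAPH = {
    cp: ord(unicodedata.normalize("NFKC", chr(cp))) for cp in KANGXI
}
# unified ideograph -> radical (the reverse direction)
IDEOGRAPH_TO_RADICAL = {v: k for k, v in RADICAL_TO_IDEOGRAPH.items()}

def fix_copied_text(text):
    """Replace Kangxi radicals in text copied from the PDF with unified ideographs."""
    return text.translate(RADICAL_TO_IDEOGRAPH)

def predict_pdf_text(text):
    """Guess what copying from the PDF would yield (assumes shared glyphs)."""
    return text.translate(IDEOGRAPH_TO_RADICAL)

restored = fix_copied_text("\u2f26")
predicted = predict_pdf_text("\u5b50")
print(f"U+2F26 -> U+{ord(restored):04X}")
print(f"U+5B50 -> U+{ord(predicted):04X}")
```

Restricting the mapping to the Kangxi block, rather than running NFKC over the whole string, leaves other characters (full-width forms, ligatures and so on) untouched.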
answered Oct 22 '22 13:10 by cody