I am trying to produce a PDF file with Hebrew text.
I managed to produce a simple file.
The file opens perfectly in Adobe Acrobat Reader, showing the string "אאא ווו תתת". It also opens perfectly in IE.
The problem is that other viewers show it badly: Google Chrome / Google Docs show it without any of the "ו" occurrences (that is, the three "ו" letters disappear!).
Mozilla Firefox shows it very badly, with some letters appearing many times and in odd places on the page...
What am I doing wrong? What is wrong in the file?
A link to the file is here
I know this is a tough question.
Any help will be appreciated...
Fonts in PDF are PDF objects: Font dictionaries containing numerous parameters and sub-dictionaries that are needed to select glyphs, show them, and translate character codes to a logical (Unicode) representation for content extraction. Fonts in layman's terms, as we see them in *.ttf or *.pfb files, are called font programs; they are either embedded or external, and are referred to by one of the sub-dictionaries of the Font object.
Fonts are divided into two groups:
Simple fonts, whose encoding is either specified in the Font object (by predefined name or explicitly) or, under special circumstances, constructed according to defined rules by the viewer application. The file in question doesn't contain simple fonts, so we won't discuss them any further; note, though, that this over-simplified description doesn't even begin to reflect their real-life complexity.
Composite fonts, which have a descendant CIDFont and, similar to the encoding for simple fonts, a CMap object that maps character codes to character selectors, which in PDF are always CIDs: integers up to 65535. Now, a character selector (CID) is not, in general, used directly to select glyphs from the font program. For a CIDFont of type CIDFontType2, the dictionary contains a CIDToGIDMap entry that, as the name suggests, maps CIDs to glyph identifiers. Those GIDs are, at last, used to select glyphs from the embedded font program (which, for a CIDFontType2 font, is a TrueType font program; do not confuse it with a Font object of Subtype TrueType).
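As a rough illustration of the CID-to-GID step, assuming the CIDToGIDMap is given as a stream rather than as a predefined name: the decoded stream is a flat table of 2-byte big-endian glyph indices, with the entry for CID c at byte offset 2*c. The function names and the tiny table below are hypothetical, and treating a missing entry as GID 0 is a choice made for this sketch.

```python
import struct

def build_cid_to_gid_stream(mapping, size):
    """Build a decoded CIDToGIDMap stream: one 2-byte big-endian GID per CID."""
    table = bytearray(2 * size)  # entries not set explicitly stay at GID 0
    for cid, gid in mapping.items():
        struct.pack_into(">H", table, 2 * cid, gid)
    return bytes(table)

def gid_for_cid(stream, cid):
    """Read the GID stored at byte offset 2 * cid."""
    offset = 2 * cid
    if offset + 2 > len(stream):
        return 0  # assumption of this sketch: out-of-table CIDs map to GID 0
    return struct.unpack_from(">H", stream, offset)[0]

# Illustrative values only.
stream = build_cid_to_gid_stream({10: 180, 32: 159}, size=64)
print(gid_for_cid(stream, 10))
```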
A Font object can also have a ToUnicode resource that maps character codes to Unicode values for indexing, searching and extraction. It's called a ToUnicode CMap (as it follows similar syntax), but it should not be confused with the CMap object mentioned above.
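To give a feel for what such a resource looks like, here is a minimal sketch that reads the bfchar entries of a ToUnicode CMap. The sample text is constructed for illustration and is not copied from the file under discussion; the function name is hypothetical, and bfrange sections are deliberately ignored.

```python
import re

# Constructed sample, not the actual stream from any real file.
SAMPLE_TOUNICODE = """\
3 beginbfchar
<000a> <05EA>
<0020> <05D5>
<0025> <05D0>
endbfchar
"""

BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Return {source code: Unicode string} for every bfchar entry found."""
    result = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_ENTRY.findall(block):
            # Destination values are UTF-16BE, so decode rather than int().
            result[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return result

for code, text in parse_bfchar(SAMPLE_TOUNICODE).items():
    print(f"0x{code:04x} -> U+{ord(text):04X}")
```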
In what I call the simple case (and, I think, the sensible choice), the CMap is the predefined name Identity-H and the CIDToGIDMap is the predefined name Identity; therefore, character codes extracted from a string (the argument to a text-showing operator) are always 2-byte numbers that, effectively, directly select glyphs from the embedded TrueType program. In my experience this is the most common scenario and, as it appears, the case against which common software is tested.
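The simple case can be sketched in a few lines: every pair of bytes is one big-endian code, and with both mappings being identity that number is also the CID and the GID. The function name and the sample bytes are made up for illustration, and rejecting odd-length input is a choice of this sketch.

```python
def decode_identity_h(data):
    """Split a string operand into 2-byte big-endian codes (Identity-H)."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# With Identity-H and CIDToGIDMap Identity: code == CID == GID.
glyph_ids = decode_identity_h(bytes.fromhex("00B4 009F 009A"))
print(glyph_ids)
```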
But that's not the case with the file in question.
In our file, the text-showing operator effectively gets this string:
0x000a 0x000a 0x000a 0x20 0x0020 0x0020 0x0020 0x20 0x0025 0x0025 0x0025
Of course, there are no 'groups' in the actual string; I grouped the bytes myself, based on the CMap, which contains 2 codespace ranges:
<20> <20>
<0000> <19FF>
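The grouping can be reproduced mechanically. Below is a minimal sketch of splitting the raw bytes into codes using those two ranges: at each position it tries a 1-byte code first, then a 2-byte one, and a code matches a range when each of its bytes lies between the corresponding low and high bytes. The function names are hypothetical, and a real viewer's handling of invalid codes is more elaborate than simply raising an error.

```python
# The two ranges quoted above, as (low, high) byte strings.
CODESPACE = [(b"\x20", b"\x20"), (b"\x00\x00", b"\x19\xff")]

def in_range(code, low, high):
    """Same length as the range, and every byte within its low/high byte."""
    return len(code) == len(low) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )

def split_codes(data, codespace=CODESPACE):
    """Split raw string bytes into variable-length character codes."""
    codes, pos = [], 0
    max_len = max(len(low) for low, _ in codespace)
    while pos < len(data):
        for length in range(1, max_len + 1):
            candidate = data[pos:pos + length]
            if any(in_range(candidate, lo, hi) for lo, hi in codespace):
                codes.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no codespace range matches at byte {pos}")
    return codes

RAW = bytes.fromhex("000a000a000a 20 002000200020 20 002500250025")
print([c.hex() for c in split_codes(RAW)])
```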
To make a long story short: if we look up the character codes in the CMap to get CIDs, then look up the CIDs in the CIDToGIDMap to get GIDs, then look up the GIDs in the embedded David-Bold font to get Unicode values, we arrive at this table:
Code    CID  GID  Unicode  Name
0x000a  10   180  05EA     tav
0x0020  32   159  05D5     vav
0x0025  37   154  05D0     alef
0x20    228  03   0020     space
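The chain of lookups behind the table can be written out as three dictionaries. This is only a sketch: the values are the ones from the table, while the dictionary and function names are made up, and a real viewer would read these mappings from the CMap, the CIDToGIDMap and the font program rather than from literals.

```python
# Values taken from the table; names are illustrative.
CMAP = {b"\x00\x0a": 10, b"\x00\x20": 32, b"\x00\x25": 37, b"\x20": 228}
CID_TO_GID = {10: 180, 32: 159, 37: 154, 228: 3}
GID_TO_UNICODE = {180: 0x05EA, 159: 0x05D5, 154: 0x05D0, 3: 0x0020}

def resolve(code):
    """Follow code -> CID -> GID -> Unicode for one character code."""
    cid = CMAP[code]
    gid = CID_TO_GID[cid]
    return cid, gid, GID_TO_UNICODE[gid]

# The string from the content stream, already split into codes.
codes = [b"\x00\x0a"] * 3 + [b"\x20"] + [b"\x00\x20"] * 3 + [b"\x20"] + [b"\x00\x25"] * 3
code_points = [resolve(c)[2] for c in codes]

for code in dict.fromkeys(codes):  # each distinct code once, in order of appearance
    cid, gid, uni = resolve(code)
    print(f"0x{code.hex()} -> CID {cid} -> GID {gid} -> U+{uni:04X}")
```

The glyphs come out in drawing order (tav, space, vav, space, alef, left to right), which reads right to left as the string shown in the question.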
Now we have enough information to speculate about what confuses the viewer applications.
In my first attempt, I suggested the culprit was code 32 (and CID 32) being used for a non-space character (see comment above). This assumption was based on a case, several years ago, when an older version of Acrobat didn't show a character with code 0x20 when it was at the end of a string, assuming it to be a space, when in fact, according to the encoding vector (of a simple font), it was another character.
I changed this:
0x0020 to 0x0004 in the content stream;
the CIDToGIDMap entry for CID=4 to GID=159;
the Widths array entry of CID=4 to the 'vav' width;
the ToUnicode CMap was adjusted accordingly.
(The <0020> 32 string from the CMap: not reflected in the file linked in the comment.)
Well, it did help, but unfortunately some of the viewers still refused to comply with the specification.
Then I thought that maybe the variable character code width was the issue.
I returned to the original file and changed this:
0x20 to 0x00e4 in the content stream;
<20> 228 to <00e4> 228 in the CMap;
codespacerange <20> <20> in the CMap deleted;
codespacerange <20> <20> in the ToUnicode CMap deleted.
This file appears to open perfectly in all the viewers mentioned in the original question and the comments below. Miraculously, code 0x0020 and CID 32 do not interfere.
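One way to see why mixed code widths might trip a viewer is to imagine a hypothetical reader that ignores the codespace ranges and always consumes two bytes per code. Whether any of the viewers mentioned actually behaves like this is an assumption, not something established here; the sketch only compares how such a reader would see the original string and the modified one.

```python
def read_fixed_two_byte(data):
    """Hypothetical naive reader: always consume two bytes per code."""
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

# Original string: space is the single byte 0x20.
ORIGINAL = bytes.fromhex("000a000a000a 20 002000200020 20 002500250025")
# Modified string: space is the 2-byte code 0x00e4.
MODIFIED = bytes.fromhex("000a000a000a 00e4 002000200020 00e4 002500250025")

print(read_fixed_two_byte(ORIGINAL))
print(read_fixed_two_byte(MODIFIED))
```

With the original bytes, the single 0x20 shifts everything after it by one byte, so this naive reader never sees a 0x0020 code at all; with the modified bytes it recovers exactly the eleven intended codes. That is at least consistent with the missing "ו" letters, though it proves nothing about any particular viewer.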
The conclusion, I think, can be this:
Given the current state of affairs, PDF creators are NOT advised to mix single- and double-byte codes in a font encoding (CMap).