
PDF Tj command with angle brackets?

I'm trying to figure out where in an uncompressed PDF v1.4 document the Times font is used.

The /Font object describing the Times font within the PDF is object 65 as follows:

65 0 obj
<</Type /Font
/Subtype /TrueType
/BaseFont /PXAAAD+TimesNewRoman,Italic
/FirstChar 1
/LastChar 35
/Widths [250 333 333 333 500 500 500 500 500 500 500 500 500 500 333 722 722 833 666 610 500 556 500 443 443 500 277 443 500 389 389 277 500 443 500]
/FontDescriptor 205 0 R
/ToUnicode 206 0 R>>
endobj

It refers to /FontDescriptor object 205, which further defines the Times font, and to a /ToUnicode map in object 206, which describes the byte-to-Unicode character mapping. EDIT: After Ritsaert's initial answer to the question below, I'm adding the font's /ToUnicode object here to show the CMap he mentions.

206 0 obj
<</Length 208 0 R>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
35 beginbfchar
<01> <0020>
<02> <0028>
<03> <0029>
<04> <002d>
<05> <0030>
<06> <0031>
<07> <0032>
...
<23> <0101>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

endstream
endobj

I've now tracked down the use of the Times font object to a /Page object (one of many) like the following one which refers to font object 65 through the /F4 reference in its page /Resources:

12 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 432 648]
/Contents 92 0 R
/Resources <</Font <</F1 62 0 R
/F3 64 0 R
/F4 65 0 R>>
/ProcSet [/PDF /Text]>>
/Group <</S /Transparency
/CS /DeviceRGB>>>>
endobj

The /Contents stream (object 92 in the PDF file) is then full of text objects (enclosed in BT and ET), none of which contains text, but instead they use angle brackets full of numbers. For example, here is the only reference to the Times font /F4 whose use I'm trying to find:

92 0 obj
<</Length 93 0 R>>
stream
...
BT
0.5020 g
72.0000 615.1512 Td
/F4 12.0000 Tf
<0605> Tj
ET
...
endstream
endobj

But what do the angle brackets and the number <0605> refer to? A specific glyph in the font table? Looking at the PDF reference and section 5.3.2 I can't find mention of the angle brackets.

EDIT: Given the above code and the accepted answer that <0605> is a hex encoding of text, the bytes <06> and <05> are entries in the CMap of object 206 and map to the Unicode values <0031> and <0030> respectively. That is, the string <0605> refers to U+0031 (a "1") and U+0030 (a "0"), so the Times font is used for the string "10" on page object 12.
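For anyone repeating this search, here is a minimal Python sketch of the lookup, not a real PDF parser. It assumes the content stream is uncompressed, that text is shown only with `<hex> Tj`, and that the font uses one-byte codes. The names `strings_shown_with`, `decode`, `CONTENT` and `TO_UNICODE` are made up for illustration, and the mapping contains only the bfchar entries visible above.

```python
import re

# Excerpt of content stream 92 as shown in the question.
CONTENT = """\
BT
0.5020 g
72.0000 615.1512 Td
/F4 12.0000 Tf
<0605> Tj
ET
"""

# Only the bfchar entries visible in the question; the rest are elided there.
TO_UNICODE = {0x01: " ", 0x02: "(", 0x03: ")", 0x04: "-",
              0x05: "0", 0x06: "1", 0x07: "2"}


def strings_shown_with(font_name, content):
    """Yield hex strings passed to Tj while font_name is the selected font.

    Simplification: only `/Name size Tf` and `<hex> Tj` are recognised;
    literal strings, TJ arrays and the ' and " operators are ignored.
    """
    current = None
    pattern = re.compile(r"/(\S+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f\s]*)>\s*Tj")
    for m in pattern.finditer(content):
        if m.group(1) is not None:
            current = m.group(1)          # a Tf operator switched the font
        elif current == font_name:
            yield m.group(2)              # a Tj under the font of interest


def decode(hex_string, mapping):
    """Map each one-byte code of the hex string through the ToUnicode table."""
    return "".join(mapping[b] for b in bytes.fromhex(hex_string))


found = [decode(h, TO_UNICODE) for h in strings_shown_with("F4", CONTENT)]
print(found)  # ['10']
```

A real document would need a proper tokenizer, since the same page can also use literal strings in parentheses and TJ arrays.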

Jens asked Mar 31 '14 13:03


1 Answer

What is going on here:

  • In the content stream, the Tj operator is given the string <0605> to draw. A string enclosed in <> is a hex string, so the bytes 0x06 and 0x05 are drawn. This notation is explained in section 3.2.3 of the linked PDF reference.

  • Just before the text draw command the font F4 is selected using the Tf command.

  • The page's resource dictionary maps F4 to the font in object 65, generation 0. This font object is a subsetted TrueType font in which character codes 1..35 are defined. No /Encoding is specified, so the codes are interpreted through the font's built-in encoding rather than a standard one such as WinAnsiEncoding. The embedded font subset has thus rearranged its characters in a non-standard manner (this occurs quite often).
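To make the hex string notation concrete, here is a small Python sketch that turns a token such as <0605> into its bytes. It follows the two rules I recall from the hex string section of the PDF reference: whitespace inside the brackets is ignored, and a missing final digit is taken to be 0. The function name `parse_hex_string` is made up for illustration.

```python
def parse_hex_string(token):
    """Return the bytes encoded by a PDF hex string token such as '<0605>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hex string token")
    digits = "".join(token[1:-1].split())  # whitespace inside <> is ignored
    if len(digits) % 2:
        digits += "0"                      # a missing final digit counts as 0
    return bytes.fromhex(digits)


codes = list(parse_hex_string("<0605>"))
print(codes)  # [6, 5]
```

With a simple font like this one, each byte is one character code, so <0605> is the two codes 6 and 5.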

Now if you want to know how these character codes map to Unicode characters: the font has a /ToUnicode entry pointing to a stream that contains a CMap defining the mapping. This should be sufficient to convert the string to a Unicode string.
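As a rough illustration of that conversion, the Python sketch below reads the bfchar entries out of CMap text and applies them to the bytes of <0605>. It assumes one-byte source codes, UTF-16BE destination values, and a CMap that uses only beginbfchar sections (bfrange sections are not handled). `parse_bfchar` and `CMAP_EXCERPT` are made-up names, and the excerpt is just the part of object 206 quoted in the question, with its elided entries left elided.

```python
import re

# The visible part of ToUnicode stream 206, copied from the question.
CMAP_EXCERPT = """\
35 beginbfchar
<01> <0020>
<02> <0028>
<03> <0029>
<04> <002d>
<05> <0030>
<06> <0031>
<07> <0032>
...
<23> <0101>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Build {character code: Unicode string} from beginbfchar sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block)
        for src, dst in pairs:
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


to_unicode = parse_bfchar(CMAP_EXCERPT)
text = "".join(to_unicode[code] for code in bytes.fromhex("0605"))
print(text)  # 10
```

A code that falls in the elided part of the table would raise a KeyError here, which is a reasonable signal that the full stream is needed.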

Ritsaert Hornstra answered Nov 30 '22 02:11