
Reading a PDF version >= 1.5: how to handle the cross-reference stream dictionary

Tags:

pdf

I'm trying to read the xref table of a PDF file, version >= 1.5.

The xref table is an object:

58 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<CB05990F613E2FCB6120F059A2BCA25B><E2ED9D17A60FB145B03010B70517FC30>]/Index[38 39]/Info 37 0 R/Length 96/Prev 67529/Root 39 0 R/Size 77/Type/XRef/W[1 2 1]>>stream
hÞbbd``b`:$AD`­Ì ‰Õ Vˆ8âXAÄ×HÈ$€t¨  – ÁwHp·‚ŒZ$ìÄb!&F†­ .#5‰ÿŒ>(more here but can't paste)
endstream
endobj

As you can see:

  • /Filter /FlateDecode
  • /Index [38 39], that is, 39 entries in the stream, starting at object 38
  • /W [1 2 1], that is, each entry is 1 + 2 + 1 = 4 bytes long
  • /Root 39 0 R, that is, the root object is number 39

But:

The decompressed stream is 195 bytes long (39 * 5 = 195). So is an entry 4 or 5 bytes long?

Here are the first inflated bytes:

02 01 00 10 00 02 00 02 cd 00 02 00 01 51 00 02 00 01 70 00 02 00 05 7a 00 02
            ^^

If the entry length is 4, then the root entry (object 39) is a free object (see the ^^)!

If the entry length is 5, how should the fields of one entry be interpreted (with reference to the PDF Reference, chapter 3.4.7, Table 3.16)?

For object 38, the first in the stream: since it is of type 2, it seems to be the object at index 16 in object stream number 256, but there is no object 256 in my PDF file!

The question is: how should I handle the 195 bytes?

asked May 22 '14 17:05 by tschmit007


1 Answer

A cross-reference stream may have been run through one of the PNG predictor filters before compression. If the /Predictor value is 10 or greater ("a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data") [1], the PNG row filter type is supplied inside the decompressed data "as usual", i.e. in the first byte of each row. Each row carries /Columns data bytes (4 here, which matches the sum of the /W widths), so with the filter byte every row is 5 bytes long. That accounts for the 195 bytes: 39 rows of 1 + 4 bytes.

Width [1 2 1] plus Predictor byte:

02 01 00 10 00
02 00 02 cd 00
02 00 01 51 00
02 00 01 70 00
02 00 05 7a 00
02 .. .. .. ..
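
Splitting the inflated data into rows like this and undoing the filters can be sketched in Python as follows. This is a minimal sketch, not a complete implementation: the function name `undo_png_predictor` is made up for illustration, it assumes one byte per pixel (the defaults /Colors 1 and /BitsPerComponent 8, as in this xref stream), and it only handles PNG filter types 0 (None), 1 (Sub) and 2 (Up); Average and Paeth are left out.

```python
def undo_png_predictor(data: bytes, columns: int) -> bytes:
    """Undo PNG row filters on inflated data (hypothetical helper).

    Assumes 1 byte per pixel, so 'left' is simply the previous byte.
    """
    row_len = columns + 1  # one filter-type byte precedes every row
    if len(data) % row_len:
        raise ValueError("data length is not a multiple of the row length")

    prev = bytes(columns)  # the row above the first row counts as all zeros
    out = bytearray()
    for start in range(0, len(data), row_len):
        filter_type = data[start]
        raw = data[start + 1:start + row_len]
        if filter_type == 0:    # None: bytes are stored unchanged
            row = bytearray(raw)
        elif filter_type == 1:  # Sub: add the already decoded byte to the left
            row = bytearray(columns)
            for i, b in enumerate(raw):
                left = row[i - 1] if i else 0
                row[i] = (b + left) & 0xFF
        elif filter_type == 2:  # Up: add the byte above, truncated to 8 bits
            row = bytearray((b + p) & 0xFF for b, p in zip(raw, prev))
        else:
            raise NotImplementedError(f"PNG filter type {filter_type}")
        out += row
        prev = row
    return bytes(out)


# First row from the question: filter byte 02, then 4 data bytes.
print(undo_png_predictor(bytes.fromhex("0201001000"), columns=4).hex())
# 01001000
```

The input is expected to be the output of the Flate step (for example `zlib.decompress` on the raw stream bytes), and `columns` is the /Columns value from /DecodeParms.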

After applying the row filters ('2', or 'Up', for all of these rows), you get this:

01 00 10 00
01 02 dd 00
01 03 2e 00
01 04 9e 00
01 09 18 00
.. .. .. ..

Note: these values were calculated by hand. The PNG 'Up' filter works byte by byte: each byte is added to the byte directly above it in the previous row (all zeros for the first row), and each sum is truncated to 8 bits with no carry into the neighbouring byte. For example, in the third row 51h + DDh = 12Eh, which is stored as 2Eh.

This leads to the following Type 1 XRef references ("type 1 entries define objects that are in use but are not compressed (corresponding to n entries in a cross-reference table).") [2]:

#38 type 1: offset 10h, generation 0
#39 type 1: offset 2DDh, generation 0
#40 type 1: offset 32Eh, generation 0
#41 type 1: offset 49Eh, generation 0
#42 type 1: offset 918h, generation 0

So object 39, the /Root, is an ordinary in-use object, not a free one.
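
Turning the unfiltered rows into entries like these can be sketched as below. This is only an illustration under stated assumptions: `parse_xref_rows` is a hypothetical name, /W is assumed to hold exactly three widths, and /Index is assumed to hold a single [first count] pair as in the question (a real file may list several subsections). Fields are big-endian integers, and a width of 0 for the first field means the type defaults to 1.

```python
def parse_xref_rows(data: bytes, widths, first_obj: int):
    """Split unfiltered xref stream data into (object number, type, f2, f3).

    Meaning of f2/f3 per type:
      0: next free object number, generation
      1: byte offset in the file, generation
      2: object stream number, index within that object stream
    """
    entry_len = sum(widths)
    if entry_len == 0 or len(data) % entry_len:
        raise ValueError("data length does not match the /W widths")

    entries = []
    for n in range(len(data) // entry_len):
        pos = n * entry_len
        fields = []
        for width in widths:
            # An empty slice (width 0) decodes to 0.
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        if widths[0] == 0:
            fields[0] = 1  # an absent type field defaults to type 1
        entries.append((first_obj + n, *fields))
    return entries


# First unfiltered row from above, with /W [1 2 1] and /Index [38 39].
print(parse_xref_rows(bytes.fromhex("01001000"), [1, 2, 1], first_obj=38))
# [(38, 1, 16, 0)]
```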

[1] See "LZW and Flate Predictor Functions" in PDF Reference 1.7, 6th Ed., Section 3.3: Filters.

[2] As described in your Table 3.16 in PDF Reference 1.7.

answered Nov 08 '22 20:11 by Jongware