
Can't read some PDF files with iTextSharp

Tags: c#, pdf, eof, itext

I have a Win32 application that reads PDFs with iTextSharp and inserts an image into each document as a seal.

It has worked fine with 99% of the files we have processed for over a year, but lately some files simply cannot be read. When I execute the code below:

string inputfile = @"C:\test.pdf"; // verbatim string, so "\t" is not read as a tab
PdfReader reader = new PdfReader(inputfile);

It gives the exception:

System.NullReferenceException occurred
  Message="Object reference not set to an instance of an object."
  Source="itextsharp"
  StackTrace:
       at iTextSharp.text.pdf.PdfReader.ReadPages()
       at iTextSharp.text.pdf.PdfReader.ReadPdf()
       at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
       at iTextSharp.text.pdf.PdfReader..ctor(String filename)
       at MyApp.insertSeal() in C:\MyApp\Stamper.cs:line 659

The PDF files that throw this exception open normally in Adobe Reader, and when I open one of them in Acrobat and save it, my application can read the saved file.

Could the files be corrupted and yet still open in Adobe Reader?

I am sharing two sample files.

A file that does NOT work: Not-Ok-Version.pdf

And a file that works, after I opened and saved it with Acrobat: OK-Version.pdf

Guilherme de Jesus Santos asked Dec 06 '22 23:12


1 Answer

Here's the (java, sorry) source for readPages:

protected internal void ReadPages() {
  catalog = trailer.GetAsDict(PdfName.ROOT);
  rootPages = catalog.GetAsDict(PdfName.PAGES);
  pageRefs = new PageRefs(this);
}

trailer, catalog, rootPages, and pageRefs are all member variables of PdfReader.

If the trailer or the root/catalog object of a PDF is simply missing, your PDF is REALLY BADLY BROKEN. It's more likely that the xref table is a bit off, and the objects in question simply aren't exactly where they're supposed to be (which is Bad, but recoverable).

HOWEVER, when PdfReader first opens a PDF, it parses ALL the objects in the file, and converts them to the appropriate PdfObject-derived classes.

What it isn't doing is checking to see that the object number claimed by the xref table and the object number read in from the file Actually Match. Highly Unlikely, but possible. Bad software could write out their PDF objects in the wrong order but keep the byte offsets in the xref table correct. Software that overrode the object number from the xref table with the number from that particular byte offset in the file would be fine.
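The check described above can be sketched roughly as follows. This is only an illustration, assuming the whole file is already in a byte array; XrefCheck, ObjectNumberAt and Matches are hypothetical names, not part of iTextSharp. The helper reads the "N G obj" header at a byte offset so the number found there can be compared with the one the xref table claims.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class XrefCheck
{
    // Returns the object number found in an "N G obj" header starting at the
    // given byte offset, or -1 if no such header starts there.
    public static int ObjectNumberAt(byte[] pdf, int offset)
    {
        if (offset < 0 || offset >= pdf.Length) return -1;

        // An object header is short, so 32 bytes is plenty for "N G obj".
        int count = Math.Min(32, pdf.Length - offset);
        string head = Encoding.ASCII.GetString(pdf, offset, count);

        string[] parts = head.Split(new[] { ' ', '\r', '\n', '\t' },
                                    StringSplitOptions.RemoveEmptyEntries);
        int number, generation;
        if (parts.Length >= 3
            && int.TryParse(parts[0], out number)
            && int.TryParse(parts[1], out generation)
            && parts[2].StartsWith("obj", StringComparison.Ordinal))
        {
            return number;
        }
        return -1;
    }

    // True when the object at the offset carries the number the xref table claims.
    public static bool Matches(byte[] pdf, int offset, int expectedNumber)
    {
        return ObjectNumberAt(pdf, offset) == expectedNumber;
    }
}
```

A reader that wanted to tolerate this kind of damage could call Matches for each xref entry and trust the number found in the file when the two disagree.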

iText is not fine.

I still want to see the PDF.


Yep. That PDF is broken alright. Specifically:

The file's first 70kb or so define a pretty clean little PDF. Changes were then appended to the PDF.

Check that. Someone attempted to append changes to the PDF and failed. Badly. To understand just how badly, let me explain some of the internal syntax of a PDF, illustrated with this example:

%PDF-1.6
1 0 obj
<</Type/SomeObject ...>>
endobj
2 0 obj
<</Type/SomeOtherObj /Ref 1 0 R>>
endobj
3 0 obj
...
endobj
<etc>
xref
0 10
0000000000 65535 f
0000000010 00000 n
0000000049 00000 n
0000000098 00000 n
...
trailer
<</Root 4 0 R /Size 10>>
startxref 124
%%EOF

So we have a header/version "%PDF-1.v", a list of objects (the ones here are called dictionaries), a cross-reference (xref) table listing the byte offset of every object in the list, indexed by object number, and a trailer giving the root object and the number of objects in the PDF, followed by the byte offset to the 'x' in 'xref'.
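As a rough sketch of how a reader locates that last piece, the helper below pulls the number that follows the final "startxref" keyword. TrailerInfo and LastStartXref are hypothetical names, and the code assumes the keyword sits within the last 1024 bytes of the file, which is where readers conventionally look.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class TrailerInfo
{
    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword or the number is missing.
    public static long LastStartXref(byte[] pdf)
    {
        // Assumption: the keyword is within the last 1024 bytes of the file.
        int count = Math.Min(1024, pdf.Length);
        string tail = Encoding.ASCII.GetString(pdf, pdf.Length - count, count);

        int pos = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) return -1;

        // Skip the whitespace between the keyword and the number.
        int i = pos + "startxref".Length;
        while (i < tail.Length && char.IsWhiteSpace(tail[i])) i++;

        // Collect the run of decimal digits.
        int start = i;
        while (i < tail.Length && tail[i] >= '0' && tail[i] <= '9') i++;
        if (i == start) return -1;

        long value;
        return long.TryParse(tail.Substring(start, i - start), out value) ? value : -1;
    }
}
```

The value returned is where a reader would seek to find the 'x' of the most recent xref table.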

You can append changes to an existing PDF. To do so you just add any new or changed objects after the existing %%EOF, a cross reference table to those new objects, and a trailer. The trailer of an appended change should include a /Prev key with the byte offset to the previous cross reference table.
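A crude textual check for a malformed append might look like the sketch below. AppendCheck and its methods are hypothetical names; the code assumes a classic "trailer" keyword (not a cross-reference stream) and assumes that "%%EOF" does not occur by coincidence inside binary stream data, so treat the result as a hint rather than a verdict.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class AppendCheck
{
    // Counts "%%EOF" markers; each revision of a PDF normally ends with one.
    public static int CountEofMarkers(byte[] pdf)
    {
        string text = Encoding.ASCII.GetString(pdf);
        int count = 0;
        int i = 0;
        while ((i = text.IndexOf("%%EOF", i, StringComparison.Ordinal)) >= 0)
        {
            count++;
            i += 5; // step past the marker just found
        }
        return count;
    }

    // True when the text between the last "trailer" and the following
    // "startxref" mentions a /Prev key.
    public static bool LastTrailerHasPrev(byte[] pdf)
    {
        string text = Encoding.ASCII.GetString(pdf);
        int trailer = text.LastIndexOf("trailer", StringComparison.Ordinal);
        if (trailer < 0) return false;

        int end = text.IndexOf("startxref", trailer, StringComparison.Ordinal);
        if (end < 0) end = text.Length;

        return text.IndexOf("/Prev", trailer, end - trailer, StringComparison.Ordinal) >= 0;
    }

    // More than one revision, but the newest trailer does not point back
    // to the previous xref table.
    public static bool LooksLikeBrokenAppend(byte[] pdf)
    {
        return CountEofMarkers(pdf) > 1 && !LastTrailerHasPrev(pdf);
    }
}
```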

In your NOT-OKAY pdf, someone tried to append changes to a PDF, AND FAILED HORRIBLY.

The original PDF is still there, intact. That's what Reader shows you, and what you get when you save the PDF. I hacked off everything after the first %%EOF in a hex editor, and the file was fine.

So here's the layout of your NOT-OKAY pdf:

%PDF1.4.1
1 0 obj...
2 through 7
xref
0 7
<healthy xref>
trailer <</Size 8 /Root 6 0 R /Info 7 0 R>>
startxref 68308
%%EOF

So far so good. Here's where things get ugly:

<binary garbage>
endstream
endobj
xref 
0 7
<horribly wrong xref>
trailer <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
startxref 223022
%%EOF

The only thing RIGHT about that section is the startxref value.

Problems:

  • The second trailer has no /Prev key.
  • ALL the byte offsets in the second xref table are wrong.
  • The <binary garbage> is part of a "stream" object, but the beginning of that object IS MISSING. Streams should look something like this:

1 0 obj
<</Type/SomeType/Length 123>>
stream
123 bytes of data
endstream
endobj

The end of this file is made up of some portion of a (compressed, I'd imagine) stream... but without the dictionary at the beginning telling us what filters it's using and how long it is (to say nothing of any missing data), you can't do anything with it.

I suspect that someone tried to completely rebuild this PDF, then accidentally wrote the original 70kb over the beginning of their version. Kaboom.

It would appear that Adobe is simply ignoring the bad appended changes. iText could do this too, but so can you:

When iText fails to open a PDF:

1. Search backwards through the file looking for the second-to-last %%EOF. Ignore the one at the very end; we want the previous state of the file.
2. Delete everything after the second-to-last %%EOF (if any), and try to open it again.
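The two steps above can be sketched as a small byte-level helper. PdfRepair and TruncateToPreviousRevision are hypothetical names; the code works on an in-memory copy rather than modifying the file, and it returns null when there is no earlier revision to fall back to.

```csharp
using System;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PdfRepair
{
    private static readonly byte[] Eof = { 0x25, 0x25, 0x45, 0x4F, 0x46 }; // "%%EOF"

    // Largest index <= from at which "%%EOF" starts, or -1 if there is none.
    private static int LastIndexOfEof(byte[] data, int from)
    {
        for (int i = Math.Min(from, data.Length - Eof.Length); i >= 0; i--)
        {
            bool match = true;
            for (int j = 0; j < Eof.Length; j++)
            {
                if (data[i + j] != Eof[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    // Returns a copy of the file that ends just after the second-to-last
    // "%%EOF" (plus a newline), or null when fewer than two markers exist.
    public static byte[] TruncateToPreviousRevision(byte[] pdf)
    {
        int last = LastIndexOfEof(pdf, pdf.Length - 1);
        if (last < 0) return null;

        int previous = LastIndexOfEof(pdf, last - 1);
        if (previous < 0) return null;

        int newLength = previous + Eof.Length;
        byte[] result = new byte[newLength + 1];
        Array.Copy(pdf, result, newLength);
        result[newLength] = (byte)'\n';
        return result;
    }
}
```

In the application from the question, one plausible use is to catch the exception thrown by new PdfReader(inputfile), load the file with File.ReadAllBytes, pass it through TruncateToPreviousRevision, and, if the result is not null, hand it to the PdfReader constructor that accepts a byte array.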

The sad thing is that this broken PDF could have been completely different from the "original" 70kb, and then some IO error overwrote the first part of the file. Unlikely, but there's no way to be sure.

Mark Storer answered Dec 26 '22 14:12