
pdftk will not decompress data streams

Tags: pdf, pdftk

I have been trying to use pdftk to inspect information in compressed pdf streams created by Nitro Reader, but pdftk will not inflate (decompress) the streams. It produces no errors, but it does not seem to do anything beyond reordering the pdf objects. Here is a minimal example of one of these pdfs.

    pdftk test.pdf output test-d.pdf uncompress

When I try pdftk on other pdfs, it seems to work fine. If I manually extract the data streams and decompress them using zlib in Python, they decompress properly. Also, if I open the pdf in Adobe Reader and re-save, pdftk works fine on the resulting pdf.
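For reference, the manual check described above could look roughly like the following. This is only a minimal sketch, not the exact script used: the function name `inflate_streams` and the regular expression are illustrative assumptions, and it simply tries zlib on everything found between `stream` and `endstream`, ignoring the `/Filter` and `/Length` entries a real parser would read.

```python
import re
import zlib

# Hypothetical helper pattern: grabs the raw bytes between the "stream" and
# "endstream" keywords. A real PDF parser would use the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed payload of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit between
            # the compressed data and the "endstream" keyword.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate-compressed (or uses another filter); skip it.
            continue
    return results
```

Calling `inflate_streams(open("test.pdf", "rb").read())` returns a list of byte strings, one per stream that zlib could inflate.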

I have manually inspected the Nitro pdf to the best of my ability, and it seems to be a valid pdf. I am very confused as to what is going on here.

As background to the problem, I have hundreds of these pdfs, and I am trying to search them for certain keywords, which I should be able to do if I can automate the decompression.

pdftk version 1.45
Windows 7 Home Premium SP1
Nitro Reader 2 version 2.5.0.36

Thanks, James

James Duvall asked Feb 25 '13 00:02


1 Answer

If you are not attached to pdftk, you can use qpdf. For instance, you could use:

    $ qpdf --stream-data=uncompress input.pdf output.pdf

For what it is worth, if the file contains binary blobs, they might still appear as binary data, although the rest of the streams will be uncompressed (with either pdftk or qpdf). qpdf lets you control how stream data is written through its --stream-data option.

From the qpdf manual:

When --stream-data=uncompress is specified, qpdf will attempt to remove any non-lossy filters that it supports. This includes /FlateDecode, /LZWDecode, /ASCII85Decode, and /ASCIIHexDecode. This can be very useful for inspecting the contents of various streams.

The same could happen with pdftk.
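Since the question mentions hundreds of files, the qpdf call can be wrapped in a small script. The following is a rough sketch rather than a tested tool: the function names (`build_qpdf_command`, `find_keywords`, `search_pdfs`) are made up for illustration, it assumes `qpdf` is on the PATH, and it does a plain byte search on the uncompressed output.

```python
import subprocess
import tempfile
from pathlib import Path


def build_qpdf_command(src, dst):
    """Build the qpdf invocation shown above for one input/output pair."""
    return ["qpdf", "--stream-data=uncompress", str(src), str(dst)]


def find_keywords(data, keywords):
    """Return the keywords (as bytes) that occur in the uncompressed data."""
    return [kw for kw in keywords if kw in data]


def search_pdfs(folder, keywords):
    """Uncompress every PDF in folder with qpdf and report keyword hits."""
    hits = {}
    with tempfile.TemporaryDirectory() as tmp:
        for src in sorted(Path(folder).glob("*.pdf")):
            dst = Path(tmp) / src.name
            result = subprocess.run(build_qpdf_command(src, dst))
            # qpdf exits with 0 on success and 3 on warnings; 2 signals an error.
            if result.returncode not in (0, 3):
                continue
            found = find_keywords(dst.read_bytes(), keywords)
            if found:
                hits[src.name] = found
    return hits
```

For example, `search_pdfs("pdfs", [b"invoice", b"total"])` would return a dict mapping file names to the keywords found in them. Keep in mind that text in a content stream can be split across several string operands or written with a font-specific encoding, so a raw byte search may miss some matches.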

gpoo answered Dec 27 '22 10:12