Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pdftk Error: Failed to open PDF file:

Tags:

php

pdf

pdftk

I am using pdftk library to extract the form fields from the pdf .Everything is just running fine except the one issue that i got a pdf file pdf file link. which causes the error is given bellow

Error: Failed to open PDF file: 
   http://www.uscis.gov/sites/default/files/files/form/i-9.pdf
Done.  Input errors, so no output created.

command for this is

root@ri8-MS-7788:/home/ri-8# pdftk http://192.168.1.43/form/i-9.pdf dump_data_fields

the same command is working for all other forms .

Attempt1

I have tried to encrypt the pdf to unsafe version but it produce the same error . here is the command

pdftk http://192.168.1.43/forms/i-9.pdf input_pw foopass output /var/www/forms/un-i-9.pdf

Update

this is my full function to handle this

public function Formanalysis($pdfname) 
    {
         $pdffile=Yii::app()->getBaseUrl(true).'/uploads/forms/'.$pdfname;
        exec("pdftk ".$pdffile." dump_data_fields 2>&1", $output,$retval);
        //got an error for some pdf if these are secure 
        if(strpos($output[0],'Error') !== false)
        {
            $unsafepdf=Yii::getPathOfAlias('webroot').'/uploads/forms/un-'.$pdfname;
            //echo "pdftk ".$pdffile." input_pw foopass output ".$unsafepdf;
            exec("pdftk ".$pdffile." input_pw foopass output ".$unsafepdf);
            exec("pdftk ".$unsafepdf." dump_data_fields 2>&1", $outputunsafe,$retval);
            return $outputunsafe ;
            //$response=array('0'=>'error','error'=>$output[0]);
            //return $response;

        }

        //if (strpos($output[0],'Error') !== false){ echo  "error to run" ; }   // this is the option to handle error 
        return $output;

    }
like image 220
Manoj Dhiman Avatar asked May 15 '15 12:05

Manoj Dhiman


2 Answers

this may be a little trick solution but should work for you . as @bruno said that this is encrypted file . You should decrypt this before you use for the pdftk . For this i found a way to decrypt that is qpdf a free opem source library to decrypt the pdf, remove the owner and user passwords etc and many more. You can find this here Qpdf. install it on your system . and run this command

qpdf --decrypt input.pdf output.pdf

then use the output file in the pdftk command . it should work .

like image 52
manoj dhiman Avatar answered Oct 05 '22 23:10

manoj dhiman


PdfTk is a tool that was created by compiling an obsolete version of iText to an executable using the GNU Compiler for Java (GCJ) (PdfTk is not endorsed by iText Group NV).

I have examined your PDF and it uses two technologies that weren't supported by iText at the time PdfTk was created: XFA and compressed cross-reference tables.

The latter is what causes your problem. PdfTk expects your file to end like this:

xref
0 7
0000000000 65535 f 
0000000258 00000 n 
0000000015 00000 n 
0000000346 00000 n 
0000000146 00000 n 
0000000397 00000 n 
0000000442 00000 n 
trailer
<</ID [<c8bf0ac531b0fc7b5b9ec5daf0296834><ec4dde54d00305ebbec62f3f6bbca974>]/Root 5 0 R/Size 7/Info 6 0 R>>
%iText-5.4.3
startxref
595
%%EOF

In this snippet startxref marks the byte offset of xref which is where the cross-reference table starts. This table contains the byte-offsets of all the objects in the PDF.

When you look at the PDF you refer to, you see that it ends like this:

64 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Encrypt 972 0 R/Filter/FlateDecode/ID[<85C47EA3EFE49E4CB0F087350055FDDC><C3F1748360D0464FBA02D711DE864630>]/Info 970 0 R/Length 283/Root 973 0 R/Size 971/Type/XRef/W[1 3 1]>>stream
hÞìÒ±JQЙ·»7J¢©ÕØ(Xþ„ù »h%¤É¤¶”€mZ+;ÁN,,ÁÆ6 XÁ&‚("î½YŒI‘Bî‡áμ]ö1Áð÷³cfþ‹ûÐÚLî`z„Ýôœùw÷N×X?ÙkNv`hÁÒj¦G[œiÀå»›œ?b½Än…ÉëàÍþ gY—i7WW‡òj®îÍ°u¸Ò‡Ñ:óÆÛ™ñÎë&'×݈§ü†ù!ÿñ€ù%,\ácçÙ9˜ì±Þ€S¼Ãd—‰Áy~×.ø¶Åìþßn_˜$9Ôüw£X9#åxzçgRüüóÙwÝ¡œÄNJ©½’Ú+©½’R{%µWR{%ÿ·á”;`_ z6Ø
endstream
endobj
startxref
116
%%EOF

In this case, startxref still refers to where the first cross-reference table starts (it's a linearized PDF), but the cross reference table is stored inside an object, and that object is compressed (see the gibberish between the stream and endstream keywords).

Compressed cross-reference tables and compressed objects were introduced in PDF 1.5 (2003), but they aren't supported by PdfTk. You'll have to find a tool that can deal with such streams (e.g. a recent version of iText, which is the real stuff when compared to PdfTk), or you have to save your PDF as a PDF 1.4 before you treat it with PdfTk (but you'll lose the XFA, because XFA was also introduced in PDF 1.5).

Update:

Since you are asking about form fields, I'm adding the following attachment:

enter image description here

This screenshot was taken using iText RUPS (which proves that iText can open the document). To the right, you see that the same form is defined twice:

enter image description here

If you would walk down the tree under Fields, you'd find all the fields that are stored in the PDF using AcroForm technology. To the left, you can see the description of such a field:

enter image description here

If you look under XFA, you notice that the same form is also defined using the XML Forms Architecture. If you click on datasets, you see the XML description of the dataset in the lower panel:

enter image description here

All of this information can be accessed programmatically using iText (Java) or iTextSharp (C#). PdfTk is merely a tool based on a very old version of this technology.

like image 31
Bruno Lowagie Avatar answered Oct 06 '22 00:10

Bruno Lowagie