
Could not find x-ref table PDF

Tags: pdf

I'm trying to load a PDF file so I can extract it as an image. I've tried a couple of Python packages, e.g. PyPDF2, but each time I get the error "Could not find xref table at specified location".

I don't have any experience with PDFs and Python, so any tips would be appreciated. An example file is given here:

https://beta.companieshouse.gov.uk/company/00002404/filing-history

where the PDF is the 'full accounts' link.

Many thanks in advance!

Mike Miller asked Dec 07 '22 16:12


2 Answers

You can use QPDF for this, since it can recover from a faulty xref table.

Just run qpdf broken.pdf repaired.pdf, where broken.pdf is the broken input PDF and repaired.pdf is the name of the new file.

I tried it with the PDF you linked to and it worked fine.
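Since the question is about Python, here is a minimal sketch of calling that same command from a script. It assumes the qpdf executable is installed and on your PATH; the function names build_qpdf_command and repair_pdf are made up for illustration. As far as I know, qpdf exits with code 3 when it had to issue warnings (which is what you would expect for a damaged xref table) but still wrote the output, so the sketch treats both 0 and 3 as success.

```python
import subprocess


def build_qpdf_command(broken_path, repaired_path):
    """Build the same command line as above: qpdf broken.pdf repaired.pdf."""
    return ["qpdf", broken_path, repaired_path]


def repair_pdf(broken_path, repaired_path):
    """Run qpdf and return whatever it printed to stderr (its warnings).

    Assumes the qpdf executable is available on PATH.
    """
    result = subprocess.run(
        build_qpdf_command(broken_path, repaired_path),
        capture_output=True,
        text=True,
    )
    # Exit code 0: no problems. Exit code 3: warnings, but output was written.
    # Anything else is treated as a failure here.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed ({result.returncode}): {result.stderr}")
    return result.stderr


if __name__ == "__main__":
    warnings = repair_pdf("broken.pdf", "repaired.pdf")
    print(warnings or "qpdf reported no warnings")
```

After that, you would point your PDF package at repaired.pdf instead of the original file.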

gettalong answered Jan 25 '23 13:01


The PDF in question is broken: The offset of the cross reference table and most object offsets in it are completely wrong.

For example, the PDF claims that the cross reference table starts at file position 24732, but it actually starts at position 1594356. And the cross reference table entry for object 208 claims it to be at position 24713, while it actually is at 1594337. Both offsets are off by the same amount, 1569624 bytes.

Thus the observed error message "Could not find xref table at specified location" is completely correct.
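If you want to check this kind of mismatch yourself, the following is a rough sketch in plain Python, not how any particular PDF library does it. It compares the offset written after the last startxref keyword with the position where an xref keyword can actually be found. The function names are invented for this example, and the search is only a heuristic: it assumes a classic cross reference table (not a cross reference stream) and that the bytes "xref" do not happen to occur inside binary stream data.

```python
import re


def claimed_xref_offset(data: bytes) -> int:
    """Return the number written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by a number")
    return int(match.group(1))


def actual_xref_offset(data: bytes) -> int:
    """Return the position of the last standalone 'xref' keyword, or -1.

    The lookbehind skips the 'xref' that is part of 'startxref'.
    """
    last = -1
    for match in re.finditer(rb"(?<!start)xref\s", data):
        last = match.start()
    return last


if __name__ == "__main__":
    with open("broken.pdf", "rb") as handle:  # hypothetical file name
        pdf_bytes = handle.read()
    print("claimed:", claimed_xref_offset(pdf_bytes))
    print("actual: ", actual_xref_offset(pdf_bytes))
```

If the two numbers differ, a reader that trusts the startxref value will look in the wrong place and report an error like the one in the question.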

The first offsets in the table are correct, though. At first glance, they are correct up to the first image stream.
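One way to see where the offsets start going wrong is to walk the table and test whether each in-use entry really points at the matching "N G obj" header. The sketch below does that for a single classic cross reference section. The function name check_xref_entries is hypothetical, you have to pass in the real position of the xref keyword yourself, and it makes no attempt to handle cross reference streams or incremental updates.

```python
import re


def check_xref_entries(data: bytes, xref_pos: int):
    """Return (object_number, claimed_offset, ok) for each in-use entry.

    xref_pos must be the real position of the 'xref' keyword in data.
    ok is True when the bytes at claimed_offset start with 'N G obj'.
    """
    results = []
    obj_num = 0
    remaining = 0
    # Skip the first line, which is the 'xref' keyword itself.
    for raw_line in data[xref_pos:].splitlines()[1:]:
        line = raw_line.strip()
        if line.startswith(b"trailer"):
            break
        # Subsection header: first object number and entry count.
        header = re.fullmatch(rb"(\d+) (\d+)", line)
        if header:
            obj_num, remaining = int(header.group(1)), int(header.group(2))
            continue
        # Entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
        entry = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])", line)
        if entry and remaining > 0:
            offset, gen = int(entry.group(1)), int(entry.group(2))
            if entry.group(3) == b"n":
                expected = b"%d %d obj" % (obj_num, gen)
                ok = data[offset:offset + len(expected)] == expected
                results.append((obj_num, offset, ok))
            obj_num += 1
            remaining -= 1
    return results
```

On a file like the one described here, you would expect the first few entries to come back with ok set to True and the later ones with False.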

It appears as if the software producing the PDF did not count image stream contents when determining object offsets. Or it took a template with very small placeholder images and replaced the image streams of these small images with much larger streams without updating the cross reference offsets.

mkl answered Jan 25 '23 11:01