
Xref table not zero-indexed. ID numbers for objects will be corrected. Script won't continue

I am trying to open a PDF to get its number of pages. I am using PyPDF2.

Here is my code:

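import PyPDF2

def pdfPageReader(file_name):
    try:
        reader = PyPDF2.PdfReader(file_name, strict=True)
        number_of_pages = len(reader.pages)
        print(f"{file_name} = {number_of_pages}")
        return number_of_pages
    # A bare except also catches KeyboardInterrupt, which is why
    # pressing Ctrl+C skips the current file instead of stopping the batch.
    except:
        # Note: the fallback is the string "1", while the success path returns an int.
        return "1"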

But then I run into this warning:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

I tried both strict=True and strict=False. With strict=True, it displays this message and then nothing else happens; I waited for 30 minutes. With strict=False, it displays nothing at all and simply hangs. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels opening that file and continues with the next one (I run this over a batch of PDF files). Only one file in the batch has this problem.

My questions are: how do I fix this, how do I skip this file, or how can I cancel it and move on to the other PDF files?

JBin asked Apr 20 '18 10:04


2 Answers

If somebody has a similar problem and it even crashes the program with this error message:

File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1604, in getObject
    % (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.

It helped me to pass strict=False to my PDF reader:

pdf_reader = PdfFileReader(input_file, strict=False)
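
For a batch run like the one in the question, the same idea can be wrapped so that one unreadable file is reported and skipped rather than stopping the whole batch. This is only a minimal sketch, not part of the original answer: it assumes a PyPDF2 version that provides `PdfReader` (as used in the question), and the function name `count_pages` and its `open_reader` parameter are made up for illustration. It will not help with a file that makes the parser hang, since no exception is raised in that case.

```python
def count_pages(path, open_reader=None):
    """Return the page count of a PDF, or None if it cannot be read.

    open_reader is a hypothetical hook (any callable taking a path and
    returning an object with a .pages sequence); by default it opens the
    file with PyPDF2 in non-strict mode.
    """
    if open_reader is None:
        # Imported lazily so the helper can be exercised without PyPDF2.
        from PyPDF2 import PdfReader

        def open_reader(p):
            return PdfReader(p, strict=False)

    try:
        return len(open_reader(path).pages)
    except Exception as exc:
        # Unlike a bare except, this lets Ctrl+C (KeyboardInterrupt) stop the batch.
        print(f"skipping {path}: {exc}")
        return None
```

Returning `None` on failure keeps a failed file distinguishable from a genuine one-page PDF.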

DovaX answered Nov 16 '22 02:11


For anybody else who runs into this problem and finds that strict=False doesn't help: I was able to solve it by re-saving a new copy of the file in Adobe Acrobat Reader. I opened the PDF file in an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again on the newly saved copy of my PDF file.

Apparently, the PDF file I was using, which was generated directly by my scanner, was somehow corrupt, even though I could open and view it just fine in Reader. Re-saving a copy of the file in Acrobat Reader seemed to correct whatever was wrong.
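
When re-saving each problem file by hand is impractical, as in the batch run from the question, another option is to count pages in a separate Python process with a time limit, so that a file that makes the parser hang is abandoned automatically. The following is a rough sketch under stated assumptions rather than a tested fix: it assumes PyPDF2 with `PdfReader` is installed for the interpreter running the script, and `count_pages_with_timeout`, `timeout_s`, and `child_code` are names invented for this example.

```python
import subprocess
import sys

# Code run in the child process: prints the page count of the PDF in argv[1].
_CHILD = (
    "import sys; from PyPDF2 import PdfReader; "
    "print(len(PdfReader(sys.argv[1], strict=False).pages))"
)


def count_pages_with_timeout(path, timeout_s=30, child_code=_CHILD):
    """Return the page count, or None if the child fails or exceeds timeout_s."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return None
    if result.returncode != 0:
        return None  # the child raised, e.g. a PdfReadError
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None  # unexpected output on stdout
```

Starting one interpreter per file is slow, so it may make sense to use this only as a fallback for files that the in-process reader cannot handle.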

Bill M. answered Nov 16 '22 04:11