Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PyPDF2 write doesn't work on some PDF files (Python 3.5.1)

First of all I am using Python 3.5.1 (32 bit version) I wrote the following program to add a pagenumber on all pages of my pdf files using PyPDF2 and reportlab:

#import modules
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
#initial values of variable declarations
PDFlist=[]
X_value=460
Y_value=820
#Make a list of al files in de directory
filelist = listdir()
#Make a list of all pdf files in the directory
for i in range(0,len(filelist)):
    filename=filelist[i]
    for j in range(0,len(filename)):
        char=filename[j]
        if char=='.':
            extension=filename[j+1:j+4]
            if extension=='pdf':
                PDFlist.append(filename)
        j=j+1
    i=i+1
# Give the horizontal position for the page number (Enter = use default value of 480)
User = input('Give horizontal position page number (ENTER = default 460): ')
if User != "":
    X_value=int(User)
# Give the vertical position for the page number (Enter = use default value of 820)
User = input('Give horizontal position page number (ENTER = default 820): ')
if User != "":
    Y_value=int(User)

for i in range(0,len(PDFlist)):
    filename=PDFlist[i]

    # read the PDF
    existing_pdf = PdfFileReader(open(filename, "rb"))
    print("File: "+filename)
    # count the number of pages
    number_of_pages = existing_pdf.getNumPages()
    print("Number of pages detected:"+str(number_of_pages))
    output = PdfFileWriter()

    for k in range(0,number_of_pages):
        packet = io.BytesIO()

        # create a new PDF with Reportlab
        can = canvas.Canvas(packet, pagesize=A4)
        Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
        # we first make a white rectangle to cover any existing text in the pdf
        can.setFillColorRGB(1,1,1)
        can.setStrokeColorRGB(1,1,1)
        can.rect(X_value-10,Y_value-5,120,20,fill=1)
        # set the font and size
        can.setFont("Helvetica",14)
        # choose color of page numbers (red)
        can.setFillColorRGB(1,0,0)
        can.drawString(X_value, Y_value, Pagenumber)
        can.save()
        print(Pagenumber)

        #move to the beginning of the StringIO buffer
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(k)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        k=k+1
    # finally, write "output" to a real file

    ResultPDF="Output/"+filename
    outputStream = open(ResultPDF, "wb")
    output.write(outputStream)
    outputStream.close()
    i=i+1

This program works fine for quite a number of PDF files (albeit that warnings are sometimes generated like 'PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]' but the resulting output file is okay to me). However, the program just doesn't work on some PDF files although these files are perfectly readable and editable with my Adobe Acrobat. I have the impression the error pops up mostly on PDF files that were scanned but not on all of them (I also numbered scanned PDF files that didn't generate any error). I am getting the following error message (the first 8 lines are the result of my own print commands):

File: Scanned file.pdf
Number of pages detected:6
 Page 1/6
 Page 2/6
 Page 3/6
 Page 4/6
 Page 5/6
 Page 6/6
PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
    output.write(outputStream)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
    raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.

Apparently the pages are merged with the PDF created by reportlab (see lines up to page 6/6) but in the end no output PDF file can be generated by PyPDF2 (I get an unreadible output file of 0 bytes). Can somebody shed some light on how to resolve this? I searched the internet but couldn't really find an answer.

like image 700
Max Eisert Avatar asked Aug 31 '17 09:08

Max Eisert


2 Answers

Using "strict = false" got things working for me.

from PyPDF2 import PdfFileMerger

pdfs = [r'file 1.pdf', r'file 2.pdf']

merger = PdfFileMerger(strict=False)

for pdf in pdfs:
    merger.append(pdf)

merger.write(r"thanks mate.pdf")
like image 57
Ninga Avatar answered Sep 21 '22 16:09

Ninga


On pdf.py do the following changes:

on line 1633 of pdf. py (which means uncommenting the if self.strict)

    if self.strict:
        raise utils.PdfReadError("Could not find object.")

and on line 501 on pdf.py make the following changes (adding a try, except block)

    try:
        obj.writeToStream(stream, key)
        stream.write(b_("\nendobj\n"))
    except:
        pass

Cheers.

like image 27
bmg Avatar answered Sep 20 '22 16:09

bmg