
PyPDF2 returning blank PDF after copy

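import os

import PyPDF2
import send2trash

# ENCRYPTION_TAG (the replacement for ".pdf" in saved file names) is assumed
# to be defined elsewhere in the script.

def EncryptPDFFiles(password, directory):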
    pdfFiles = []
    success = 0

    # Get all PDF files from a directory
    for folderName, subFolders, fileNames in os.walk(directory):
        for fileName in fileNames:
            if (fileName.endswith(".pdf")):
                pdfFiles.append(os.path.join(folderName, fileName))
    print("%s PDF documents found." % str(len(pdfFiles)))

    # Create an encrypted version for each document
    for pdf in pdfFiles:
        # Copy old PDF into a new PDF object
        pdfFile = open(pdf,"rb")
        pdfReader = PyPDF2.PdfFileReader(pdfFile)
        pdfWriter = PyPDF2.PdfFileWriter()
        for pageNum in range(pdfReader.numPages):
            pdfWriter.addPage(pdfReader.getPage(pageNum))
        pdfFile.close()

        # Encrypt the new PDF and save it
        saveName = pdf.replace(".pdf",ENCRYPTION_TAG)
        pdfWriter.encrypt(password)
        newFile = open(saveName, "wb")
        pdfWriter.write(newFile)
        newFile.close()
        print("%s saved to: %s" % (pdf, saveName))


        # Verify that the new PDF was encrypted properly
        encryptedPdfFile = open(saveName,"rb")
        encryptedPdfReader = PyPDF2.PdfFileReader(encryptedPdfFile)
        canDecrypt = encryptedPdfReader.decrypt(password)
        encryptedPdfFile.close()
        if (canDecrypt):
            print("%s successfully encrypted." % (pdf))
            send2trash.send2trash(pdf)
            success += 1

    print("%s of %s successfully encrypted." % (str(success),str(len(pdfFiles))))

I am following along with a section of Automate the Boring Stuff with Python. I've had on-and-off issues when copying a PDF document, but right now, every time I run the program, the copied PDF is all blank pages. The newly encrypted PDF has the correct number of pages, but they are all blank (no content on the pages). This has happened before, but I was not able to reproduce it. I've tried adding a sleep before closing my files. I'm not sure what the best practice is for opening and closing files in Python. For reference, I'm using Python 3.

stryker14 asked Jun 05 '17 18:06



2 Answers

Try moving pdfFile.close() to the very end of your for loop.

for pdf in pdfFiles:
    #
    # {stuff}
    #
    if (canDecrypt):
        print("%s successfully encrypted." % (pdf))
        send2trash.send2trash(pdf)
        success += 1

    pdfFile.close()

The idea is that pdfFile must still be open when pdfWriter finally writes its output; otherwise the writer cannot access the source pages it needs to build the new file.

James C. Taylor answered Oct 09 '22 10:10


If you get blank pages even after adding a page to your PDF with writer.addPage(your_page_name), the cause is the context manager closing the source file too early. You have to make sure that the PDF you are reading pages from is still open when you write the output.

For example:

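from PyPDF2 import PdfFileReader, PdfFileWriter

with open(str(_pdf), "rb") as in_f:
    reader = PdfFileReader(in_f)
    _page = reader.getPage(0)
    writer = PdfFileWriter()
    writer.addPage(_page)

# in_f has already been closed by the time the writer runs
with open(_filename, "wb+") as out_f:
    writer.write(out_f)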

This will NOT work, because the context manager closes the input file handle before writer.write() runs. The input file has to stay open while the output is written, so the second with block must be indented inside the first, like this:

with open(str(_pdf), "rb") as in_f:
    reader = PdfFileReader(in_f)
    _page = reader.getPage(0)
    writer = PdfFileWriter()
    writer.addPage(_page)

    with open(_filename, "wb+") as out_f:
        writer.write(out_f)

I know it's not a big deal, but this made me pull my hair out; this one indentation mistake cost me 6 hours. That's why I thought I should write an answer for others.

Mujeeb Ishaque answered Oct 09 '22 09:10