
"EOF marker not found" when merging PDF files with PyPDF2 in Python

Tags:

python

pdf

pypdf2

When I run the following code:

from PyPDF2 import PdfFileMerger

merge = PdfFileMerger()

for newFile in nlst:
    merge.append(newFile)
merge.write('newFile.pdf')

I get the following error:

raise utils.PdfReadError("EOF marker not found")

PyPDF2.utils.PdfReadError: EOF marker not found

Could anybody tell me what happened? Thanks.

DBDBDDB asked Jul 29 '17 14:07


1 Answer

After encountering this problem using camelot and PyPDF2, I did some digging and have solved the problem.

The end-of-file marker '%%EOF' is meant to be the very last line, but some PDF files have a huge chunk of JavaScript appended after this line, so the reader cannot find the EOF marker.

Here is what the EOF marker plus the JavaScript looks like if you open the file in binary mode and look at its last lines:

 b'>>\r\n',
 b'startxref\r\n',
 b'275824\r\n',
 b'%%EOF\r\n',
 b'\n',
 b'\n',
 b'<script type="text/javascript">\n',
 b'\twindow.parent.focus();\n',
 b'</script><!DOCTYPE html>\n',
 b'\n',
 b'\n',
 b'\n',
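To check whether one of your own files has the same problem, something like the following could work. This is only a minimal sketch: the names `trailing_bytes_after_eof` and `has_junk_after_eof` are made up for illustration, and it assumes the appended data does not itself contain the text '%%EOF'.

```python
EOF_MARKER = b'%%EOF'

def trailing_bytes_after_eof(data: bytes) -> bytes:
    """Return whatever follows the last %%EOF marker, with whitespace stripped."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError('no %%EOF marker found')
    return data[pos + len(EOF_MARKER):].strip()

def has_junk_after_eof(path: str) -> bool:
    """True if the file at `path` has non-whitespace data after its last %%EOF."""
    with open(path, 'rb') as f:
        return bool(trailing_bytes_after_eof(f.read()))
```

Calling `has_junk_after_eof(newFile)` on each entry of the list you pass to the merger should point out which input is the offending one.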

So you just need to truncate the file right after the '%%EOF' line, before the JavaScript begins.

Solution:

import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # find the line position of the EOF, searching backwards from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break
    else:
        # no line contained the marker, so there is nothing to truncate to
        raise ValueError('no %%EOF marker found in the stream')

    # return the list up to and including the EOF line
    return pdf_stream_in[:actual_line]

# opens the file for reading (binary mode, so each line is a bytes object)
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
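
The same truncation can also be done on the raw bytes without splitting the file into lines. This is a sketch of a variant, not part of the solution above: `truncate_after_eof` and `fix_pdf_file` are hypothetical names, and it assumes the last occurrence of '%%EOF' in the file is the real marker (it will be fooled if the appended junk itself contains that text).

```python
EOF_MARKER = b'%%EOF'

def truncate_after_eof(data: bytes) -> bytes:
    """Cut everything after the last %%EOF marker and end with one newline."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError('no %%EOF marker found')
    return data[:pos + len(EOF_MARKER)] + b'\n'

def fix_pdf_file(src_path: str, dst_path: str) -> None:
    """Write a truncated copy of src_path to dst_path; the source is left untouched."""
    with open(src_path, 'rb') as src:
        fixed = truncate_after_eof(src.read())
    with open(dst_path, 'wb') as dst:
        dst.write(fixed)
```

For example, `fix_pdf_file('data/XXX.pdf', 'data/XXX_fixed.pdf')` would produce the same kind of fixed copy as above, which you can then hand to PyPDF2.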
glycoaddict answered Sep 20 '22 13:09