When I use the following code
from PyPDF2 import PdfFileMerger
merge = PdfFileMerger()
for newFile in nlst:
merge.append(newFile)
merge.write('newFile.pdf')
Something happened as following:
raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found
Anybody could tell me what happened? Thanks
The end of file marker '%%EOF' is meant to be the very last line, but some PDF files put a huge chunk of javascript after this line, and the reader cannot find the EOF.
%%EOF. The last line of the PDF document contains the end of the “%%EOF” file string.
Though PyPDF2 doesn't contain any specific method to read remote files, you can use Python's urllib. request module to first read the remote file in bytes and then pass the file in the bytes format to PdfFileReader() method. The rest of the process is similar to reading a local PDF file.
PyPDF2 is a Python library for working with PDF documents. It can be used to parse PDFs, modify them, and create new PDFs. PyPDF2 can be used to extract some text and metadata from a PDF.
After encountering this problem using camelot
and PyPDF2
, I did some digging and have solved the problem.
The end of file marker '%%EOF'
is meant to be the very last line, but some PDF files put a huge chunk of javascript after this line, and the reader cannot find the EOF.
Illustration of what the EOF plus javascript looks like if you open it:
b'>>\r\n',
b'startxref\r\n',
b'275824\r\n',
b'%%EOF\r\n',
b'\n',
b'\n',
b'<script type="text/javascript">\n',
b'\twindow.parent.focus();\n',
b'</script><!DOCTYPE html>\n',
b'\n',
b'\n',
b'\n',
So you just need to truncate the file before the javascript begins.
Solution:
def reset_eof_of_pdf_return_stream(pdf_stream_in:list):
# find the line position of the EOF
for i, x in enumerate(txt[::-1]):
if b'%%EOF' in x:
actual_line = len(pdf_stream_in)-i
print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
break
# return the list up to that point
return pdf_stream_in[:actual_line]
# opens the file for reading
with open('data/XXX.pdf', 'rb') as p:
txt = (p.readlines())
# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)
# write to new pdf
with open('data/XXX_fixed.pdf', 'wb' as f:
f.writelines(txtx)
fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With