"EOF marker not found" when merging PDF files with PyPDF2 in Python

Tags: python, pdf, pypdf2

When I use the following code

from PyPDF2 import PdfFileMerger

merge = PdfFileMerger()

# nlst holds the PDF files to merge
for newFile in nlst:
    merge.append(newFile)
merge.write('newFile.pdf')

I get the following error:

raise utils.PdfReadError("EOF marker not found")

PyPDF2.utils.PdfReadError: EOF marker not found

Can anybody tell me what went wrong? Thanks

asked Jul 29 '17 14:07

DBDBDDB

1 Answer

After encountering this problem using camelot and PyPDF2, I did some digging and have solved the problem.

The end-of-file marker '%%EOF' is meant to be the very last line of a PDF, but some PDF files have a large chunk of JavaScript/HTML appended after this line, so the reader cannot find the EOF marker where it expects it.

Here is what the EOF marker followed by the JavaScript looks like when the file is read in binary mode as a list of lines:

 b'>>\r\n',
 b'startxref\r\n',
 b'275824\r\n',
 b'%%EOF\r\n',
 b'\n',
 b'\n',
 b'<script type="text/javascript">\n',
 b'\twindow.parent.focus();\n',
 b'</script><!DOCTYPE html>\n',
 b'\n',
 b'\n',
 b'\n',

So you just need to truncate the file right after the '%%EOF' line, before the JavaScript begins.
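To confirm that a particular file is affected before changing anything, you can look at what follows the last '%%EOF' marker. This is only a minimal sketch using the standard library; the helper name `trailing_bytes_after_eof` and the sample bytes are made up for illustration and are not part of PyPDF2.

```python
def trailing_bytes_after_eof(data: bytes) -> bytes:
    """Return whatever follows the last %%EOF marker, with whitespace stripped."""
    marker = b"%%EOF"
    pos = data.rfind(marker)  # position of the last marker, or -1 if absent
    if pos == -1:
        raise ValueError("no %%EOF marker found in the data")
    return data[pos + len(marker):].strip()


# In-memory sample that mimics the tail of an affected file (hypothetical data)
sample = (
    b"startxref\r\n275824\r\n%%EOF\r\n\n\n"
    b"<script type=\"text/javascript\">\n</script>\n"
)
extra = trailing_bytes_after_eof(sample)
if extra:
    print(f"{len(extra)} unexpected bytes after %%EOF")
```

For a real file, read it with `open(path, 'rb').read()` and pass the bytes in. An empty result means nothing but whitespace follows the marker, so the error probably has a different cause.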

Solution:

import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # default: keep every line if no EOF marker is found
    actual_line = len(pdf_stream_in)
    # find the line position of the last EOF marker, scanning from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break

    # return the list up to and including the EOF line
    return pdf_stream_in[:actual_line]

# open the file for reading in binary mode
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

# PdfFileReader is the pre-3.0 PyPDF2 name; PyPDF2 3.x renamed it to PdfReader
fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
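If you would rather not write a second file to disk, the same truncation can be done on the raw bytes and kept in memory. The following is a sketch under that assumption, not a tested drop-in: `truncate_after_last_eof` and `cleaned_pdf_stream` are hypothetical helper names, and it assumes the appended junk does not itself contain the text '%%EOF'.

```python
import io

EOF_MARKER = b"%%EOF"


def truncate_after_last_eof(data: bytes) -> bytes:
    """Drop everything after the last %%EOF marker and end with one newline."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError("no %%EOF marker found; the file may not be a PDF")
    return data[:pos + len(EOF_MARKER)] + b"\n"


def cleaned_pdf_stream(path: str) -> io.BytesIO:
    """Read a PDF from disk and return an in-memory, truncated copy."""
    with open(path, "rb") as f:
        return io.BytesIO(truncate_after_last_eof(f.read()))
```

Because `PdfFileMerger.append` accepts file-like objects as well as paths, the loop from the question could then call `merge.append(cleaned_pdf_stream(newFile))` instead of `merge.append(newFile)`.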

answered Sep 20 '22 13:09

glycoaddict
