
Are byte order marks allowed in a PDF document?

I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:

25 50 44 46 2D

This works fine, except that I'm seeing a few files that start with a byte order mark instead:

EF BB BF 25 50 44 46 2D
^------^

I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.

Yet I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.

So: are BOM markers allowed at the start of PDF documents? (I'm NOT talking about string objects here, but about the PDF file itself.)
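The check described in this question can be sketched as follows. This is only an illustration, not the actual filter: the function name `classify_header` and its return labels are hypothetical, and the only facts taken from the question are the two byte sequences.

```python
# Byte sequences quoted in the question.
PDF_MAGIC = bytes.fromhex("25 50 44 46 2D")  # b"%PDF-"
UTF8_BOM = bytes.fromhex("EF BB BF")         # UTF-8 byte order mark


def classify_header(data: bytes) -> str:
    """Classify the leading bytes of a file (hypothetical helper).

    Returns "pdf" for a header at offset 0, "bom+pdf" when a UTF-8 BOM
    precedes the header, and "other" for everything else.
    """
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(UTF8_BOM + PDF_MAGIC):
        return "bom+pdf"
    return "other"
```

Reporting the BOM case separately lets the filter decide whether to reject such files or merely flag them.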

Stephane asked Oct 15 '15 15:10




2 Answers

So: are BOM markers allowed at the start of PDF documents?

No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.

But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.

Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:

3.4.1, “File Header”

  1. Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

  2. Acrobat viewers also accept a header of the form

    %!PS-Adobe-N.n PDF-M.m
    

...

3.4.4, “File Trailer”

  1. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

And because people tend to think that a PDF which Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
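A filter that wants to be as tolerant as the viewer behavior quoted above could use a check along these lines. This is a minimal sketch, assuming the whole file (or at least its first and last 1024 bytes) is available as a `bytes` object; the function names are hypothetical, and it does not cover the alternative `%!PS-Adobe-N.n PDF-M.m` header form.

```python
WINDOW = 1024  # search window taken from the implementation notes quoted above


def find_pdf_header(data: bytes) -> int:
    """Return the offset of b"%PDF-" within the first 1024 bytes, or -1."""
    return data.find(b"%PDF-", 0, WINDOW)


def has_eof_marker(data: bytes) -> bool:
    """Return True if b"%%EOF" appears within the last 1024 bytes."""
    return b"%%EOF" in data[-WINDOW:]


def looks_like_pdf_lenient(data: bytes) -> bool:
    """Lenient check: header near the start and %%EOF near the end."""
    return find_pdf_header(data) >= 0 and has_eof_marker(data)
```

A non-zero offset from `find_pdf_header` indicates leading bytes that the specification does not allow, so a filter could accept the file and still log that it is malformed.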

mkl answered Oct 29 '22 08:10


No, a BOM is not valid at the front of a PDF file.

A PDF is a binary file format, so a BOM wouldn't actually make sense; it would be like having a BOM at the front of a ZIP file or a JPEG.

I'm guessing the PDFs that you are consuming are coming from misconfigured applications that either already have something at the front of their output buffer or, more likely, were created with the incorrect assumption that a PDF is a text-based format.
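To illustrate the second guess, here is one hypothetical way a BOM could end up in front of the header: the producer treats the PDF as text and writes it with a BOM-emitting text encoding. The sketch uses Python's `utf-8-sig` codec purely as an example; it is not a claim about which application produced the files in the question.

```python
import codecs

header = "%PDF-1.4\n"

# Treating the PDF as text and encoding it with a BOM-emitting codec
# prepends EF BB BF to the output.
as_text = header.encode("utf-8-sig")

# Writing the same header as raw bytes leaves it at offset 0.
as_bytes = header.encode("ascii")

# bytes.hex() with a separator requires Python 3.8 or later.
print(as_text[:8].hex(" ").upper())   # EF BB BF 25 50 44 46 2D
print(as_bytes[:5].hex(" ").upper())  # 25 50 44 46 2D
print(as_text.startswith(codecs.BOM_UTF8))  # True
```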

Chris Haas answered Oct 29 '22 09:10