I'm having an issue with a filter program I wrote. It detects if a file is a PDF document by reading the first 5 bytes of the file and comparing it to a fixed buffer :
25 50 44 46 2D
This works fine except that I'm seeing a few files that starts with a byte order mark instead:
EF BB BF 25 50 44 46 2D
^-------^
I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":
The first line of a PDF file shall be a header consisting of the 5 characters %PDF– followed by a version number of the form 1.N, where N is a digit between 0 and 7
Yet, I see these documents in the wild and the users gets confused because PDF reader programs can open these documents by my filter reject them.
So: are BOM markers allowed at the start of PDF documents ? (I'm NOT talking about string objects here but the PDF file itself)
So: are BOM markers allowed at the start of PDF documents ?
No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.
But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.
Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:
3.4.1, “File Header”
Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.
Acrobat viewers also accept a header of the form
%!PS−Adobe−N.n PDF−M.m
...
3.4.4, “File Trailer”
- Acrobat viewers require only that the
%%EOF
marker appear somewhere within the last 1024 bytes of the file.
And people have a tendency to think that a PDF that Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
No, a BOM
is not valid at the front a PDF
file.
A PDF is a binary file format so a BOM wouldn't actually make sense, it would be like having a BOM at the front of a ZIP file or a JPEG.
I'm guessing the PDFs that you are consuming are coming from misconfigured applications that either have something already at the front of their output buffer already or, more likely, are created with the incorrect assumption that a PDF is a text-based format.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With