Is there any way of checking if a byte[] is a pdf without opening?
I have some code to display a list of byte[] as PDF thumbnails. Previously I knew every byte[] was a PDF because we filtered the servlet to return only those. Now the requirement has changed and I need to bring all file types back. Is there any way of checking what the byte[] is, or more specifically, determining whether it is or isn't a PDF?
Check the first 4 bytes of the array. If those are 0x25 0x50 0x44 0x46 then it's most probably a PDF file.
Try libmagic (the "file" command in the shell uses it). It performs the same check of the leading bytes described above.
Alternatively, take a PDF library and try to read the page count from the data. If the library is able to read a page count, it should be a valid PDF.
The first four bytes should be 0x25 0x50 0x44 0x46 (in hex; in ASCII that is %PDF). "Magic numbers" for other formats can be found here.
As far as I know, all PDFs start with %PDF, so you could check the first bytes against this string.
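The first-bytes check suggested in the answers above could be sketched in Java roughly as follows. The class and method names (PdfMagic, startsWithPdfMagic) are made up for illustration and are not from any library. A match only means the data is probably a PDF, not that it is a well-formed one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class PdfMagic {
    // "%PDF" in ASCII is 0x25 0x50 0x44 0x46.
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    private PdfMagic() {}

    /** Returns true if data begins with the four bytes 0x25 0x50 0x44 0x46. */
    static boolean startsWithPdfMagic(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false; // too short to hold the signature
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the thumbnail code this could be used as a cheap filter, for example skipping any byte[] for which PdfMagic.startsWithPdfMagic(bytes) returns false before handing it to the PDF renderer.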
While the marked answer and the other answers are correct, they will not succeed 100% of the time. The problem is that the PDF spec says the %PDF-1.x header only needs to appear within the first 1024 bytes, not necessarily in the first 4. Some programs add information before %PDF and the file is still valid.
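A more lenient variant of the check, following the 1024-byte rule described above, might look like the sketch below. The names (PdfHeaderScan, hasPdfHeader) are hypothetical, and the code assumes the whole "%PDF-" marker must lie inside the first 1024 bytes, which is one reading of that rule.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class PdfHeaderScan {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption: the header must fit entirely within this many leading bytes.
    private static final int SEARCH_LIMIT = 1024;

    private PdfHeaderScan() {}

    /** Returns true if "%PDF-" occurs within the first SEARCH_LIMIT bytes of data. */
    static boolean hasPdfHeader(byte[] data) {
        if (data == null) {
            return false;
        }
        // Last index at which the header could start and still fit in the window.
        int lastStart = Math.min(data.length, SEARCH_LIMIT) - HEADER.length;
        for (int start = 0; start <= lastStart; start++) {
            if (matchesAt(data, start)) {
                return true;
            }
        }
        return false;
    }

    // Compares HEADER against data beginning at the given offset.
    private static boolean matchesAt(byte[] data, int start) {
        for (int i = 0; i < HEADER.length; i++) {
            if (data[start + i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is that scanning a window accepts files with leading junk but is also more likely to produce a false positive on non-PDF data that happens to contain the marker, so a stricter parse with a PDF library may still be needed afterwards.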
I would recommend seeing the answer for the following Stack Overflow question: How to detect if a file is PDF or TIFF?