How can I determine if a file is a PDF file?

Here is what I use in my NUnit tests, which must validate multiple versions of PDFs generated by Crystal Reports:

public static void CheckIsPDF(byte[] data)
    {
        Assert.IsNotNull(data);
        Assert.Greater(data.Length,7); // bytes 0..7 are read below

        // header 
        Assert.AreEqual(data[0],0x25); // %
        Assert.AreEqual(data[1],0x50); // P
        Assert.AreEqual(data[2],0x44); // D
        Assert.AreEqual(data[3],0x46); // F
        Assert.AreEqual(data[4],0x2D); // -

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
        {                  
            // file terminator
            Assert.AreEqual(data[data.Length-7],0x25); // %
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x45); // E
            Assert.AreEqual(data[data.Length-4],0x4F); // O
            Assert.AreEqual(data[data.Length-3],0x46); // F
            Assert.AreEqual(data[data.Length-2],0x20); // SPACE
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x34) // version is 1.4 ?
        {
            // file terminator
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x25); // %
            Assert.AreEqual(data[data.Length-4],0x45); // E
            Assert.AreEqual(data[data.Length-3],0x4F); // O
            Assert.AreEqual(data[data.Length-2],0x46); // F
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        Assert.Fail("Unsupported file format");
    }

You can detect the MIME type of a file (or byte array) so that you don't blindly rely on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also saw a library dedicated to this (http://sourceforge.net/projects/mime-util).

I use Aperture to extract text from a variety of files, not only PDFs, but I had to tweak things for PDFs (Aperture uses PDFBox, and I added another library as a fallback for when PDFBox fails).

Since you use PDFBox you can simply do:

PDDocument.load(file);

It'll fail with an Exception if the PDF is corrupted etc.

If it succeeds you can also check if the PDF is encrypted using .isEncrypted()

Here is an adapted Java version of NinjaCross's code.

/**
 * Test if the data in the given byte array represents a PDF file.
 */
public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 7 && // bytes 0..7 are read below
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // version 1.3 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 && // SPACE
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }

        // version 1.4 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 && // F
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
    }
    return false;
}

And some simple unit tests:

@Test
public void test_valid_pdf_1_3_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}

@Test
public void test_valid_pdf_1_4_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}

@Test
public void test_invalid_data_is_not_pdf() {
    assertFalse(is_pdf("Hello World".getBytes()));
}

If you come up with any failing unit tests, please let me know.
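
The checks above hard-code two PDF versions and exact trailing bytes, so a file with, say, a 1.7 header or a CRLF line ending is rejected. Below is a minimal, version-independent sketch. The names PdfSniffer and looksLikePdf and the 1024-byte search window are illustrative assumptions, not part of the answers here; the window mirrors the tolerance commonly attributed to Acrobat-style readers. It only looks for the "%PDF-" and "%%EOF" markers and does not prove that the file actually parses.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: version-independent PDF marker check (a heuristic, not a parser). */
final class PdfSniffer {
    // Assumed search window at each end of the data; adjust to taste.
    private static final int WINDOW = 1024;
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TRAILER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /** True if "%PDF-" occurs near the start and "%%EOF" occurs after it, near the end. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + TRAILER.length) {
            return false;
        }
        int headerAt = indexOf(data, HEADER, 0, Math.min(data.length, WINDOW));
        if (headerAt < 0) {
            return false;
        }
        // The trailer must come after the header and inside the final window.
        int tailStart = Math.max(headerAt + HEADER.length, data.length - WINDOW);
        return indexOf(data, TRAILER, tailStart, data.length) >= 0;
    }

    /** Naive search for needle within haystack[from, to); returns -1 if absent. */
    private static int indexOf(byte[] haystack, byte[] needle, int from, int to) {
        for (int i = from; i + needle.length <= to; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling PdfSniffer.looksLikePdf(bytes) accepts both sample strings used in the unit tests above and rejects "Hello World". A deliberately corrupted file can still pass, so treat this as a cheap first filter before a real parse.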

I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.

Eventually, after tinkering around with different methods in the API, I tried this:

PDDocument.load(file).getPage(0).getContents().toString();

This did not throw an exception, but it did output this:

 WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015

Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was using already handled such errors in its own way.

To get around this, I decided to try parsing the files using the class that emitted the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

I then implemented the following:

// RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile, not java.io.RandomAccessFile
RandomAccessFile accessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false);
parser.parse();

This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!

PDF files begin with "%PDF" (open one in TextPad or similar and take a look).

Any reason you can't just read the first few bytes of the file (for example with a FileInputStream) and check for this?
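
The header check described above can be sketched with only the standard library. PdfMagic and startsWithPdfMagic are hypothetical names chosen for illustration. The sketch reads just the first five bytes, so it is cheap even for large files, but it says nothing about whether the rest of the file is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: checks only the leading "%PDF-" magic bytes of a file. */
final class PdfMagic {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfMagic() {}

    /** True if the file starts with the bytes "%PDF-", regardless of its extension. */
    static boolean startsWithPdfMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int filled = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // file is shorter than the magic number
                }
                filled += n;
            }
            return filled == MAGIC.length && Arrays.equals(head, MAGIC);
        }
    }
}
```

For example, PdfMagic.startsWithPdfMagic(Paths.get("report.dat")) returns true for a PDF saved under the wrong extension, which is the case an extension-based check would miss.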
