
How to find blank pages inside a PDF using PDFBox?

Tags: java, pdf

Here is the challenge I'm currently facing.
I have a lot of PDFs, and I need to remove the blank pages from them and keep only the pages with content (text or images).
The problem is that these PDFs are scanned documents, so the blank pages still have some dirt left behind by the scanner.

asked May 19 '14 13:05 by Shoyo


1 Answer

I did some research and ended up with this code, which treats a page as blank when at least 99% of its pixels are white or light gray. I needed the gray tolerance because scanned documents are sometimes not pure white.

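import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDPage;

// Returns true when at least 99% of the rendered page is white or light gray.
private static boolean isBlank(PDPage pdfPage) throws IOException {
    // convertToImage() is the PDFBox 1.8.x API; it was removed in PDFBox 2.x,
    // where PDFRenderer.renderImage(pageIndex) renders a page instead.
    BufferedImage bufferedImage = pdfPage.convertToImage();
    long count = 0;
    int height = bufferedImage.getHeight();
    int width = bufferedImage.getWidth();
    // Multiply as double so very large pages cannot overflow int.
    double areaFactor = (double) width * height * 0.99;

    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            Color c = new Color(bufferedImage.getRGB(x, y));
            // Count neutral pixels (R == G == B) that are white or light gray.
            if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue()
                    && c.getRed() >= 248) {
                count++;
            }
        }
    }

    return count >= areaFactor;
}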
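
In case it helps with tuning, here is a rough sketch of the same pixel-counting idea separated from PDFBox, so it can be tried on any BufferedImage. The names BlankPageCheck, lightFraction, minChannel and minLightFraction are made up for this example. Note that it is a variation and not the exact check above: it drops the R == G == B requirement and only asks that every channel is at least minChannel, on the assumption that some scans have a slight color tint.

```java
import java.awt.image.BufferedImage;

class BlankPageCheck {

    // Fraction (0.0 to 1.0) of pixels whose red, green and blue
    // channels are all at least minChannel.
    static double lightFraction(BufferedImage image, int minChannel) {
        int width = image.getWidth();
        int height = image.getHeight();
        long light = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r >= minChannel && g >= minChannel && b >= minChannel) {
                    light++;
                }
            }
        }
        return (double) light / ((long) width * height);
    }

    // A page counts as blank when enough of it is light.
    static boolean isBlank(BufferedImage image, int minChannel, double minLightFraction) {
        return lightFraction(image, minChannel) >= minLightFraction;
    }

    public static void main(String[] args) {
        // Synthetic "scan": a white 200x100 page...
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++) {
            for (int x = 0; x < 200; x++) {
                page.setRGB(x, y, 0xFFFFFF);
            }
        }
        // ...with a 10x10 dark speck (100 of 20000 pixels, i.e. 0.5%).
        for (int y = 0; y < 10; y++) {
            for (int x = 0; x < 10; x++) {
                page.setRGB(x, y, 0x202020);
            }
        }
        System.out.println(isBlank(page, 248, 0.99)); // prints "true"
    }
}
```

Both numbers (248 and 0.99) are just the values from the answer; dirtier scans may need a lower channel threshold or a lower fraction, so it is probably worth printing lightFraction for a few known blank and non-blank pages before settling on them.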
answered Sep 29 '22 17:09 by Shoyo