
How to find blank pages in a PDF file

I cannot reliably detect blank pages in a PDF file. I have searched the internet but could not find a good solution.

Using iTextSharp, I tried checking the page content size and the XObjects, but neither gives an exact result.

I tried:

if (xobjects == null || textcontent == null || size < 20 bytes)
    blank
else
    not blank

But most of the time it returns the wrong answer.

The code is below. I am using the iTextSharp library.

For XObjects:

// resourceDic is a PdfDictionary
PdfDictionary xobjects = resourceDic.GetAsDict(PdfName.XOBJECT);
// I know that if xobjects is null then the page is blank, but sometimes a blank page returns an xobjects dictionary that is not null.

For the content stream:

// reader = new PdfReader(filename);
RandomAccessFileOrArray f = reader.SafeFile;

byte[] contentBytes = reader.GetPageContent(pageNum, f);
// I measured the size of contentBytes, but sometimes it is more than 20 bytes for a blank page.

For the text content:

String extractedText = PdfTextExtractor.GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy());
// Sometimes a blank page returns text more than 20 characters long.
Md Kamruzzaman Sarker asked Jun 09 '12 15:06


2 Answers

A very simple way to discover empty pages is to use a Ghostscript command line that calls the bbox device.

Ghostscript's bbox device calculates the coordinates of the minimum rectangle (the 'bounding box') that encloses all points of the page where a pixel would be rendered:

gs \
  -o /dev/null \
  -sDEVICE=bbox \
  input.pdf

On Windows:

gswin32c.exe ^
  -o nul ^
  -sDEVICE=bbox ^
  input.pdf

Result:

GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 6.
Page 1
%%BoundingBox: 27 281 548 804
%%HiResBoundingBox: 27.000000 281.000000 547.332031 804.000000
Page 2
%%BoundingBox: 0 0 0 0
%%HiResBoundingBox: 0.000000 0.000000 0.000000 0.000000
Page 3
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 4
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 5
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 6
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000

As you can see, page 2 of my input document was empty: its bounding box is 0 0 0 0, meaning nothing would be rendered on it.
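
The same check can be automated by scanning the bbox output for pages whose bounding box is all zeros. The following is a minimal Python sketch, not part of the original answer. The function name blank_pages is hypothetical, and it assumes the output format shown above, with exactly one %%BoundingBox line per page in page order.

```python
import re

# Matches the integer bounding-box line printed once per page.
# The %%HiResBoundingBox lines do not match because of the ^ anchor.
BBOX_RE = re.compile(r"^%%BoundingBox:\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*$")


def blank_pages(bbox_output):
    """Return the 1-based numbers of pages whose bounding box is 0 0 0 0."""
    blanks = []
    page = 0
    for line in bbox_output.splitlines():
        match = BBOX_RE.match(line.strip())
        if match is None:
            continue
        # Assumption: one %%BoundingBox line per page, emitted in page order.
        page += 1
        if all(int(value) == 0 for value in match.groups()):
            blanks.append(page)
    return blanks
```

To use it, capture the output of the gs command as a string and pass it to the function. The bounding-box lines are typically written to stderr, so it is safest to capture both output streams.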

Kurt Pfeifle answered Oct 04 '22 23:10



I suspect you have already tried .Trim() on your strings, so I won't suggest that on its own.

What are the actual contents of the 20+ character strings on the blank pages? I suspect they are just newline characters (like what happens when people press Enter 10+ times to get to a new page rather than inserting a page break), in which case:

// Strip all line breaks, then trim any remaining surrounding whitespace.
String extractedText = PdfTextExtractor
    .GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy())
    .Replace(Environment.NewLine, "")
    .Replace("\n", "")
    .Trim();

Let us know what the output contains after this.

Another possibility is that the text consists of non-breaking spaces and other characters that look like spaces but are not, which you would have to find and replace individually. At that point I would instead suggest matching the text against a regex such as [0-9a-zA-Z] and using the result to decide whether the page is blank.
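
The regex idea can be sketched as follows. This is an illustrative Python version rather than the iTextSharp code from the question, and the function name text_looks_blank is hypothetical. It assumes the page text has already been extracted into a string, and that a page counts as blank when it contains no ASCII letter or digit.

```python
import re

# Any ASCII letter or digit counts as visible content.
VISIBLE_RE = re.compile(r"[0-9A-Za-z]")


def text_looks_blank(extracted_text):
    """Return True when the extracted text contains no ASCII letter or digit."""
    # Treat a missing (None) extraction result the same as an empty string.
    return VISIBLE_RE.search(extracted_text or "") is None
```

Because the pattern looks for visible characters instead of listing every kind of whitespace, newlines and non-breaking spaces need no special handling. The trade-off is that a page containing only punctuation or non-Latin text would also be reported as blank, so the character class may need widening for such documents.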

Seph answered Oct 04 '22 23:10
