
Determine whether a PDF page contains text or is purely picture

How can I determine, using Java, whether a PDF page contains text or is purely a picture?

I have searched many forums and websites, but I have not found an answer yet.

Is it possible to extract the text from a PDF in order to tell whether a page is stored as a picture or as text?

PdfReader reader = new PdfReader(INPUTFILE);
PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Here I want to test the structure of the page, if that is possible.
    out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
// Close the writer so the buffered output is flushed, then release the reader.
out.close();
reader.close();

Adriano_jvma asked May 15 '13 16:05


1 Answer

There is no watertight way to do what you want.

Text can appear in different ways inside a PDF file. For instance, one can draw all the glyphs as shapes using graphics operators instead of showing them with text operators; such a page looks like text to a reader, but a text extractor finds nothing on it. (I'm sorry if this sounds like Chinese to you, but I can assure you it's proper PDF language.)

If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.

In your code, you loop over all the pages, and you ask iText if there's any text on the page. That's already a good indication.

Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. MyTextRenderListener is an example of a custom implementation; it is used in the ParsingHelloWorld example.

There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you can be sure that there is an image on the page, and you can use the ImageRenderInfo object to obtain the position, width and height of the image (that is, if you know how to interpret the Matrix returned by the getImageCTM() method).
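
As a rough illustration of how that Matrix can be read: in PDF, an image is painted into a unit square that the current transformation matrix (the six numbers a b c d e f) maps onto the page. The sketch below is plain Java and does not call iText; ImagePlacement is a hypothetical helper name. In iText 5 the six values should be obtainable from the Matrix via get(Matrix.I11), get(Matrix.I12), get(Matrix.I21), get(Matrix.I22), get(Matrix.I31) and get(Matrix.I32), but treat that wiring as an assumption to verify against the version you use.

```java
// Hypothetical helper, not part of iText: interprets the six CTM numbers
// a b c d e f under which a PDF image (a unit square) is painted.
final class ImagePlacement {
    final double x;       // where the image's (0,0) corner lands on the page
    final double y;       // (the lower-left corner only if the image is not rotated or flipped)
    final double width;   // length of the image's horizontal edge on the page
    final double height;  // length of the image's vertical edge on the page
    final double area;    // page area covered by the image, in square user-space units

    ImagePlacement(double a, double b, double c, double d, double e, double f) {
        this.x = e;
        this.y = f;
        // The unit vector (1,0) maps to (a,b) and (0,1) maps to (c,d).
        this.width = Math.hypot(a, b);
        this.height = Math.hypot(c, d);
        // The determinant gives the area of the transformed unit square,
        // which stays correct even for rotated or skewed images.
        this.area = Math.abs(a * d - b * c);
    }

    /** Fraction of the page covered by the image, ignoring clipping (can exceed 1). */
    double coverage(double pageWidth, double pageHeight) {
        return area / (pageWidth * pageHeight);
    }

    public static void main(String[] args) {
        // An image stretched over a full A4 page (595 x 842 user-space units).
        ImagePlacement p = new ImagePlacement(595, 0, 0, 842, 0, 0);
        System.out.println(p.width + " x " + p.height + ", coverage " + p.coverage(595, 842));
    }
}
```

A single image whose coverage is close to 1 on a page with no extractable text is a typical sign of a scanned page, although this is only a heuristic and not something iText decides for you.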

Using all these elements, you can already get a long way to achieving what you need, but be aware that there will always be exotic PDFs that will escape all your checks.
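
To make the combination concrete, here is a minimal sketch of the decision step only, in plain Java. PageKind and PageClassifier are hypothetical names, not iText classes. It assumes you pass in the string returned by PdfTextExtractor.getTextFromPage(reader, i) and a count of how many times renderImage() fired for the page, which you would have to collect in your own RenderListener.

```java
// Hypothetical types for illustration; neither is part of iText.
enum PageKind { EMPTY, TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGES }

final class PageClassifier {
    private PageClassifier() {}

    /**
     * Ad hoc classification of one page.
     *
     * @param extractedText text extracted for the page, e.g. the result of
     *                      PdfTextExtractor.getTextFromPage(reader, i); may be null
     * @param imageCount    number of renderImage() callbacks seen for the page
     *                      (assumed to be counted by your own listener)
     */
    static PageKind classify(String extractedText, int imageCount) {
        // Whitespace-only output counts as "no text".
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        boolean hasImages = imageCount > 0;

        if (hasText && hasImages) return PageKind.TEXT_AND_IMAGES;
        if (hasText) return PageKind.TEXT_ONLY;
        if (hasImages) return PageKind.IMAGE_ONLY;
        // Neither text nor images: blank page, or glyphs drawn as vector shapes.
        return PageKind.EMPTY;
    }

    public static void main(String[] args) {
        System.out.println(classify("", 1));
    }
}
```

Note that EMPTY is exactly where the exotic cases end up: a page whose glyphs are drawn as shapes has neither extractable text nor images, so this check cannot tell it apart from a blank page.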

Bruno Lowagie answered Oct 13 '22 13:10