How to determine whether a PDF page contains text or is purely an image, using Java?
I have searched many forums and websites, but I have not found an answer yet.
Is it possible, when extracting text from a PDF, to find out whether a page consists of text or is just a picture?
PdfReader reader = new PdfReader(INPUTFILE);
PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Here I want to test the structure of the page (text or picture), if that is possible.
    out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
// Flush the extracted text to disk and release the PDF.
out.close();
reader.close();
There is no watertight way to do what you want.
Text can appear in different ways inside a PDF file. For instance, one can draw all the glyphs as shapes using graphics state operators instead of using text state, in which case no text can be extracted even though the page looks like text. (I'm sorry if this sounds like Chinese to you, but I can assure you it's proper PDF language.)
If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.
In your code, you loop over all the pages, and you ask iText if there's any text on the page. That's already a good indication.
Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. MyTextRenderListener is an example of a custom implementation; it is used in the ParsingHelloWorld example.
There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you're 100% sure that there's also an image on the page, and you can use the ImageRenderInfo object to obtain the position, width, and height of the image (that is, if you know how to interpret the Matrix returned by the getImageCTM() method).
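As a rough illustration of how such a matrix can be read: PDF draws every image into a 1 x 1 unit square and uses the transformation matrix [a b c d e f] to scale, rotate, and move that square onto the page. The sketch below is plain Java and does not call iText; the class name ImagePlacement and its coverage() helper are made up for this example. It assumes you have already copied the six values out of the Matrix yourself (in iText 5 that should be possible with get(Matrix.I11), get(Matrix.I12), get(Matrix.I21), get(Matrix.I22), get(Matrix.I31), and get(Matrix.I32), but check this against the version you use).

```java
/**
 * Hypothetical helper (not part of iText): placement of an image on a page,
 * derived from the six values [a b c d e f] of its transformation matrix.
 */
final class ImagePlacement {
    final double x;      // where the image's origin corner lands (lower-left if not rotated)
    final double y;
    final double width;  // length of the transformed unit vector (1, 0)
    final double height; // length of the transformed unit vector (0, 1)
    private final double area; // area of the transformed unit square

    ImagePlacement(double a, double b, double c, double d, double e, double f) {
        this.x = e;
        this.y = f;
        this.width = Math.hypot(a, b);
        this.height = Math.hypot(c, d);
        // The absolute determinant is the area even when the image is rotated or skewed.
        this.area = Math.abs(a * d - b * c);
    }

    /** Fraction of the page area covered by the image, clamped to the range 0..1. */
    double coverage(double pageWidth, double pageHeight) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0) {
            return 0.0;
        }
        return Math.min(1.0, area / pageArea);
    }
}
```

An image whose coverage is close to 1 on a page without extractable text is a strong hint that the page is a scan. The page dimensions would have to come from the reader (for instance reader.getPageSize(i)), and the threshold you pick is a judgment call, not something the PDF tells you.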
Using all these elements, you can already go a long way toward achieving what you need, but be aware that there will always be exotic PDFs that escape all your checks.
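To make the combination of both signals concrete, here is a minimal sketch of the decision step only. It is plain Java with no iText calls; PageClassifier and its Kind enum are names invented for this example. It assumes the text for a page comes from PdfTextExtractor.getTextFromPage(reader, i), as in your loop, and that the image count comes from your own RenderListener that increments a counter each time renderImage() is triggered.

```java
/**
 * Hypothetical helper (not part of iText): a rough classification of one page
 * based on the extracted text and the number of images reported for it.
 * This is a heuristic; exotic PDFs can end up in the wrong category.
 */
final class PageClassifier {

    enum Kind { TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGE, EMPTY_OR_UNKNOWN }

    private PageClassifier() {
        // static helper, not meant to be instantiated
    }

    /**
     * @param extractedText text extracted from the page (may be null or blank)
     * @param imageCount    number of times renderImage() was triggered for the page
     */
    static Kind classify(String extractedText, int imageCount) {
        // Whitespace alone does not count as text.
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        boolean hasImage = imageCount > 0;

        if (hasText && hasImage) {
            return Kind.TEXT_AND_IMAGE;
        }
        if (hasText) {
            return Kind.TEXT_ONLY;
        }
        if (hasImage) {
            return Kind.IMAGE_ONLY;
        }
        // Nothing detected: a blank page, or glyphs drawn as shapes.
        return Kind.EMPTY_OR_UNKNOWN;
    }
}
```

Inside the loop this would be called as PageClassifier.classify(text, imagesOnPage). Note that TEXT_AND_IMAGE does not prove the page was typeset as text: a scanned page that went through OCR typically carries an invisible text layer on top of the picture and would land in that category too.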