Advanced PDF parser for Java

Tags: java, parsing, pdf

I want to extract different content from a PDF file in Java:

The complete visible text
images
links

Is it also possible to get the following?

document meta tags like title, description or author
only headlines
input elements if the document contains a form

I do not need to manipulate or render PDF files. Which library would be the best fit for that kind of purpose?

UPDATE

OK, I tried PDFBox:

Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
Field contents = luceneDocument.getField("contents");
System.out.println(contents.stringValue());

But the output is null. The field "summary" is OK though.

The next snippet works fine.

PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
System.out.println(text);
doc.close();

But then, I have no clue how to extract the images, links, etc.

UPDATE 2

I found an example of how to extract the images, but I still have no answer on how to extract:

links
document meta tags like title, description or author
only headlines
input elements if the document contains a form

asked Mar 27 '11 14:03

Alp

2 Answers

iText is my PDF tool of choice these days.

The complete visible text

"Visible" is a tough one. You can parse out all the parsable text with the classes in the com.itextpdf.text.pdf.parser package, but those classes don't know about clipping, so text that has been clipped away still gets extracted. You can constrain the parser to the page size easily enough.

// all text on the page, regardless of position
PdfTextExtractor.getTextFromPage(reader, pageNum);

You'd actually need the overload that takes a TextExtractionStrategy, and pass it a filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".
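To make the "constrain to the page size" idea concrete, here is a minimal, library-independent sketch of what such a region test boils down to. `PlacedText` and `PageBoxFilter` are hypothetical names, not iText classes, and the page dimensions are just an example. With iText you would hand the real page box to a filtered extraction strategy instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value type (not an iText class): a piece of text and the
// point where it starts, in the page's coordinate system.
final class PlacedText {
    final String text;
    final float x;
    final float y;

    PlacedText(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper: keeps only text whose start point lies inside the page box.
final class PageBoxFilter {
    private final float left;
    private final float bottom;
    private final float right;
    private final float top;

    PageBoxFilter(float left, float bottom, float right, float top) {
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }

    // True when the start point of the text falls within the box (edges included).
    boolean accepts(PlacedText t) {
        return t.x >= left && t.x <= right && t.y >= bottom && t.y <= top;
    }

    // Returns the accepted texts in their original order.
    List<String> visibleText(List<PlacedText> all) {
        List<String> kept = new ArrayList<String>();
        for (PlacedText t : all) {
            if (accepts(t)) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example box: 612 x 792 points with the origin at the lower-left corner.
        PageBoxFilter page = new PageBoxFilter(0f, 0f, 612f, 792f);
        List<PlacedText> all = Arrays.asList(
                new PlacedText("Body text", 72f, 700f),
                new PlacedText("Printer's note outside the page", -40f, 700f));
        System.out.println(page.visibleText(all));
    }
}
```

This only checks where a run starts. It does not model clipping paths, so it is a rough approximation of "visible" at best.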

images

Yep, via the same package classes. Image listeners aren't as well supported as text listeners, but do exist.

links

Yes. Links are "annotations" attached to PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.

// Annotations hang off the page dictionary; this only looks at page 1.
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null) {
  for (int i = 0; i < annots.size(); ++i) {
    PdfDictionary annotDict = annots.getAsDict(i);
    if (annotDict == null) {
      continue; // entry is not a dictionary, skip it
    }
    PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
    if (subType != null && PdfName.LINK.equals(subType)) {
      // External links carry a URI action in the annotation's /A entry.
      PdfDictionary action = annotDict.getAsDict(PdfName.A);
      if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
        dests.add(action.getAsString(PdfName.URI).toString());
      } // else { it's an internal link, meh }
    }
  }
}

You can find the PDF Spec here.

input elements

Definitely. For either XFA (LiveCycle Designer) or the older-tech "AcroForm" forms, iText can find all the fields, and their values.

AcroFields fields = myReader.getAcroFields();

Set<String> fieldNames = fields.getFields().keySet();
for (String fldName : fieldNames) {
  System.out.println( fldName + ": " + fields.getField( fldName ) );
}

Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative... but that'll get you started.

document meta tags like title, description or author

Pretty trivial. Yes.

Map<String, String> info = myPdfReader.getInfo();
System.out.println( info );

In addition to the basic author/title/etc., there's a fairly involved XML schema (XMP) you can access via reader.getMetadata(), which hands back the raw XML as a byte array.
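Once you have those metadata bytes, the JDK's own XML parser is enough to pull values out. The sketch below reads the Dublin Core title from an XMP packet. `XmpTitleReader` is a hypothetical helper, not part of iText, and the sample packet in `main` is made up. It also assumes the bytes are a well-formed XMP packet passed through unmodified.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of iText): reads dc:title from raw XMP bytes.
final class XmpTitleReader {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the first rdf:li under the first dc:title, or null if there is none.
    static String firstTitle(byte[] xmp) {
        if (xmp == null || xmp.length == 0) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookups below
            // PDFs are untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            // XMP stores the title as language alternatives: dc:title / rdf:Alt / rdf:li.
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF_NS, "li");
            return items.getLength() == 0 ? null : items.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet, standing in for the bytes from getMetadata().
        String sample =
                "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
              + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
              + "<rdf:Description rdf:about=\"\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
              + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample title</rdf:li></rdf:Alt></dc:title>"
              + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(firstTitle(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Other XMP properties such as dc:creator or dc:description follow the same pattern with a different local name and container element.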

only headlines

A TextRenderFilter can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
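The font-size criterion itself can be sketched without any PDF library. `TextRun` and `HeadlineFilter` are hypothetical names, and the 14-point cutoff is arbitrary. How you obtain a font size for each run from iText's render callbacks is not shown here and is left as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value type (not an iText class): one run of text plus the
// font size it was drawn at, in points.
final class TextRun {
    final String text;
    final float fontSize;

    TextRun(String text, float fontSize) {
        this.text = text;
        this.fontSize = fontSize;
    }
}

// Hypothetical helper: treats every run at or above a size threshold as a headline.
final class HeadlineFilter {
    // Returns the text of all runs with fontSize >= minSize, in document order.
    static List<String> headlines(List<TextRun> runs, float minSize) {
        List<String> result = new ArrayList<String>();
        for (TextRun run : runs) {
            if (run.fontSize >= minSize) {
                result.add(run.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextRun> runs = Arrays.asList(
                new TextRun("1. Introduction", 18f),
                new TextRun("PDF stores text as positioned glyph runs.", 10f),
                new TextRun("2. Parsing", 18f));
        System.out.println(headlines(runs, 14f));
    }
}
```

A fixed threshold is brittle. PDFs carry no notion of "headline" unless they are tagged, so in practice you may want to derive the cutoff from the most common font size in the document.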

answered Oct 09 '22 08:10

Mark Storer

Apache comes to the rescue, once again.

answered Oct 09 '22 08:10

Dhaivat Pandya
