I want to extract different content from a PDF file in Java:
Is it also possible to get the following?
I do not need to manipulate or render PDF files. Which library would be the best fit for that kind of purpose?
UPDATE
OK, I tried PDFBox:
Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
Field contents = luceneDocument.getField("contents");
System.out.println(contents.stringValue());
But the output is null. The field "summary" is OK though.
The next snippet works fine.
PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
System.out.println(text);
doc.close();
But then, I have no clue how to extract the images, links, etc.
UPDATE 2
I found an example of how to extract the images, but I still have no answer on how to extract:
The JDK does not provide any class for reading PDF files, so we have to depend on a third-party library. Several such libraries are available; in this section, we will use the Apache Tika library to read a PDF file in Java.
iText is my PDF tool of choice these days.
- The complete visible text
"Visible" is a tough one. You can parse out all the parsable text with the com.itextpdf.text.pdf.parse package's classes... but those classes don't know about CLIPPING. You can constrain the parser to the page size easily enough.
// all text on the page, regardless of position
PdfTextExtractor.getTextFromPage(reader, pageNum);
You'd actually need the overload that takes a TextExtractionStrategy, using the filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".
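To make the "constrain to the page size" idea concrete, here is a minimal sketch in plain Java, without iText. The Chunk class and the textInside method are hypothetical stand-ins for whatever the parser hands you, not iText API; the sketch only shows the test a region filter performs on each piece of text. As far as I know, iText 5 covers this in the same parse package with RegionTextRenderFilter wrapped in a FilteredTextRenderListener, but check that against the version you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RegionFilterSketch {

    /** Hypothetical stand-in for a parsed piece of text; not an iText class. */
    static final class Chunk {
        final String text;
        final double x; // start point of the text, in PDF user-space units
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps only the chunks whose start point lies inside the given rectangle. */
    static List<String> textInside(double llx, double lly, double urx, double ury,
                                   List<Chunk> chunks) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            boolean inside = c.x >= llx && c.x <= urx && c.y >= lly && c.y <= ury;
            if (inside) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 612 x 792 points is a US Letter page, used here as an assumed page size.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("on the page", 72, 700),
                new Chunk("off the page", 700, 700));
        System.out.println(textInside(0, 0, 612, 792, chunks)); // prints [on the page]
    }
}
```

This only checks the start point against a rectangle. It says nothing about clipping paths or text hidden behind other content, which is why "visible" remains the hard part.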
- images
Yep, via the same package classes. Image listeners aren't as well supported as text listeners, but do exist.
- links
Yes. Links are "annotations" to various PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.
// myReader is an open com.itextpdf.text.pdf.PdfReader; PdfDictionary, PdfArray,
// and PdfName come from the same package, ArrayList from java.util.
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null) {
    for (int i = 0; i < annots.size(); ++i) {
        PdfDictionary annotDict = annots.getAsDict(i);
        PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
        if (subType != null && PdfName.LINK.equals(subType)) {
            PdfDictionary action = annotDict.getAsDict(PdfName.A);
            // Only URI actions carry an external destination.
            if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
                dests.add(action.getAsString(PdfName.URI).toString());
            } // else { it's an internal link, meh }
        }
    }
}
You can find the PDF Spec here.
- input elements
Definitely. For either XFA (LiveCycle Designer) or the older-tech "AcroForm" forms, iText can find all the fields, and their values.
AcroFields fields = myReader.getAcroFields();
Set<String> fieldNames = fields.getFields().keySet();
for (String fldName : fieldNames) {
    System.out.println( fldName + ": " + fields.getField( fldName ) );
}
Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative... but that'll get you started.
- document meta tags like title, description or author
Pretty trivial. Yes.
Map<String, String> info = myPdfReader.getInfo();
System.out.println( info );
In addition to the basic author/title/etc, there's a fairly involved XML schema you can access via reader.getMetadata().
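The bytes returned by getMetadata() are an XMP packet, which is RDF/XML, so the JDK's own XML parser can read them. Below is a rough sketch, assuming the packet stores the title as a Dublin Core dc:title element holding an rdf:li value, which is the usual layout. The class name XmpSketch and the method dublinCore are made up for illustration, and the packet in main is a hand-written sample, not output from a real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XmpSketch {
    private static final String DC = "http://purl.org/dc/elements/1.1/";
    private static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the first value of a Dublin Core property such as "title", or null if absent. */
    static String dublinCore(byte[] xmpPacket, String property) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look elements up by namespace
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document xml = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmpPacket));

            NodeList props = xml.getElementsByTagNameNS(DC, property);
            if (props.getLength() == 0) {
                return null;
            }
            Element prop = (Element) props.item(0);
            // Values usually sit in rdf:li items inside an rdf:Alt, rdf:Seq or rdf:Bag.
            NodeList items = prop.getElementsByTagNameNS(RDF, "li");
            Element value = items.getLength() > 0 ? (Element) items.item(0) : prop;
            return value.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample packet; with iText you would pass reader.getMetadata().
        String xmp =
                "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
              + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
              + "<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>"
              + "<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Sample</rdf:li></rdf:Alt></dc:title>"
              + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(dublinCore(xmp.getBytes(StandardCharsets.UTF_8), "title")); // prints Sample
    }
}
```

Properties written as attributes on rdf:Description rather than as child elements are not picked up by this sketch, and getMetadata() may return null when the file has no XMP at all, so guard for that before parsing.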
- only headlines
A TextRenderFilter can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
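One way the font-size criterion could work is sketched below in plain Java. It assumes you have already collected each piece of text together with its font size; the Chunk class is a hypothetical stand-in, not an iText type. The heuristic, which is my assumption rather than anything iText provides, treats the size that covers the most characters as body text and keeps anything larger as a headline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeadlineSketch {

    /** Hypothetical stand-in for a piece of text plus its font size; not an iText class. */
    static final class Chunk {
        final String text;
        final float fontSize;

        Chunk(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Returns the font size that covers the most characters, taken to be the body size. */
    static float bodyFontSize(List<Chunk> chunks) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (Chunk c : chunks) {
            int total = charsPerSize.merge(c.fontSize, c.text.length(), Integer::sum);
            if (total > bestCount) {
                bestCount = total;
                best = c.fontSize;
            }
        }
        return best;
    }

    /** Keeps only the chunks set in a larger size than the body text. */
    static List<String> headlines(List<Chunk> chunks) {
        float body = bodyFontSize(chunks);
        List<String> result = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.fontSize > body) {
                result.add(c.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Title", 18f),
                new Chunk("Body text one", 10f),
                new Chunk("Section", 14f),
                new Chunk("Body text two", 10f));
        System.out.println(headlines(chunks)); // prints [Title, Section]
    }
}
```

Real documents often report slightly different sizes for the same visual size, so rounding the sizes before counting is probably worth adding. Bold headlines set at body size would slip through this check entirely.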
Apache comes to the rescue, once again.