Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read Text and Image Locations (x.y coordinates) using PDFBox

Tags:

java

pdfbox

I am doing a java program to read encrypted PDF files and extract the contents of the file page by page including the text, images and their positions(x,y coordinates) in the file. Now I'm using PDFBox for this purpose and I'm getting the text and images. But I couldn't get the text position and image position. Also there are some problems reading some encrypted PDF files.

like image 508
Suresh Somanathan Avatar asked Sep 28 '11 09:09

Suresh Somanathan


People also ask

What is the use of PDFBox?

It allows the creation of new PDF documents, manipulation of existing documents, bookmarking PDF and the ability to extract content from PDF documents. We can also use it to digitally sign, print and validate files against the PDF/A-1b standard. PDFBox library was originally developed in 2002 by Ben Litchfield.


1 Answers

Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful to make analyses on the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).

Usage example:

java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt

You'll get something like that:

Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...

Which you can easily parse and use to plot element's position, bounding-box, and the "flow" (trajectory through all the elements), etc. for each page. As I'm sure you are already aware of, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world", but that jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.

For the second part of your question, I had good results using xpdf version 3.02, after fixing Xref.cc (make XRef::okToPrint(),XRef::okToChange(),XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That's to handle locked documents, not encrypted ones (there are other utils out there for that).

like image 51
Pierre D Avatar answered Sep 20 '22 21:09

Pierre D