Read Text and Image Locations (x, y coordinates) using PDFBox

Tags: java, pdfbox

I am writing a Java program that reads encrypted PDF files and extracts their contents page by page, including the text, the images, and their positions (x, y coordinates) on the page. I'm currently using PDFBox for this, and I can get the text and the images, but I can't get the text positions or the image positions. I also have problems reading some encrypted PDF files.

asked Sep 28 '11 09:09

Suresh Somanathan

1 Answer

Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit, and it's very helpful for analyzing the layout of elements and their bounding boxes in PDF documents. It has also revealed items printed in white ink or placed outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).

Usage example:

java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt

You'll get something like this:

Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...

This output is easy to parse, and you can use it to plot each element's position, its bounding box, and the "flow" (the trajectory through all the elements) on each page.

As I'm sure you are already aware, PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world" but jumps randomly through the character positions (and uses different glyphs than any ISO character encoding, if you so choose), which makes the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
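To illustrate the parsing step, here is a minimal sketch in plain Java that reads lines in the format shown in the sample output and computes a rough extent of the glyphs. TextLocationParser, Glyph, parse and bounds are hypothetical names made up for this example, not part of PDFBox, and the regular expression assumes the output looks exactly like the sample (other PDFBox versions may print different fields).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): parses PrintTextLocations output.
class TextLocationParser {

    // One printed glyph: its x/y position, its width, and the text itself.
    static final class Glyph {
        final double x, y, width;
        final String text;

        Glyph(double x, double y, double width, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.text = text;
        }
    }

    // Matches e.g. "String[53.9,59.856995 fs=-6.0 ... width=4.6679993]A".
    // Assumption: the format is the one shown in the sample output.
    private static final Pattern LINE = Pattern.compile(
        "^String\\[([-0-9.Ee]+),([-0-9.Ee]+) [^\\]]*width=([-0-9.Ee]+)\\](.*)$");

    // Parses every matching line; anything else ("Processing page: 0",
    // "...", blank lines) is skipped.
    static List<Glyph> parse(List<String> lines) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue;
            }
            glyphs.add(new Glyph(
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3)),
                m.group(4)));
        }
        return glyphs;
    }

    // Returns {minX, minY, maxX, maxY}. Horizontally each glyph spans
    // x to x + width; vertically only the reported y values are used
    // (glyph height is ignored in this sketch).
    static double[] bounds(List<Glyph> glyphs) {
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            minX = Math.min(minX, g.x);
            maxX = Math.max(maxX, g.x + g.width);
            minY = Math.min(minY, g.y);
            maxY = Math.max(maxY, g.y);
        }
        return new double[] {minX, minY, maxX, maxY};
    }
}
```

You would feed it the lines of out-text-locations.txt (for example via java.nio.file.Files.readAllLines) and then plot or post-process the resulting Glyph list. Splitting the input on the "Processing page:" lines to get per-page lists is left out here.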

For the second part of your question, I had good results with xpdf version 3.02 after patching XRef.cc so that XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue. Note that this handles locked documents, not encrypted ones (there are other utilities out there for those).

answered Sep 20 '22 21:09

Pierre D
