Using PDFBox to determine the coordinates of words in a document

I'm using PDFBox to extract the coordinates of words/strings in a PDF document, and have so far had success determining the position of individual characters. This is the code so far, taken from the PDFBox documentation:

package printtextlocations;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            // getAllPages() returns an untyped list, hence the cast below.
            List<?> allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), contents.getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

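    /**
     * Called by the stream engine for each glyph it decodes. Overriding it is
     * what produces the per-character output; the inherited implementation
     * would collect the glyphs for text extraction instead of printing them.
     *
     * @param text the glyph to be processed, together with its position
     */
    @Override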
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

This produces a series of lines containing the position of each character, including spaces, that looks like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

Where 'P' is the character. I have not been able to find a function in PDFBox that finds words, and I am not familiar enough with Java to accurately concatenate these characters back into words that I can search through, even though the spaces are included. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinates of the first character of each word, so that part is simplified, but how to match a string against this kind of output is beyond me.

asked Aug 08 '12 21:08

jbrain

1 Answer

There is no function in PDFBox that allows you to extract words automatically. I'm currently working on extracting data to gather it into blocks and here is my process:

1. I extract all the characters of the document (called glyphs) and store them in a list.
2. I loop over the list and analyze the coordinates of each glyph. If two consecutive glyphs overlap vertically (the top of the current glyph lies between the top and bottom of the preceding one, or the bottom of the current glyph lies between the top and bottom of the preceding one), I add the current glyph to the same line.
3. At this point, I have extracted the different lines of the document. Be careful: if your document is multi-column, a "line" here means all the glyphs that overlap vertically, i.e. the text of all the columns that share the same vertical coordinates.
4. Then you can compare the left coordinate of the current glyph to the right coordinate of the preceding one to determine whether they belong to the same word. The PDFTextStripper class provides a getSpacingTolerance() method; its value, chosen by trial and error, is a factor (0.5 by default, as far as I know) that is applied to the width of a space to estimate a "normal" gap between words. If the distance between the right and left coordinates is lower than that gap, both glyphs belong to the same word.

I applied this method to my work and it works well.
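
The steps above can be sketched in plain Java without any PDFBox dependency. This is only an illustration under stated assumptions, not the code used in the answer: Glyph and Word are hypothetical classes that hold the values printed for each TextPosition, y is assumed to be the baseline measured from the top of the page (which is how getYDirAdj() appears to report it), and the 0.5 tolerance factor is an assumed value for the word-gap threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: groups glyph boxes into lines, then into words. */
class WordGrouper {

    /** Hypothetical holder for the values printed per TextPosition. */
    static final class Glyph {
        final String ch;
        final float x, y, width, height, spaceWidth;

        Glyph(String ch, float x, float y, float width, float height, float spaceWidth) {
            this.ch = ch; this.x = x; this.y = y;
            this.width = width; this.height = height; this.spaceWidth = spaceWidth;
        }
        float right()  { return x + width; }
        float top()    { return y - height; } // assumes y is the baseline, growing downward
        float bottom() { return y; }
    }

    /** A word together with the coordinates of its first glyph. */
    static final class Word {
        final String text;
        final float x, y;
        Word(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
        @Override public String toString() { return text + " @ (" + x + ", " + y + ")"; }
    }

    /** Interval overlap test; covers both cases from the answer, plus containment. */
    static boolean overlapsVertically(Glyph a, Glyph b) {
        return a.top() <= b.bottom() && b.top() <= a.bottom();
    }

    /** Steps 2 and 3: a glyph joins the current line if it overlaps the preceding glyph. */
    static List<List<Glyph>> toLines(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<List<Glyph>> lines = new ArrayList<>();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev == null || !overlapsVertically(prev, g)) {
                lines.add(new ArrayList<>());
            }
            lines.get(lines.size() - 1).add(g);
            prev = g;
        }
        return lines;
    }

    /** Step 4: split a line at space glyphs and at gaps wider than tolerance * space width. */
    static List<Word> toWords(List<Glyph> line, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        Glyph first = null; // first glyph of the word being built
        Glyph prev = null;  // last non-space glyph appended to that word
        for (Glyph g : sorted) {
            boolean isSpace = g.ch.trim().isEmpty();
            boolean bigGap = prev != null && g.x - prev.right() > tolerance * prev.spaceWidth;
            if ((isSpace || bigGap) && first != null) {
                words.add(new Word(text.toString(), first.x, first.y));
                text.setLength(0);
                first = null;
            }
            if (isSpace) { prev = null; continue; }
            if (first == null) first = g;
            text.append(g.ch);
            prev = g;
        }
        if (first != null) words.add(new Word(text.toString(), first.x, first.y));
        return words;
    }

    static List<Word> findWords(List<Glyph> glyphs, float tolerance) {
        List<Word> words = new ArrayList<>();
        for (List<Glyph> line : toLines(glyphs)) words.addAll(toWords(line, tolerance));
        return words;
    }

    /** Made-up data: "Hi PDF" (split by a gap only) above "to be" (split by a space glyph). */
    static List<Glyph> sample() {
        String[] chars = {"H", "i", "P", "D", "F", "t", "o", " ", "b", "e"};
        float[] xs     = {10, 15, 24, 29, 34, 10, 13, 18, 22, 27};
        float[] widths = {5, 3, 5, 5, 5, 3, 5, 4, 5, 5};
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < chars.length; i++) {
            float baseline = i < 5 ? 50f : 70f;
            glyphs.add(new Glyph(chars[i], xs[i], baseline, widths[i], 10f, 4f));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        for (Word w : findWords(sample(), 0.5f)) System.out.println(w);
    }
}
```

With the made-up sample data, the sketch should report the words Hi, PDF, to and be, each with the coordinates of its first glyph. To locate a given string, you could compare Word.text against it and read x and y from the match. In a real program, the glyph list would be filled inside processTextPosition rather than built by hand, and multi-column pages would still need the extra care mentioned in step 3.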

answered Nov 15 '22 14:11

Nicolas W.

Tags: java, pdf, pdfbox
