Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using PDFbox to determine the coordinates of words in a document

Tags:

java

pdf

pdfbox

I'm using PDFbox to extract the coordinates of words/strings in a PDF document, and have so far had success determining the position of individual characters. this is the code thus far, from the PDFbox doc:

package printtextlocations;

import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

import java.io.IOException;
import java.util.List;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

This produces a series of lines containing the position of each character, including spaces, that looks like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

Where 'P' is the character. I have not been able to find a function in PDFbox to find words, and I am not familiar enough with Java to be able to accurately concatenate these characters back into words to search through even though the spaces are also included. Has anyone else been in a similar situation, and if so how did you approach it? I really only need the coordinate of the first character in the word so that parts simplified, but as to how I'm going to match a string against that kind of output is beyond me.

like image 377
jbrain Avatar asked Aug 08 '12 21:08

jbrain


People also ask

What is PDFBox used for?

Apache PDFBox is an open source Java library that can be used to create, render, print, split, merge, alter, verify and extract text and meta-data of PDF files.

Which is better iText or PDFBox?

One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot.


Video Answer


1 Answers

There is no function in PDFBox that allows you to extract words automatically. I'm currently working on extracting data to gather it into blocks and here is my process:

  1. I extract all the characters of the document (called glyphs) and store them in a list.

  2. I do an analysis of the coordinates of each glyph, looping over the list. If they overlap (if the top of the current glyph is contained between the top and bottom of the preceding/or the bottom of the current glyph is contained between the top and bottom of the preceding one), I add it to the same line.

  3. At this point, I have extracted the different lines of the document (be careful, if your document is multi-column, the expression "lines" means all the glyphs that overlap vertically, ie the text of all the columns that have the same vertical coordinates).

  4. Then, you can compare the left coordinate of the current glyph to the right coordinate of the preceding one to determine if they belong to the same word or not (the PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trials and errors, the value of a "normal" space. If the difference between the right and the left coordinates is lower than this value, both glyphs belong to the same word.

I applied this method to my work and it works good.

like image 184
Nicolas W. Avatar answered Nov 15 '22 14:11

Nicolas W.