
How to search for a specific string or word and its coordinates in a PDF document in Java

Tags:

java

pdfbox

I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I expected: the output shows the coordinates of each individual character instead of the whole string.

Here is the code:

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    // Prints one line per character, which is why only character coordinates appear
    for (TextPosition text : textPositions) {
        System.out.println( "String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}

I am using PDFBox 2.0.

Ashok Kanwal asked Dec 19 '22 18:12


1 Answer

The last method in which PDFBox's PDFTextStripper class still has the text together with its positions (before it is reduced to plain text) is the method

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException

One should intercept here because this method receives pre-processed TextPosition objects, in particular sorted ones (if sorting was requested to start with).

(Actually I would have preferred to intercept in the calling method writeLine, which, judging by the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe this will change by the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)

Furthermore, it is helpful to wrap sequences of TextPosition instances in a String-like helper class to make the code clearer.

With this in mind, one can search for the variables like this:

List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
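The search loop inside writeString resumes at index + 1 after each hit, so overlapping occurrences are reported too. Isolated from PDFBox, the same loop can be sketched on plain strings as follows. The class name OccurrenceFinder and the method findAll are hypothetical names used only for illustration, and the guard against an empty search term is an addition of this sketch: String.indexOf with an empty argument never returns -1, so the loop as written above would not terminate for an empty searchTerm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class for illustration; not part of PDFBox.
class OccurrenceFinder
{
    // Returns the start index of every occurrence of searchTerm in text,
    // including overlapping ones, because the search resumes at index + 1.
    static List<Integer> findAll(String text, String searchTerm)
    {
        List<Integer> starts = new ArrayList<>();
        // Assumption of this sketch: an empty term yields no hits.
        // Without this guard indexOf would keep matching and the loop would never end.
        if (searchTerm.isEmpty())
            return starts;

        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(searchTerm, fromIndex)) > -1)
        {
            starts.add(index);
            fromIndex = index + 1;
        }
        return starts;
    }
}
```

For example, OccurrenceFinder.findAll("a ${abc} b ${abc}", "${abc}") returns [2, 11], and OccurrenceFinder.findAll("aaaa", "aa") returns [0, 1, 2].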

with this helper class

public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        if (end == start)
            return 0;
        TextPosition first = textPositions.get(start);
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
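The width calculation in getWidth can be checked without a PDF at hand. The following is a minimal sketch in which GlyphBox is a hypothetical stand-in for the two TextPosition values involved (x for getXDirAdj(), width for getWidthDirAdj()); it is not a PDFBox class. It assumes, as the helper class above implicitly does, left-to-right text on a single line.

```java
import java.util.List;

// Hypothetical stand-in for the TextPosition values used by getWidth;
// x corresponds to getXDirAdj(), width to getWidthDirAdj().
class GlyphBox
{
    final float x;
    final float width;

    GlyphBox(float x, float width)
    {
        this.x = x;
        this.width = width;
    }
}

class HitWidthSketch
{
    // Distance from the left edge of glyph 'start' to the right edge
    // of glyph 'end - 1'; an empty range has width 0.
    static float width(List<GlyphBox> glyphs, int start, int end)
    {
        if (end == start)
            return 0;
        GlyphBox first = glyphs.get(start);
        GlyphBox last = glyphs.get(end - 1);
        return last.x + last.width - first.x;
    }
}
```

With three glyphs at x = 10, 15, and 20 and widths 5, 5, and 4, the range from 0 to 3 spans 20 + 4 - 10 = 14 units.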

To simply print the position and width of each hit, along with its last letter and that letter's position, you can then use this:

void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}

For tests I created a small test file using MS Word:

(Screenshot: sample file with variables)

The output of this test

@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}

is

Variables.pdf
-------------

* Looking for '${var1}'
  Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
  Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
  Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
  Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18

* Looking for '${var 2}'
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

I was a bit surprised that ${var 2} was found when it sits on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives single words. It looks as if it receives longer parts of the line than mere words...
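Why the chunk size matters can be illustrated without PDFBox: the search runs separately inside each chunk handed to writeString, so a term containing a space can only be found if the whole term arrives in one chunk. The sketch below uses plain strings as hypothetical chunks; ChunkSearchSketch and countHits are illustrative names, and how PDFBox actually splits a line is not modelled here.

```java
import java.util.List;

// Hypothetical illustration; each string stands for the text of one writeString call.
class ChunkSearchSketch
{
    // Counts hits when every chunk is searched on its own,
    // so a term split across two chunks is never found.
    static int countHits(List<String> chunks, String searchTerm)
    {
        // Assumption of this sketch: an empty term counts as no hit.
        if (searchTerm.isEmpty())
            return 0;

        int hits = 0;
        for (String chunk : chunks)
        {
            int fromIndex = 0;
            int index;
            while ((index = chunk.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits++;
                fromIndex = index + 1;
            }
        }
        return hits;
    }
}
```

Searching for "${var 2}" in the single chunk "Text ${var 2} text" gives one hit, while the same text delivered as the four chunks "Text", "${var", "2}", "text" gives none.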

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.

mkl answered Dec 21 '22 09:12