I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I wanted: the output shows the coordinates of each individual character rather than of the whole string.
Here is the code:
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text : textPositions) {
        System.out.println("String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}
I am using PDFBox 2.0.
The last method in which PDFBox' PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method
/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if one requested sorting to start with).
(Actually, I would have preferred to intercept in the calling method writeLine, which, according to the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe this changes before the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)
Furthermore, it is helpful to use a helper class to wrap sequences of TextPosition instances in a String-like class to make the code clearer.
With this in mind, one can search for the variables like this:
List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };

    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
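The inner loop advances fromIndex by one rather than by the length of the search term, so overlapping occurrences are reported as well. A minimal sketch of just that loop on a plain String, without any PDFBox types, might look like the following. The names OverlappingSearch and findAll are made up for illustration, and the guard against an empty search term is an addition that the code above does not have (String.indexOf always finds the empty string, so the loop would otherwise never end).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: the same indexOf loop on a plain String.
final class OverlappingSearch
{
    // Returns the start index of every occurrence of term in text, overlapping ones included.
    static List<Integer> findAll(String text, String term)
    {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty())
            return starts; // added guard: indexOf("") always matches, so the loop would not end

        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1)
        {
            starts.add(index);
            fromIndex = index + 1; // step by one so that overlapping hits are kept
        }
        return starts;
    }

    public static void main(String[] args)
    {
        System.out.println(findAll("x ${var1} y ${var1}", "${var1}")); // [2, 12]
        System.out.println(findAll("aaaa", "aa"));                     // [0, 1, 2]
    }
}
```

If overlapping hits are not wanted, stepping by the length of the search term instead of by one would be the obvious variation.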
with this helper class
public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        if (end == start)
            return 0;
        TextPosition first = textPositions.get(start);
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
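The index arithmetic in subSequence and the width calculation can be tried out without a PDF. The following is only a sketch under stated assumptions: Glyph is a hypothetical stand-in for TextPosition that holds one character, its x position and its width, GlyphRun mirrors the relevant parts of TextPositionSequence, and the monospaced factory is an invented convenience for building test data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TextPosition: one character with its x position and width.
final class Glyph
{
    final char c;
    final float x;
    final float width;

    Glyph(char c, float x, float width)
    {
        this.c = c;
        this.x = x;
        this.width = width;
    }
}

// Mirrors the offset handling of TextPositionSequence on top of Glyph.
final class GlyphRun implements CharSequence
{
    private final List<Glyph> glyphs;
    private final int start, end;

    GlyphRun(List<Glyph> glyphs, int start, int end)
    {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    // Invented factory: lays out text with a fixed advance per character.
    static GlyphRun monospaced(String text, float x0, float advance)
    {
        List<Glyph> list = new ArrayList<>();
        for (int i = 0; i < text.length(); i++)
            list.add(new Glyph(text.charAt(i), x0 + i * advance, advance));
        return new GlyphRun(list, 0, list.size());
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return glyphs.get(start + index).c; }

    // Indices are relative to this run, so both are shifted by this.start.
    @Override public GlyphRun subSequence(int from, int to)
    {
        return new GlyphRun(glyphs, start + from, start + to);
    }

    @Override public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
            builder.append(charAt(i));
        return builder.toString();
    }

    float getX() { return glyphs.get(start).x; }

    // Right edge of the last glyph minus left edge of the first one.
    float getWidth()
    {
        if (end == start)
            return 0;
        Glyph last = glyphs.get(end - 1);
        return last.x + last.width - glyphs.get(start).x;
    }
}
```

With GlyphRun.monospaced("a${b}c", 10, 5), subSequence(1, 5) yields "${b}" starting at x = 15 with width 20, and calling subSequence(1, 3) on that result yields "{b" at x = 20. The second call only works because both indices are offset by this.start, which is the same reason TextPositionSequence.subSequence adds this.start to its arguments.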
To simply print their positions, widths, final letters, and final letter positions, you can then use this:
void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}
For tests I created a small test file, Variables.pdf, using MS Word.
The output of this test
@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}
is
Variables.pdf
-------------
* Looking for '${var1}'
Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
* Looking for '${var 2}'
Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
I was a bit surprised that ${var 2} has been found when it is on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives single words. It looks as if it receives longer parts of the line than mere words...
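A consequence of searching inside writeString is that every chunk is searched on its own, so a term that happens to be split across two calls (for example by a line break) cannot be found this way. The sketch below illustrates that with plain strings standing in for the text of successive writeString calls. ChunkSearch and countHits are hypothetical names, and the way the sample text is split into chunks is made up rather than taken from an actual PDFBox run.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: each chunk plays the role of the text of one writeString call.
final class ChunkSearch
{
    // Counts occurrences of term, searching every chunk independently of the others.
    static int countHits(List<String> chunks, String term)
    {
        if (term.isEmpty())
            return 0; // added guard: indexOf("") always matches, so the loop would not end

        int hits = 0;
        for (String chunk : chunks)
        {
            int fromIndex = 0;
            int index;
            while ((index = chunk.indexOf(term, fromIndex)) > -1)
            {
                hits++;
                fromIndex = index + 1;
            }
        }
        return hits;
    }

    public static void main(String[] args)
    {
        // Term completely inside one chunk: found once.
        System.out.println(countHits(Arrays.asList("Hello ${var 2} world"), "${var 2}"));
        // Same term split across two chunks: not found.
        System.out.println(countHits(Arrays.asList("Hello ${var", "2} world"), "${var 2}"));
    }
}
```

If such split terms matter, one would have to collect the TextPosition lists of several calls before searching, which goes beyond what the code in this answer does.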
If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.