I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?
Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:
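MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    // Methods that implement an interface must be declared public.
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
        // track text and location here.
    }
    // TextExtractionStrategy also requires this method; return the collected text.
    public String getResultantText() {
        return "";
    }
}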
You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify a copy of LocationTextExtractionStrategy to store parallel arrays of strings and rects.
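To make the "combine text that shares a baseline" idea concrete, here is a minimal sketch in plain Java with no iText dependency. It is not the LocationTextExtractionStrategy code. The Chunk class, the mergeIntoWords method, and the maxGap threshold are all hypothetical names made up for illustration. It assumes horizontal left-to-right text, that baselines on one line round to the same integer, and that whitespace arrives as its own chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class WordGrouper {
    /** Hypothetical record of one renderText() call; not an iText type. */
    static final class Chunk {
        final String text;
        final float baseline;           // y of the baseline, used as the line key
        final float llx, lly, urx, ury; // lower-left and upper-right corners

        Chunk(String text, float baseline, float llx, float lly, float urx, float ury) {
            this.text = text;
            this.baseline = baseline;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /** Merges chunks on the same rounded baseline whose horizontal gap is at most maxGap. */
    static List<Chunk> mergeIntoWords(List<Chunk> chunks, float maxGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Higher baselines first (top of the page), then left to right.
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -Math.round(c.baseline))
                .thenComparingDouble(c -> c.llx));

        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        for (Chunk c : sorted) {
            if (c.text.trim().isEmpty()) {
                // A whitespace chunk ends the word in progress.
                if (current != null) words.add(current);
                current = null;
                continue;
            }
            boolean sameLine = current != null
                    && Math.round(current.baseline) == Math.round(c.baseline);
            if (sameLine && c.llx - current.urx <= maxGap) {
                // Extend the current word's text and rectangle.
                current = new Chunk(current.text + c.text, current.baseline,
                        current.llx, Math.min(current.lly, c.lly),
                        Math.max(current.urx, c.urx), Math.max(current.ury, c.ury));
            } else {
                if (current != null) words.add(current);
                current = c;
            }
        }
        if (current != null) words.add(current);
        return words;
    }

    public static void main(String[] args) {
        List<Chunk> words = mergeIntoWords(Arrays.asList(
                new Chunk("Hel", 102f, 10f, 100f, 25f, 112f),
                new Chunk("lo", 102f, 25.5f, 100f, 35f, 112f),
                new Chunk(" ", 102f, 35f, 100f, 38f, 112f),
                new Chunk("world", 102f, 38f, 100f, 60f, 112f)), 1f);
        for (Chunk w : words) {
            System.out.println(w.text + " [" + w.llx + ", " + w.lly + ", " + w.urx + ", " + w.ury + "]");
        }
    }
}
```

In a real strategy, renderText would append one such chunk per call and the merge would run once the page has been processed. The gap threshold is a guess you would tune per document, since PDFs differ in how they split and space text.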
PS: to build the rects, you can just get the AscentLine and DescentLine and use their coordinates as the bottom-left and top-right corners:
// Vector is com.itextpdf.text.pdf.parser.Vector; Rectangle is com.itextpdf.text.Rectangle.
Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));
Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will break it, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
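If you do need to cope with rotated text, one possible workaround (not something the code above does) is to take the axis-aligned box around all four corners: the start and end points of both the descent line and the ascent line. The sketch below is plain Java with no iText dependency. TextBounds and boundingBox are hypothetical names, and each point is simply a {x, y} array you would fill from the corresponding Vector.

```java
import java.util.Arrays;

class TextBounds {
    /** Returns {llx, lly, urx, ury} enclosing every given {x, y} point. */
    static float[] boundingBox(float[]... points) {
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (float[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new float[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Text running bottom-to-top: the descent line sits to the right of the ascent line.
        float[] box = boundingBox(
                new float[] {100f, 10f}, new float[] {100f, 50f},  // descent start, end
                new float[] {88f, 10f}, new float[] {88f, 50f});   // ascent start, end
        System.out.println(Arrays.toString(box));
    }
}
```

For text at an arbitrary angle this gives the enclosing upright box rather than the tilted one, which is usually enough for highlighting or hit-testing but overstates the area the glyphs actually cover.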
Good hunting.