I am using iText to extract some text from a PDF file at a specific location. To do that, I am using the LocationTextExtractionStrategy together with a region filter:
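    import com.itextpdf.text.Rectangle;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
    import com.itextpdf.text.pdf.parser.RenderFilter;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    public static void main(String[] args) throws Exception {
        PdfReader pdfReader = new PdfReader("location_text_extraction_test.pdf");
        // Only text inside this rectangle is passed on to the strategy
        Rectangle rectangle = new Rectangle(38, 0, 516, 516);
        RenderFilter[] filter = {new RegionTextRenderFilter(rectangle)};
        TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        String text = PdfTextExtractor.getTextFromPage(pdfReader, 1, strategy);
        System.out.println(text);
        pdfReader.close();
    }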
Link to pdf file
The problem is that the extracted text comes out in the wrong order.
What should be extracted as:
Part Description Quantity Unit Price Total For Line Extended Price Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
is extracted as:
Total For Line Extended Price Part Description Quantity Unit Price 1.00 407.84 $ USD 407.84 407.84 $ Landing Fee
Note that when I open the PDF in Acrobat, select all the text with Ctrl+A, copy it, and paste it into a text editor, everything is in the correct order.
Is there a way to resolve this problem? Thanks a lot ;)
The cause is simply that "Total For Line Extended Price" is at a y coordinate of 507.37 while "Part Description Quantity Unit Price" is at a y coordinate of 506.42.
The LocationTextExtractionStrategy allows for small variations by only considering the integer part of the y coordinates, but even the integer parts differ here (507 vs. 506). Thus, it assumes the former headings to be on a line above the latter ones and outputs its results accordingly.
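To make the effect concrete, here is a minimal sketch of that comparison. It is not iText code; it only models "order by the integer part of y, then by x". The class and helper names are hypothetical, and the x positions (38 and 400) are made-up values; only the two y coordinates come from your document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model, not iText code: it only mimics "compare lines by the
// integer part of y, then order by x within a line".
class IntegerLineKeyDemo {

    static final class Chunk {
        final String text;
        final double x; // assumed start x, for illustration only
        final double y; // baseline y as quoted in the answer

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Hypothetical helper: keep only the integer part of the y coordinate.
    static int lineKey(double y) {
        return (int) y;
    }

    // Orders chunks top to bottom by integer line key, then left to right.
    static String extract(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -lineKey(c.y)) // larger y is higher on the page
                .thenComparingDouble(c -> c.x));
        StringBuilder result = new StringBuilder();
        for (Chunk c : sorted) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(c.text);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Part Description Quantity Unit Price", 38, 506.42));
        chunks.add(new Chunk("Total For Line Extended Price", 400, 507.37));
        System.out.println(lineKey(507.37) + " vs " + lineKey(506.42)); // 507 vs 506
        System.out.println(extract(chunks));
    }
}
```

Because the keys are 507 and 506, the two heading groups land on different "lines" and the right-hand group is emitted first, which is the ordering you observe.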
When such variations occur, a common first attempt is to try the SimpleTextExtractionStrategy, which keeps the text in the order in which it is drawn in the content stream. Unfortunately, this does not help here because the former text is actually drawn before the latter text. Thus, this strategy also returns the headings in the wrong order.
In such a situation you need a strategy that works differently, e.g. the strategy HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 from this answer (depending on your iText version: the former for versions up to iText 5.5.8, the latter for the current development code 5.5.9-SNAPSHOT). Using it you'll get:
Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
Parking 1.00 101.96$ USD 101.96 101.96$
??? 1.00 51.65$ USD 51.65 51.65$
Pax Baggage Handling Fee 5.00 8.49$ USD 42.45 42.45 $
Pax Airport Tax 5.00 26.36 $ USD 131.80 131.80$
GA terminal for crew on Arr ferry fit 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Pax on Dep. 5.00 124.00$ USD 620.00 620.00 $
GA terminal for crew on dep. 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Guest on Dep. 1.00 38.00$ USD 38.00 38.00 $
Crew transfer on arr 1.00 70.00 $ USD 70.00 70.00 $
Crew transfer on dep 1.00 70.00 $ USD 70.00 70.00 $
Lavatory Service 1.00 75.00 $ USD 75.00 75.00 $
Catering-ISS 1.00 1,324.28 $ USD 1,324.28 1,324.28 $
Ground Handling 1.00 190.00$ USD 190.00 190.00$
Pax Handling 1.00 190.00$ USD 190.00 190.00$
Push Back 1.00 83.00 $ USD 83.00 83.00 $
Towing 1.00 110.00$ USD 110.00 110.00$
(Result of using the TextExtraction test method testLocation_text_extraction_test.)
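To illustrate why a different line-building rule can tolerate the small y offset, here is a rough sketch that puts chunks on the same line when their vertical extents overlap, instead of comparing integer baselines. This is not the code of the horizontal strategies mentioned above, only an assumption-laden illustration: the class and method names are hypothetical, and the x positions, the glyph extents (2 units below and 6 units above the baseline), and the baseline of the "Landing Fee" row are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: groups chunks into lines by vertical overlap.
class OverlapLineGroupingDemo {

    static final class Chunk {
        final String text;
        final double x;      // assumed start x
        final double bottom; // assumed lower edge of the glyphs
        final double top;    // assumed upper edge of the glyphs

        Chunk(String text, double x, double bottom, double top) {
            this.text = text;
            this.x = x;
            this.bottom = bottom;
            this.top = top;
        }
    }

    static final class Line {
        final double bottom;
        final double top;
        final List<Chunk> chunks = new ArrayList<>();

        // The vertical range of a line is fixed by its first chunk (a simplification).
        Line(Chunk first) {
            this.bottom = first.bottom;
            this.top = first.top;
            chunks.add(first);
        }

        boolean overlaps(Chunk c) {
            return c.bottom < top && c.top > bottom;
        }
    }

    static List<String> extractLines(List<Chunk> input) {
        List<Line> lines = new ArrayList<>();
        for (Chunk c : input) {
            Line target = null;
            for (Line line : lines) {
                if (line.overlaps(c)) {
                    target = line;
                    break;
                }
            }
            if (target == null) {
                lines.add(new Line(c));
            } else {
                target.chunks.add(c);
            }
        }
        // Top of the page first (larger y), then left to right within each line.
        lines.sort((a, b) -> Double.compare(b.top, a.top));
        List<String> result = new ArrayList<>();
        for (Line line : lines) {
            line.chunks.sort((a, b) -> Double.compare(a.x, b.x));
            List<String> texts = new ArrayList<>();
            for (Chunk c : line.chunks) {
                texts.add(c.text);
            }
            result.add(String.join(" ", texts));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Total For Line Extended Price", 400, 505.37, 513.37));
        chunks.add(new Chunk("Part Description Quantity Unit Price", 38, 504.42, 512.42));
        chunks.add(new Chunk("Landing Fee", 38, 488.0, 496.0));
        for (String line : extractLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

With these sample values the two heading groups share one line and come out left to right. The same overlap rule is also why side-by-side columns with vertically overlapping lines get merged into a single line.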
Unfortunately, though, these strategies fail if there are overlapping lines in different side-by-side columns, e.g. in your document the invoice recipient address and the information to its right.
You might either try to tweak the horizontal strategies (e.g. by also analyzing horizontal gaps separating columns) or try a combined approach, using the output of multiple strategies for the same document.
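As a starting point for the gap analysis, the following sketch looks for wide horizontal ranges that no text chunk covers and reports their midpoints as candidate column boundaries. It is only an illustration under assumptions: the class name, the method name, the sample spans, and the minimum gap width of 50 are all hypothetical, and in practice you would feed it the left and right x coordinates of the chunks collected for the affected part of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: finds wide horizontal gaps between text spans.
class ColumnGapDemo {

    // Each span is {left, right}. Returns the midpoints of all uncovered
    // horizontal ranges that are at least minGap wide.
    static List<Double> findColumnGaps(List<double[]> spans, double minGap) {
        List<double[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((double[] s) -> s[0]));
        List<Double> gaps = new ArrayList<>();
        if (sorted.isEmpty()) {
            return gaps;
        }
        double coveredUntil = sorted.get(0)[1];
        for (double[] span : sorted) {
            if (span[0] - coveredUntil >= minGap) {
                gaps.add((coveredUntil + span[0]) / 2);
            }
            coveredUntil = Math.max(coveredUntil, span[1]);
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<double[]> spans = new ArrayList<>();
        spans.add(new double[] {38, 200});  // e.g. an address line on the left
        spans.add(new double[] {40, 180});
        spans.add(new double[] {320, 500}); // e.g. information to its right
        spans.add(new double[] {330, 516});
        System.out.println(findColumnGaps(spans, 50)); // [260.0]
    }
}
```

Once a boundary is found, you could split the chunks at that x position and run the line building separately for each column.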