
iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order

I am using iText to extract text from a specific region of a PDF file. To do that, I am using the LocationTextExtractionStrategy:

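import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public static void main(String[] args) throws Exception {
    PdfReader pdfReader = new PdfReader("location_text_extraction_test.pdf");

    // Region to extract from: lower-left (38, 0), upper-right (516, 516)
    Rectangle rectangle = new Rectangle(38, 0, 516, 516);

    // Only text inside the rectangle reaches the location-based strategy
    RenderFilter[] filter = {new RegionTextRenderFilter(rectangle)};
    TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    String text = PdfTextExtractor.getTextFromPage(pdfReader, 1, strategy);

    System.out.println(text);

    pdfReader.close();
}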

Link to pdf file

The problem is that the extracted text is in the wrong order:

What should be extracted as:

Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $

is extracted as:

Total For Line Extended Price
Part Description Quantity Unit Price
1.00 407.84 $ USD 407.84 407.84 $
Landing Fee

Note that when I open the PDF in Acrobat, select all the text with Ctrl+A, copy it, and paste it into a text editor, everything is in the correct order.

Is there a way to resolve this problem? Thanks a lot ;)

Olivier Masseau asked Feb 11 '16 16:02


1 Answer

The cause is simply that "Total For Line Extended Price" is at a y coordinate of 507.37 while "Part Description Quantity Unit Price" is at a y coordinate of 506.42.

The LocationTextExtractionStrategy allows for small variations by considering only the integer part of the y coordinates, but even the integer parts differ here (507 vs. 506). Thus, it assumes the former headings to be on a line above the latter ones and outputs its results accordingly.
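To make the effect concrete, here is a minimal sketch in plain Java (no iText involved) of comparing only the integer parts of the two y coordinates quoted above. The class and method names are made up for illustration; this is not the iText source.

```java
public class IntegerPartDemo {

    // Hypothetical helper: reduces a y coordinate to its integer part,
    // which is the value used to decide whether two chunks share a line.
    public static int lineKey(double y) {
        return (int) y;
    }

    public static void main(String[] args) {
        double totalForLineY = 507.37;    // "Total For Line Extended Price"
        double partDescriptionY = 506.42; // "Part Description Quantity Unit Price"

        System.out.println(lineKey(totalForLineY));    // prints 507
        System.out.println(lineKey(partDescriptionY)); // prints 506

        // The keys differ, so the two headings are treated as separate lines.
        System.out.println(lineKey(totalForLineY) == lineKey(partDescriptionY)); // prints false

        // A pair that is almost as far apart can still share a key.
        System.out.println(lineKey(506.05) == lineKey(506.95)); // prints true
    }
}
```

The last comparison shows why truncation is a fragile tolerance: the two headings are only 0.95 apart, but they straddle an integer boundary.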

With such variations, a common first attempt is to try the SimpleTextExtractionStrategy. Unfortunately, this does not help here because the former text is actually drawn before the latter text, so this strategy also returns the headings in the wrong order.

In such a situation you need a strategy that works differently, e.g. the HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 from this answer (the former for iText versions up to 5.5.8, the latter for the current development code 5.5.9-SNAPSHOT). Using it you'll get

Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
Parking 1.00 101.96$ USD 101.96 101.96$
??? 1.00 51.65$ USD 51.65 51.65$
Pax Baggage Handling Fee 5.00 8.49$ USD 42.45 42.45 $
Pax Airport Tax 5.00 26.36 $ USD 131.80 131.80$
GA terminal for crew on Arr ferry fit 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Pax on Dep. 5.00 124.00$ USD 620.00 620.00 $
GA terminal for crew on dep. 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Guest on Dep. 1.00 38.00$ USD 38.00 38.00 $
Crew transfer on arr 1.00 70.00 $ USD 70.00 70.00 $
Crew transfer on dep 1.00 70.00 $ USD 70.00 70.00 $
Lavatory Service 1.00 75.00 $ USD 75.00 75.00 $
Catering-ISS 1.00 1,324.28 $ USD 1,324.28 1,324.28 $
Ground Handling 1.00 190.00$ USD 190.00 190.00$
Pax Handling 1.00 190.00$ USD 190.00 190.00$
Push Back 1.00 83.00 $ USD 83.00 83.00 $
Towing 1.00 110.00$ USD 110.00 110.00$

(result of using TextExtraction test method testLocation_text_extraction_test)
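For illustration only, here is a simplified plain-Java sketch of one way a line-oriented strategy can tolerate such small offsets: chunks whose vertical extents overlap are put on the same line and then sorted left to right. This is an assumption about the general idea, not the code of the linked strategies. The Chunk type, the x positions, the chunk heights, and the coordinates of the second row are all made up; only the two heading baselines (507.37 and 506.42) come from the document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OverlapLineGrouping {

    /** Hypothetical text chunk (not an iText class): text, left x, bottom y, top y. */
    static final class Chunk {
        final String text;
        final double x, bottom, top;

        Chunk(String text, double x, double bottom, double top) {
            this.text = text;
            this.x = x;
            this.bottom = bottom;
            this.top = top;
        }
    }

    /** Groups chunks with overlapping vertical extents into lines, top to bottom. */
    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so the highest chunk comes first.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.top).reversed());

        List<List<Chunk>> lines = new ArrayList<>();
        double lineBottom = 0;
        for (Chunk c : sorted) {
            if (lines.isEmpty() || c.top <= lineBottom) {
                // No vertical overlap with the current line: start a new one.
                lines.add(new ArrayList<>());
                lineBottom = c.bottom;
            } else {
                // Overlap: extend the current line downwards if necessary.
                lineBottom = Math.min(lineBottom, c.bottom);
            }
            lines.get(lines.size() - 1).add(c);
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            // Within a line, order chunks from left to right.
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (Chunk c : line) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // Added in drawing order, i.e. the order in which the naive output was wrong.
        chunks.add(new Chunk("Total For Line Extended Price", 380, 507.37, 515.37));
        chunks.add(new Chunk("Part Description Quantity Unit Price", 40, 506.42, 514.42));
        chunks.add(new Chunk("1.00 407.84 $ USD 407.84 407.84 $", 200, 490.0, 498.0));
        chunks.add(new Chunk("Landing Fee", 40, 489.5, 497.5));

        for (String line : toLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

With these sample values the two headings end up on one line and the two row fragments on the next, each in left-to-right order. The same overlap rule also merges unrelated lines when text in side-by-side columns overlaps vertically.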

Unfortunately, though, these strategies fail if there are overlapping lines in different side-by-side columns, e.g. in your document the invoice recipient address and the information to its right.

You might either try to tweak the horizontal strategies (e.g. by also analyzing horizontal gaps separating columns) or try a combined approach, using the output of multiple strategies for the same document.
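As a rough idea of what analyzing horizontal gaps could look like, the following plain-Java sketch merges the horizontal extents of text chunks and treats any gap of at least minGap as a column separator. Everything here is hypothetical: the class, the method, the sample extents, and the threshold are illustrative assumptions, not part of iText or of the linked strategies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnGapFinder {

    /**
     * Merges horizontal extents {xStart, xEnd} into columns. Two extents belong
     * to the same column unless a gap of at least minGap separates them.
     */
    static List<double[]> findColumns(List<double[]> extents, double minGap) {
        List<double[]> sorted = new ArrayList<>(extents);
        sorted.sort(Comparator.comparingDouble((double[] e) -> e[0]));

        List<double[]> columns = new ArrayList<>();
        for (double[] e : sorted) {
            double[] last = columns.isEmpty() ? null : columns.get(columns.size() - 1);
            if (last != null && e[0] - last[1] < minGap) {
                // Close enough to the current column: widen it if needed.
                last[1] = Math.max(last[1], e[1]);
            } else {
                // Wide gap (or first extent): start a new column.
                columns.add(new double[] {e[0], e[1]});
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<double[]> extents = new ArrayList<>();
        // Made-up extents: three address lines on the left, two info lines on the right.
        extents.add(new double[] {40, 180});
        extents.add(new double[] {40, 150});
        extents.add(new double[] {40, 200});
        extents.add(new double[] {330, 500});
        extents.add(new double[] {330, 470});

        for (double[] column : findColumns(extents, 50)) {
            System.out.println(Arrays.toString(column));
        }
        // prints [40.0, 200.0] and then [330.0, 500.0]
    }
}
```

Once column ranges like these are known, a line-oriented strategy could be run separately on the chunks of each column. A table row whose cells are far apart would be split the same way, so the threshold and the region it is applied to need care.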

mkl answered Oct 21 '22 02:10