Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing PDF files (especially with tables) with PDFBox

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text to parse the result (String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):

+----------------------------------------------------------------+ | AIH | Value | Complexity                     | Financing       | |     |       | Medium | High | Not applicable | MAC/Other | FAE | +----------------------------------------------------------------+ | xyz | 12.43 | 12.34  |      |                | 12.34     |     | +----------------------------------------------------------------+ | abc | 1.56  |        | 1.56 |                |           | 1.56| +----------------------------------------------------------------+ 

Then I use PDFBox:

PDDocument document = PDDocument.load(pathToFile); PDFTextStripper s = new PDFTextStripper(); String content = s.getText(document); 

Those two lines of data would be extracted like this:

xyz 12.43 12.4312.43 abc 1.56 1.561.56 

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

like image 614
Matheus Moreira Avatar asked Jul 08 '10 12:07

Matheus Moreira


People also ask

Which is better iText or PDFBox?

One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot.

Is it possible to parse a PDF?

PDF files can be parsed with tabula-py, or tabula-java.


1 Answers

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then its a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.

Also, the reason you may not have spaces between text that is visually separated is that very often, a space character is not drawn by the PDF. Instead the text matrix is updated and a drawing command for 'move' is issued to draw the next character and a "space width" apart from the last one.

Good luck.

like image 180
purecharger Avatar answered Sep 20 '22 04:09

purecharger