Parsing PDF files (especially with tables) with PDFBox

Tags: java, parsing, pdf, tabular, pdfbox

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file's text as a String, which I then parse. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.34  |      |                | 12.34     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);

Those two lines of data would be extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

614

asked Jul 08 '10 12:07

Matheus Moreira

1 Answer

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions of your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
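The region lookup described above can be sketched in plain Java. This is only an illustration under stated assumptions: ColumnLocator, columnOf and toRow are hypothetical names (not part of PDFBox), the x positions of the vertical rules are assumed to have been collected already from the intercepted line segments, and each text fragment is assumed to come with the x coordinate where it was drawn. The coordinates in main are invented for the example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: maps x coordinates to table columns.
final class ColumnLocator {
    // Sorted x positions of the vertical rules that delimit the columns.
    private final double[] boundaries;

    ColumnLocator(double... verticalRuleXs) {
        boundaries = verticalRuleXs.clone();
        Arrays.sort(boundaries);
    }

    // Returns the zero-based column whose left and right rules enclose x,
    // or -1 if x lies outside the table.
    int columnOf(double x) {
        int last = boundaries.length - 1;
        if (last < 1 || x < boundaries[0] || x >= boundaries[last]) {
            return -1;
        }
        int col = 0;
        while (x >= boundaries[col + 1]) {
            col++;
        }
        return col;
    }

    // Builds one table row: fragment i goes into the column that contains xs[i].
    // Empty cells stay as empty strings, so the column meaning is preserved.
    String[] toRow(double[] xs, String[] texts) {
        String[] row = new String[Math.max(0, boundaries.length - 1)];
        Arrays.fill(row, "");
        for (int i = 0; i < xs.length; i++) {
            int col = columnOf(xs[i]);
            if (col >= 0) {
                row[col] += texts[i];
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Eight assumed rule positions give the seven columns of the example table.
        ColumnLocator table = new ColumnLocator(0, 40, 90, 140, 180, 260, 330, 370);
        String[] row = table.toRow(
                new double[] {5, 45, 95, 265},
                new String[] {"xyz", "12.43", "12.34", "12.34"});
        System.out.println(Arrays.toString(row));
    }
}
```

With a row built this way, the index of a non-empty cell tells you whether a number belongs to Medium, High, Not applicable, MAC/Other or FAE. The same idea applies to rows if you also keep the y positions of the horizontal rules.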

Also, the reason you may not have spaces between text that is visually separated is that, very often, a space character is not drawn by the PDF at all. Instead, the text matrix is updated and a 'move' command is issued, so the next character is drawn a "space width" apart from the previous one.
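One way to recover such missing spaces is to compare the horizontal gap between consecutive glyphs with the width of a space. The sketch below is a minimal illustration, not PDFBox code: GapSpacer and Glyph are hypothetical names, the glyphs are assumed to be on one line and already sorted left to right, and the threshold of half a space width is an arbitrary assumption you would tune for your documents.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: rebuilds spaces from glyph positions.
final class GapSpacer {
    // One drawn piece of text: its content, left x position and advance width.
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs in order, inserting a space whenever the gap between the end
    // of the previous glyph and the start of the next one is at least half of
    // spaceWidth (assumed threshold).
    static String join(List<Glyph> glyphs, double spaceWidth) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) >= spaceWidth * 0.5) {
                out.append(' ');
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two numbers drawn far apart with no space character between them.
        List<Glyph> line = Arrays.asList(
                new Glyph("1.56", 100, 20),
                new Glyph("1.56", 200, 20));
        System.out.println(join(line, 4));
    }
}
```

Inserting spaces this way only fixes the run-together numbers; it still does not say which column a number belongs to, so it complements the region approach rather than replacing it.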

Good luck.

180

answered Sep 20 '22 04:09

purecharger
