
PDF Data Extraction - Need Suggestions

I created a PDF extraction tool (sample screen attached). The user loads a PDF file and selects the data area he wants. I then record the PDF coordinates and page number and save them as a template. Once the user supplies a list of PDF files, the tool extracts data from each file according to the template. My tool is very similar to this.

The problem is that in some PDFs the portion of data to be extracted is shifted down the page or onto the next page. As an example, think of a bill listing the items you purchased: where "Total Value" is printed depends on the number of items you bought. With a long list the total lands near the bottom; otherwise it is in the middle or near the top.

Therefore I am now thinking about identifying the structure of the PDF instead of relying on coordinates.

But I don't have a clear idea of how to do that. Please share anything you think might help solve this problem. To repeat, I am trying to grab data from a PDF. So, is it possible to capture the structure of a PDF file?

My idea is that if I can identify the structure, I can say where the value is. For example, I tried converting the PDF into HTML and navigating through the HTML tags (body->div->table->td, etc.), but it wasn't successful. :(

yohan.jayarathna asked Apr 18 '26 03:04



2 Answers

PDF has only weak structure, nothing like divs or containers. There are layer groups and similar constructs, but coordinates are the only thing you can count on.

Try to describe the type of text and its margins from the left and right, so that your capture is independent of the page and of the vertical position.
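One way to read this suggestion is to anchor on a label such as "Total Value" rather than on a fixed page and rectangle. The sketch below is only an illustration under assumptions: it presumes some PDF library has already produced a list of words with page numbers and positions, and the `Word` type, `group_lines`, and `find_value_after_label` are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """Hypothetical word box, as a PDF text extractor might report it."""
    page: int    # 1-based page number
    x0: float    # distance of the word's left edge from the left margin
    y0: float    # distance of the word's top edge from the top of the page
    text: str


def group_lines(words: List[Word], y_tolerance: float) -> List[List[Word]]:
    """Cluster words into lines: same page, nearly the same vertical position."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.page, w.y0)):
        if (lines and lines[-1][0].page == w.page
                and abs(w.y0 - lines[-1][0].y0) <= y_tolerance):
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, read left to right.
    return [sorted(line, key=lambda w: w.x0) for line in lines]


def find_value_after_label(words: List[Word], label: str,
                           y_tolerance: float = 3.0) -> Optional[Tuple[int, str]]:
    """Return (page, text to the right of `label`), wherever the label appears."""
    tokens = label.split()
    for line in group_lines(words, y_tolerance):
        texts = [w.text for w in line]
        for i in range(len(texts) - len(tokens) + 1):
            if texts[i:i + len(tokens)] == tokens:
                rest = texts[i + len(tokens):]
                if rest:
                    return line[0].page, " ".join(rest)
    return None


# Made-up sample data: a short bill (total on page 1) and a long one (total on page 2).
short_bill = [
    Word(1, 50, 100, "Widget"), Word(1, 400, 100, "10.00"),
    Word(1, 50, 130.5, "Total"), Word(1, 85, 130, "Value"), Word(1, 400, 130, "10.00"),
]
long_bill = [
    Word(1, 50, 100, "Widget"), Word(1, 400, 100, "10.00"),
    Word(2, 50, 80, "Total"), Word(2, 85, 80, "Value"), Word(2, 400, 80, "250.00"),
]

print(find_value_after_label(short_bill, "Total Value"))
print(find_value_after_label(long_bill, "Total Value"))
```

The template would then store the label text (and perhaps an expected left margin for the value) instead of a page number and rectangle, so the same template still matches when the total moves down the page or onto the next one.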

p4553d answered Apr 21 '26 00:04



The PDF file format includes an optional set of structure tags (Tagged PDF). If these are used, the file will have some structure; otherwise you are out of luck. I wrote a blog post explaining how to find this out at http://www.jpedal.org/PDFblog/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
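As a rough illustration of that check (not the method from the blog post): in the PDF format, a tagged file's document catalog carries a `/StructTreeRoot` entry, and a `/MarkInfo` dictionary whose `/Marked` flag is true. The sketch below assumes your PDF library has already parsed the catalog into a dict-like object with indirect references resolved; the function names are hypothetical.

```python
from typing import Any, Mapping


def has_structure_tree(catalog: Mapping[str, Any]) -> bool:
    """True if the document catalog points at a logical structure tree."""
    return catalog.get("/StructTreeRoot") is not None


def is_marked_as_tagged(catalog: Mapping[str, Any]) -> bool:
    """True if the catalog's /MarkInfo dictionary sets /Marked to true."""
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked", False))


# Made-up catalogs standing in for what a PDF parser would return.
tagged_catalog = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged_catalog = {"/Type": "/Catalog"}

print(has_structure_tree(tagged_catalog), is_marked_as_tagged(tagged_catalog))
print(has_structure_tree(untagged_catalog), is_marked_as_tagged(untagged_catalog))
```

Even when both checks pass, the quality of the tags depends on the software that produced the file, so a coordinate-based or label-based fallback is probably still worth keeping.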

mark stephens answered Apr 21 '26 01:04



