
PDF table extraction

I have the same data saved as a GIF image and as a PDF file, and I want to convert it to HTML or XML. The data is the menu for my university's cafeteria, which means a new version of the file has to be parsed each week. In general, the files contain some header and footer text, with a table of data in between. I have read several posts on Stack Overflow and have already made some attempts to extract the table data as HTML/XML with the following tools:

PDF

  • PDFBox || iText (Java)
  • Google Docs Import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I got the best results from parsing the PDF file with PDFBox, but because the menu changes weekly, it is still not reliable enough. The HTML that I receive sometimes contains more and sometimes fewer paragraphs (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know whether there is another way to do it.

Vilius asked Apr 24 '12 15:04


2 Answers

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.

thadk answered Oct 26 '22 23:10


I have implemented my own algorithm, called traprange, to parse tabular data in PDF files.

Here are some sample PDF files and their results:

  1. Input file: sample-1.pdf, result: sample-1.html
  2. Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange or my article at traprange.
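To give a rough idea of how position-based table extraction can work, here is a minimal Python sketch. It is not the actual traprange code, only an illustration of one generic approach: project the bounding boxes of all text fragments onto the x and y axes, merge the overlapping ranges into columns and rows, and then drop each fragment into its cell. The fragment format `(x0, x1, y0, y1, text)`, the helper names, and the sample coordinates are all assumptions made for this example. It also assumes that y grows downward and that the fragments have already been extracted by a position-aware tool.

```python
from typing import List, Tuple

# Hypothetical fragment format: (x0, x1, y0, y1, text).
# Assumed to come from a position-aware text extractor; y grows downward.
Fragment = Tuple[float, float, float, float, str]
Range = Tuple[float, float]


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping (start, end) intervals into disjoint ones."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def index_of(ranges: List[Range], value: float) -> int:
    """Return the index of the interval that contains value."""
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return i
    raise ValueError(f"{value} is outside all ranges")


def build_table(fragments: List[Fragment]) -> List[List[str]]:
    """Group fragments into a rows-by-columns grid of cell strings."""
    cols = merge_ranges([(f[0], f[1]) for f in fragments])
    rows = merge_ranges([(f[2], f[3]) for f in fragments])
    table = [["" for _ in cols] for _ in rows]
    # Visit fragments top to bottom, left to right, so that text
    # sharing a cell is joined in reading order.
    for x0, _x1, y0, _y1, text in sorted(fragments, key=lambda f: (f[2], f[0])):
        r = index_of(rows, y0)
        c = index_of(cols, x0)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        (10, 50, 0, 10, "Monday"),
        (100, 140, 0, 10, "Pasta"),
        (10, 55, 20, 30, "Tuesday"),
        (100, 125, 20, 30, "Soup"),
        (128, 170, 20, 30, "of the day"),
    ]
    for row in build_table(sample):
        print(row)
```

Note that a header or footer line spanning the full page width would bridge the gaps between columns and merge them into one, so such lines would have to be filtered out before the ranges are computed.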

Tho answered Oct 26 '22 23:10