PDF table extraction

Question

I have the same data saved as a GIF image and as a PDF file, and I want to parse it into HTML or XML. The data is the menu for my university's cafeteria, which means a new version of the file has to be parsed every week. In general, the files contain some header and footer text, with a table full of data in between. I have read some posts on Stack Overflow and have already made some attempts to extract the table data as HTML/XML:

PDF

PDFBox || iText (Java)
Google Docs Import
PDF2HTML || PDF2Table

GIF

Tesseract-OCR

I got the best results from parsing the PDF file with PDFBox, but because the menu changes weekly it is still not reliable enough. The HTML I receive contains sometimes more and sometimes fewer paragraphs (<p>), so I cannot parse the data precisely enough.

That is why I would like to know whether there is another way to do it.

thadk · Accepted Answer

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
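
Since the question asks for HTML or XML, the CSV that Tabula produces still needs one more conversion step. Here is a minimal sketch in Java, assuming a simple CSV export with optional double-quoted fields and no line breaks inside fields. The class name CsvToHtml and its methods are made up for illustration and are not part of Tabula:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tabula: turns CSV text into an HTML table.
final class CsvToHtml {

    // Splits one CSV line into fields. Handles double-quoted fields and
    // doubled quotes ("") inside them; assumes no line breaks inside a field.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one <tr> per non-empty CSV line and one <td> per field.
    static String toHtml(String csv) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (String line : csv.split("\\r?\\n")) {
            if (line.isEmpty()) continue;
            html.append("  <tr>");
            for (String field : splitLine(line)) {
                html.append("<td>").append(escape(field.trim())).append("</td>");
            }
            html.append("</tr>\n");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a weekly menu export.
        String csv = "Day,Dish,Price\nMonday,\"Pasta, tomato sauce\",2.50\n";
        System.out.println(toHtml(csv));
    }
}
```

For anything beyond simple exports, a real CSV parsing library is probably a safer choice than this hand-rolled splitter.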

Tho · Answer

I have implemented my own algorithm, named traprange, to parse tabular data in PDF files.

Here are some sample PDF files and their results:

Input file: sample-1.pdf, result: sample-1.html
Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange, or my article at traprange.
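
The answer does not describe how traprange works internally. As a rough sketch of one general idea for finding table columns, and explicitly an assumption rather than the project's actual code: take the horizontal extent of every text chunk on the page, merge the extents that overlap, and treat each merged range as a candidate column. The ColumnFinder and Range names below are hypothetical, and in practice the chunk coordinates would have to come from a PDF library such as PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not taken from the traprange project.
final class ColumnFinder {

    // Horizontal extent of one text chunk, in page units.
    static final class Range {
        final double start;
        final double end;

        Range(double start, double end) {
            this.start = start;
            this.end = end;
        }
    }

    // Merges overlapping extents; each merged range is a candidate column.
    static List<Range> mergeRanges(List<Range> chunks) {
        List<Range> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble(r -> r.start));
        List<Range> columns = new ArrayList<>();
        for (Range r : sorted) {
            Range last = columns.isEmpty() ? null : columns.get(columns.size() - 1);
            if (last != null && r.start <= last.end) {
                // Overlaps the current column: widen it if needed.
                columns.set(columns.size() - 1,
                        new Range(last.start, Math.max(last.end, r.end)));
            } else {
                // Gap before this chunk: start a new column.
                columns.add(r);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up chunk extents for a three-column table.
        List<Range> chunks = List.of(
                new Range(10, 60), new Range(12, 48),    // first column, two rows
                new Range(100, 150), new Range(95, 140), // second column, two rows
                new Range(200, 230));                    // third column, one row
        for (Range c : mergeRanges(chunks)) {
            System.out.println(c.start + " .. " + c.end);
        }
    }
}
```

Lines that span the whole page width, such as the header and footer text mentioned in the question, would merge all columns into one, so they would need to be filtered out before this step.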

Tags: pdf, extraction, pdfbox

Vilius
