I have the same data saved as a GIF image file and as a PDF file, and I want to parse it to HTML or XML. The data is the menu for my university's cafeteria, which means a new version of the file has to be parsed each week. In general, the files contain some header and footer text, with a table of data in between. I have read some posts on Stack Overflow and have made a few attempts to parse the table data out as HTML/XML:
GIF
I got the best result from parsing the PDF file with PDFBox, but as the menu changes weekly, it is still not reliable enough. The HTML that I receive sometimes includes more and sometimes fewer paragraphs (<p>), so I am not able to parse the data precisely enough.
That is why I would like to know whether there is another way to do it.
Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
I have implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.
Following are some sample pdf files and results:
Visit my project page at traprange or my article at traprange.
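The answer above does not describe how the algorithm works, so the following is only a rough sketch of one generic way to find table columns, not traprange's actual code. It assumes you have already collected the horizontal extent (start x, end x) of every text chunk on the page, for example from the text positions a PDFBox text stripper reports, and it merges overlapping extents into column ranges. The class name `ColumnRanges` and the method `merge` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: not part of PDFBox, Tabula, or traprange.
public class ColumnRanges {

    /**
     * Merges overlapping [startX, endX] ranges of text chunks.
     * Each merged range is treated as one candidate table column.
     * The input list and its arrays are not modified.
     */
    public static List<double[]> merge(List<double[]> ranges) {
        // Sort a copy of the ranges by their left edge.
        List<double[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingDouble(r -> r[0]));

        List<double[]> merged = new ArrayList<>();
        for (double[] r : sorted) {
            if (!merged.isEmpty() && r[0] <= merged.get(merged.size() - 1)[1]) {
                // Overlaps the previous range: extend it to the right if needed.
                double[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], r[1]);
            } else {
                // No overlap: start a new column range.
                merged.add(new double[] {r[0], r[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up chunk extents from two rows of a two-column table.
        List<double[]> chunks = List.of(
            new double[] {50, 120}, new double[] {200, 260},
            new double[] {55, 110}, new double[] {195, 270});
        for (double[] column : merge(chunks)) {
            System.out.println(column[0] + " - " + column[1]);
        }
    }
}
```

With the made-up extents in `main`, the four chunks collapse into two column ranges, 50 to 120 and 195 to 270. The same merge applied to vertical extents would give candidate rows. Because the ranges are derived from where the text sits rather than from paragraph tags, this kind of approach may be less sensitive to the weekly variation in the number of paragraphs.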