What is the best way to extract tables that are embedded in PDF documents?
I am not interested in solutions that work only with JRuby, or that rely on third-party APIs or websites.
Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?
I'm sure someone has had the same problem before :) I appreciate your help!
There are a variety of methods you can use to extract tables from a PDF file and use them in your spreadsheets. You can use Excel and Power BI to extract and import tables from PDF into your spreadsheet as formatted tables. Alternatively, you can also use Adobe Acrobat DC to export your PDF as an Excel workbook file.
The tabula-py library is a Python tool for extracting tables from PDFs, and it works on Mac, Windows and Linux. It is a simple wrapper around tabula-java and lets you extract tables from a PDF into CSV, TSV or JSON files.
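Since tabula-py only wraps tabula-java, the same engine can in principle be called from plain Ruby by shelling out to the jar. The following is a rough sketch, not a tested recipe: it assumes a Java runtime is installed and that the tabula-java jar has been downloaded locally. The `TABULA_JAR` environment variable and the method names are made up for illustration; the `-p` (pages) and `-f` (format) options are the ones listed in the tabula-java command-line help.

```ruby
require "csv"
require "open3"

# Hypothetical location of the tabula-java jar; point this at your download.
TABULA_JAR = ENV.fetch("TABULA_JAR", "tabula.jar")

# Build the argument list for the tabula-java command line.
# -p selects pages ("all" or e.g. "1-3"), -f selects the output format.
def tabula_command(pdf_path, pages: "all", jar: TABULA_JAR)
  ["java", "-jar", jar, "-p", pages.to_s, "-f", "CSV", pdf_path]
end

# Parse CSV text (as printed by tabula-java) into an array of rows.
def parse_table_csv(csv_text)
  CSV.parse(csv_text)
end

# Run tabula-java on a PDF and return the extracted rows.
# Raises if the java process exits with a non-zero status.
def extract_tables(pdf_path, pages: "all")
  stdout, stderr, status = Open3.capture3(*tabula_command(pdf_path, pages: pages))
  raise "tabula failed: #{stderr}" unless status.success?
  parse_table_csv(stdout)
end
```

Calling `extract_tables("report.pdf")` would return an array of arrays, one per table row. This runs on regular MRI Ruby, but it does add Java as a system dependency.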
As its name suggests, Docparser is a parsing app. It is a cloud-based application that extracts not only tables but any kind of data from PDFs, scanned images and other documents.
You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.
Check out this gem, I think it's what you're looking for: the pdf-reader gem.
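Here is a minimal sketch of how pdf-reader could be used for this. Note that pdf-reader extracts text, not tables, so the column splitting below is only a heuristic I'm assuming for illustration: it treats any run of two or more whitespace characters as a column separator. The method names are made up; only `PDF::Reader.new`, `pages` and `page.text` come from the gem.

```ruby
# Split one line of extracted text into cells wherever two or more
# whitespace characters appear in a row (a heuristic, not a PDF feature).
def split_row(line)
  line.strip.split(/\s{2,}/)
end

# Turn a block of plain text into rows of cells, keeping only the lines
# that split into at least min_columns cells.
def rows_from_text(text, min_columns: 2)
  text.each_line
      .map { |line| split_row(line) }
      .select { |cells| cells.length >= min_columns }
end

# Read every page of the PDF with pdf-reader and collect candidate table rows.
def extract_table_rows(pdf_path, min_columns: 2)
  require "pdf-reader" # gem install pdf-reader
  reader = PDF::Reader.new(pdf_path)
  reader.pages.flat_map { |page| rows_from_text(page.text, min_columns: min_columns) }
end

# Usage: ruby extract.rb some_file.pdf
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  extract_table_rows(ARGV[0]).each { |row| puts row.join(" | ") }
end
```

This tends to work only for simple, well-aligned tables. Cells that contain wide gaps, or tables whose columns are separated by a single space, will need a smarter rule, for example splitting on known column positions.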