 

Extracting Tables from PDF files in Ruby

What is the best way to extract tables that are embedded in PDF documents?

I am not interested in solutions that work only with JRuby, or that rely on third-party APIs or websites.

Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

Tilo asked Jan 28 '17 19:01




2 Answers

You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.

theterminalguy answered Oct 10 '22 09:10


Check out this gem, I think it's what you're looking for: the pdf-reader gem.
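As a rough sketch of how pdf-reader might be used here (it is a pure-Ruby gem, so it does not need JRuby or an external service): `PDF::Reader#pages` and `Page#text` are part of the gem, and the text they return tries to preserve the page layout. The helper names `rows_from_text` and `extract_tables` are hypothetical, and the rule "columns are separated by two or more spaces" is only an assumption that may not hold for every PDF.

```ruby
# Hypothetical helper: turn layout-preserving page text into table rows.
# Assumption: cells in a row are separated by two or more spaces.
def rows_from_text(text)
  text.each_line
      .map(&:strip)                          # drop leading/trailing whitespace
      .reject(&:empty?)                      # skip blank lines
      .map { |line| line.split(/\s{2,}/) }   # split into candidate cells
      .select { |cells| cells.size > 1 }     # keep only multi-column lines
end

# Hypothetical helper: one array of candidate rows per page of the PDF.
def extract_tables(path)
  require 'pdf/reader' # gem install pdf-reader
  PDF::Reader.new(path).pages.map { |page| rows_from_text(page.text) }
end
```

Calling `extract_tables("report.pdf")` (the file name is a placeholder) would return an array of pages, each holding arrays of cell strings. Tables with empty cells, wrapped text, or single-space column gaps will likely need a smarter heuristic.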

Zach Tuttle answered Oct 10 '22 07:10