
Extract / Identify Tables from PDF python [closed]

Are there any open source libraries that support table identification & extraction?

By this I mean:

  1. Identify that a table structure exists
  2. Classify the table from its contents
  3. Extract data from the table in a useful output format e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

  • PDFMiner, which addresses problem 3, but it seems the user has to tell PDFMiner where each table is located (correct me if I'm wrong)
  • pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all the tables in my PDFs are separated by whitespace!

At the moment I think I would have to spend a lot of time developing a machine learning solution to identify table structures in PDFs, so any alternative approaches would be more than welcome!

Alexander McFarlane asked Feb 16 '15 00:02





3 Answers

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms, I found a solution so simple it makes you want to cry!

I hope you are using Linux:

pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, and it is trivial to format that into a CSV etc.
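For what it's worth, the "format into a CSV" step could look something like the sketch below. It assumes pdftotext (shipped with Poppler and Xpdf) is on your PATH and that columns are separated by at least two spaces, which will not hold for every PDF. The function names are made up for illustration.

```python
import csv
import io
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def layout_text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces.

    Assumption: a single space never separates two columns.
    """
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Something like `rows_to_csv(layout_text_to_rows(pdf_to_layout_text("NAME_OF_PDF.pdf")))` would then give you the CSV text. Rows with empty cells will come out with fewer columns than the others, so ragged tables still need some manual care.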

It is at times like this that I love Linux: these guys came up with AMAZING solutions to everything, and put them out there for FREE!

Ike answered Oct 18 '22 20:10



You should definitely have a look at this answer of mine:

  • Extracting table contents from a collection of PDF files

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.

Kurt Pfeifle answered Oct 18 '22 22:10



I'd just like to add to the very helpful answer from Kurt Pfeifle: there is now a Python wrapper for Tabula, and it seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF tables to pandas DataFrames. You can also restrict extraction to an area given in x,y co-ordinates, which is very handy for irregular data.
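A rough sketch of how that might look, assuming a recent tabula-py (where `read_pdf` returns a list of DataFrames) and a Java runtime on the machine. As far as I know the `area` option expects `[top, left, bottom, right]` in PDF points measured from the top-left of the page, so the hypothetical `box_to_area` helper just reorders an x,y box into that form. Both function names are mine, not part of tabula-py.

```python
def box_to_area(x, y, width, height):
    """Convert an x,y box (origin at the top-left of the page, in PDF points)
    into the [top, left, bottom, right] order used by tabula's area option."""
    return [y, x, y + height, x + width]


def read_tables(pdf_path, area=None, pages="all"):
    """Return the tables tabula-py finds, as a list of pandas DataFrames."""
    # Imported lazily: needs `pip install tabula-py` and a Java runtime.
    import tabula

    options = {"pages": pages}
    if area is not None:
        options["area"] = area  # limit extraction to one region of the page
    return tabula.read_pdf(pdf_path, **options)
```

For example, `read_tables("report.pdf", area=box_to_area(50, 100, 400, 200))` would only look at a 400 x 200 point box whose top-left corner is at x=50, y=100 (the file name is just a placeholder).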

Ricky McMaster answered Oct 18 '22 22:10
