Are there any open source libraries that support table identification & extraction?
By this I mean:
I have looked through similar questions on this topic and found the following:
Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!
Online PDF to Excel converters such as smallpdf and cometdocs, among others, offer basic PDF table extraction. Nanonets also offers a free PDF to Excel converter. These simple utility tools are free to use, but might require a mandatory sign-up. Just upload a PDF and download the output.
After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!
I hope you are using Linux (pdftotext ships with Poppler and Xpdf, which are also available on other platforms):
pdftotext -layout NAME_OF_PDF.pdf
AMAZING!!
Now you have a nice text file with all the information lined up in neat columns, and it is straightforward to convert it into a CSV, etc.
It is at times like this that I love Linux. These guys came up with AMAZING solutions to everything, and put them out there for FREE!
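As a rough sketch of that last step (not part of the original answer): the Python below assumes that columns in the `-layout` output are separated by runs of at least two spaces and that no cell is empty. The function names `pdf_to_layout_text`, `layout_text_to_rows` and `write_csv` are made up for illustration. Real tables with blank cells or wrapped text will probably need splitting on fixed column positions instead.

```python
import csv
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output (pdftotext must be on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" writes to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def layout_text_to_rows(text, min_gap=2):
    """Split each non-blank line into cells wherever min_gap or more spaces occur.

    Assumption: single spaces belong inside a cell, wider gaps separate columns.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


def write_csv(rows, csv_path):
    """Write the parsed rows to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Usage would be something like `write_csv(layout_text_to_rows(pdf_to_layout_text("NAME_OF_PDF.pdf")), "out.csv")`. Expect to filter out headers, footers and other non-table lines by hand.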
You should definitely have a look at this answer of mine:
and also have a look at all the links included therein.
Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.
I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py
This will convert your PDF table to a pandas DataFrame. You can also restrict extraction to an area given in page coordinates, which is very handy for irregular data.
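A minimal sketch of how that might look, assuming a recent tabula-py (where `read_pdf` returns a list of DataFrames) and a Java runtime on the machine. The helper names `box_to_area` and `read_table` are hypothetical. As far as I know, tabula-py's `area` option expects `[top, left, bottom, right]` in PDF points, so the helper converts from an x/y/width/height box:

```python
def box_to_area(x, y, width, height):
    """Convert an (x, y, width, height) box in PDF points, measured from the
    top-left of the page, to the [top, left, bottom, right] order used by
    tabula-py's `area` option."""
    return [y, x, y + height, x + width]


def read_table(pdf_path, page=1, box=None):
    """Read the tables on one page, optionally limited to a box (hypothetical helper)."""
    # Imported here so box_to_area stays usable without tabula-py installed.
    import tabula  # pip install tabula-py; requires Java

    kwargs = {"pages": page}
    if box is not None:
        kwargs["area"] = box_to_area(*box)
    # In recent tabula-py versions this returns a list of pandas DataFrames,
    # one per table detected on the page.
    return tabula.read_pdf(pdf_path, **kwargs)
```

For example, `read_table("NAME_OF_PDF.pdf", page=2, box=(50, 100, 400, 200))` would look only at the region starting 50 pt from the left and 100 pt from the top of page 2. The coordinates here are placeholders, so measure your own.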