Extract / Identify Tables from PDF python [closed]

Tags: python, pdf, pdf-scraping, scrape, pdf-parsing

Are there any open source libraries that support table identification & extraction?

By this I mean:

1. Identify that a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON or CSV

I have looked through similar questions on this topic and found the following:

PDFMiner, which addresses problem 3, but it seems the user has to tell PDFMiner where each table is located (correct me if I'm wrong)
pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all tables in my PDFs are separated by whitespace!

Currently, I think I would have to spend a lot of time developing a machine learning solution to identify table structures in PDFs. Any alternative approaches would therefore be more than welcome!


asked Feb 16 '15 00:02

Alexander McFarlane

3 Answers

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!

I hope you are using Linux:

pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, and it is trivial to format it into a CSV or similar.
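For anyone who wants to see the "format into a CSV" step spelled out, here is a minimal sketch. It assumes the columns in the `pdftotext -layout` output are separated by at least two spaces and that no cell is empty, which will not hold for every PDF. The function names, the `min_gap` parameter and the sample text are made up for illustration; `pdftotext` itself comes from the Poppler/Xpdf tools, and passing `-` as the output file sends the text to stdout.

```python
import csv
import io
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def layout_text_to_rows(text, min_gap=2):
    """Split every non-empty line on runs of `min_gap` or more spaces.

    Assumption: a gap of two or more spaces marks a column boundary,
    while single spaces belong to the cell text.
    """
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(splitter.split(stripped))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample of what -layout output can look like.
sample = (
    "Name          Qty    Price\n"
    "Widget A        2     3.50\n"
    "\n"
    "Widget B       10    12.00\n"
)
rows = layout_text_to_rows(sample)
print(rows_to_csv(rows))
```

With a real file you would feed `pdf_to_layout_text("NAME_OF_PDF.pdf")` into `layout_text_to_rows` instead of the sample string. Tables with empty cells, or columns separated by a single space, need something smarter, such as splitting at fixed character positions.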

It is at times like this that I love Linux: these guys came up with AMAZING solutions to everything, and put them out there for FREE!


answered Oct 18 '22 20:10

Ike

You should definitely have a look at this answer of mine:

Extracting table contents from a collection of PDF files

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.

answered Oct 18 '22 22:10

Kurt Pfeifle

I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a pandas DataFrame. You can also restrict extraction to an area given in x,y co-ordinates, which is very handy for irregular data.
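A rough sketch of what that can look like is below. As far as I understand the tabula-py documentation, `read_pdf` takes `area` as `[top, left, bottom, right]` in points, so the hypothetical helper `bbox_to_tabula_area` just reorders an ordinary (x1, y1, x2, y2) box into that order. `read_tables` and the example file name are also illustrative, and tabula-py needs Java installed to run.

```python
def bbox_to_tabula_area(x1, y1, x2, y2):
    """Reorder an (x1, y1, x2, y2) box into [top, left, bottom, right].

    Assumption: y grows downwards from the top of the page, which is
    the convention tabula uses for its area option.
    """
    top, bottom = sorted((y1, y2))
    left, right = sorted((x1, x2))
    return [top, left, bottom, right]


def read_tables(pdf_path, bbox=None, pages="all"):
    """Return the tables tabula finds as a list of pandas DataFrames."""
    # Imported lazily so the helper above works without tabula-py or Java.
    import tabula

    kwargs = {"pages": pages, "multiple_tables": True}
    if bbox is not None:
        kwargs["area"] = bbox_to_tabula_area(*bbox)
    return tabula.read_pdf(pdf_path, **kwargs)


area = bbox_to_tabula_area(50, 100, 550, 400)
print(area)  # [100, 50, 400, 550]

# Hypothetical usage, assuming a file called report.pdf exists:
# tables = read_tables("report.pdf", bbox=(50, 100, 550, 400), pages=1)
# tables[0].to_csv("table.csv", index=False)
```

Leaving `bbox` out lets tabula guess where the tables are, which is closer to what the question asks for in problem 1.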

answered Oct 18 '22 22:10

Ricky McMaster
