I searched quite a bit but couldn't find a solution for this kind of problem, so I'm posting a clear question about it. Most answers cover image or text extraction, which is comparatively easy.

I need to extract tables as text (CSV) and graphs as images from PDFs. Can anyone help me with efficient Python 3.6 code for this?

So far I have managed to extract JPEGs by scanning for startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly on them.

For example, I want to extract the table on page 11 and the graphs on page 12, as images or in whatever form is feasible, from the PDF linked below. How should I go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf
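
For reference, the marker-scanning approach mentioned in the question can be sketched roughly as follows. This is not the asker's actual code; the function name `find_jpeg_streams` and the constant names are made up for illustration. It only finds byte runs that sit between a JPEG start marker and the next end marker, which is why it misses tables drawn as text, charts drawn as vector graphics, and images stored with other encodings.

```python
JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def find_jpeg_streams(data):
    """Return each byte run that begins with the start marker and ends at the next end marker.

    Hypothetical helper for illustration only. It knows nothing about PDF
    structure, so anything that is not stored as a raw JPEG stream is skipped.
    """
    streams = []
    position = 0
    while True:
        start = data.find(JPEG_START, position)
        if start == -1:
            break
        end = data.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break
        end += len(JPEG_END)  # include the end marker itself
        streams.append(data[start:end])
        position = end
    return streams
```

Each returned item could be written to disk with a `.jpg` extension after reading the PDF with `open(path, "rb").read()`. A naive scan like this can also cut an image short if the end-marker bytes happen to occur inside the image data.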

For extracting tables, you can use Camelot. Here is an article about it. For images, I found this question and answer: <a href="https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python">Extract images from PDF without resampling, in python?</a>
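
A minimal sketch of the Camelot suggestion, under a few assumptions: the report has been downloaded locally as `annual-report-2017.pdf`, the package is installed with `pip install camelot-py`, and the PDF is text-based (Camelot does not read scanned pages). The function name `extract_tables_to_csv`, the output folder, and the file-name pattern are hypothetical choices for illustration, not part of Camelot.

```python
from pathlib import Path


def extract_tables_to_csv(pdf_path, page, out_dir="tables", flavor="lattice"):
    """Write every table Camelot detects on one page to its own CSV file.

    Hypothetical helper. flavor="lattice" expects ruled table lines;
    flavor="stream" relies on whitespace between columns instead.
    """
    import camelot  # third-party package, imported lazily

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Camelot takes the page selection as a string, e.g. "11" or "1,3-5".
    tables = camelot.read_pdf(str(pdf_path), pages=str(page), flavor=flavor)

    written = []
    for index in range(tables.n):  # tables.n is the number of detected tables
        target = out / "page{}_table{}.csv".format(page, index + 1)
        # Each detected table exposes its cells as a pandas DataFrame in .df
        tables[index].df.to_csv(str(target), index=False, header=False)
        written.append(target)
    return written
```

Calling `extract_tables_to_csv("annual-report-2017.pdf", page=11)` would return the paths of the CSV files it wrote. If nothing is detected or the columns come out merged, trying `flavor="stream"` is a reasonable next step, since which flavor works depends on how the table is drawn in the PDF.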

How to extract charts/tables/graphs from PDF files using Python?

answered Oct 20 '22 11:10

milonimrod


Tags: python, pdf, extract, python-3.6, ocr

Aakash Basu
