Searched quite a bit but as I couldn't find a solution for this kind of problem, hence posting a clear question on the same. Most answers cover image/text extraction which are comparatively easier.
I've a requirement of extracting tables and graphs as text (csv) and images respectively from PDFs.
Can anyone help me with an efficient python 3.6 code to solve the same?
Till now I could achieve extracting jpgs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain jpgs, hence my code fails badly in achieving that.
Example, I want to extract table from page 11 and graphs from page 12 as image or something which is feasible from the below given link. How to go about it?
https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf
There are a variety of methods you can use to extract tables from a PDF file and use them in your spreadsheets. You can use Excel and Power BI to extract and import tables from PDF into your spreadsheet as formatted tables. Alternatively, you can also use Adobe Acrobat DC to export your PDF as an Excel workbook file.
For extracting tables you can use camelot
Here is an article about it.
For images I've found this question and answer Extract images from PDF without resampling, in python?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With