
How to extract charts/tables/graphs from PDF files using Python?

I searched quite a bit but couldn't find a solution for this kind of problem, so I'm posting a specific question about it. Most answers cover image or text extraction, which is comparatively easy.

I need to extract tables as text (CSV) and graphs as images from PDFs.

Can anyone help me with efficient Python 3.6 code for this?

So far I can extract JPEGs by scanning for startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly on those.
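For reference, the marker scan described above might look roughly like the sketch below. It is not the asker's actual code: the function name `extract_jpegs` is hypothetical, and the logic simply pairs each start marker with the next end marker, so it would miss anything that is not stored as a raw JPEG stream.

```python
from typing import List

START_MARK = b"\xff\xd8"  # JPEG start-of-image marker
END_MARK = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(data: bytes) -> List[bytes]:
    """Return each byte run from a start marker to the next end marker."""
    images = []
    pos = 0
    while True:
        start = data.find(START_MARK, pos)
        if start < 0:
            break  # no further start marker
        end = data.find(END_MARK, start + len(START_MARK))
        if end < 0:
            break  # start marker without a matching end marker
        end += len(END_MARK)  # include the end marker in the slice
        images.append(data[start:end])
        pos = end
    return images
```

To try it on a file, read the PDF in binary mode, for example `extract_jpegs(open("annual-report-2017.pdf", "rb").read())`, and write each returned chunk to its own `.jpg` file.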

For example, I want to extract the table on page 11 and the graphs on page 12 of the PDF linked below, as images or in whatever form is feasible. How should I go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf

Aakash Basu asked Apr 29 '19 08:04




1 Answer

For extracting tables, you can use Camelot.

Here is an article about it.
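As a rough illustration of the Camelot suggestion, the sketch below writes each table detected on one page to a CSV file. It assumes `camelot-py` is installed; the helper names `csv_path_for` and `tables_to_csv` and the output naming scheme are hypothetical, and `flavor="lattice"` is only a guess that the table has ruled lines (Camelot's `"stream"` flavor is the alternative for whitespace-separated tables).

```python
from pathlib import Path


def csv_path_for(pdf_path: str, page: int, index: int, out_dir: str = ".") -> Path:
    """Build an output name such as report-p11-t0.csv (naming is an assumption)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}-p{page}-t{index}.csv"


def tables_to_csv(pdf_path: str, page: int, out_dir: str = ".") -> list:
    """Write every table Camelot detects on one page to its own CSV file."""
    import camelot  # third-party (camelot-py); imported lazily on first call

    # Camelot takes page numbers as a string, e.g. "11" or "1,3-5".
    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="lattice")
    written = []
    for index in range(tables.n):  # tables.n is the number of detected tables
        target = csv_path_for(pdf_path, page, index, out_dir)
        tables[index].to_csv(str(target))
        written.append(target)
    return written
```

For the report in the question, a call such as `tables_to_csv("annual-report-2017.pdf", 11)` would be the starting point. Whether the table on that page is detected cleanly depends on its layout, so checking `tables[index].df` before trusting the CSV is advisable.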

For images, I found this question and its answers: "Extract images from PDF without resampling, in python?"
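One way to pull embedded images out without resampling is sketched below using PyMuPDF. This is an assumption about tooling rather than a summary of the linked answers, and the names `image_name` and `extract_embedded_images` are hypothetical. Note that charts drawn as vector graphics are not stored as embedded images, so this approach would not find them; rendering the whole page to a bitmap is a possible fallback in that case.

```python
from pathlib import Path


def image_name(page_number: int, xref: int, ext: str) -> str:
    """Build a file name such as page12-xref7.png (naming is an assumption)."""
    return f"page{page_number}-xref{xref}.{ext}"


def extract_embedded_images(pdf_path: str, page_number: int, out_dir: str = ".") -> list:
    """Save every image embedded on a 1-based page in its stored format."""
    import fitz  # PyMuPDF, third-party; imported lazily on first call

    written = []
    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number - 1]  # PyMuPDF indexes pages from 0
        for info in page.get_images(full=True):
            xref = info[0]  # first tuple item is the image's cross-reference number
            image = doc.extract_image(xref)  # dict holding raw bytes and extension
            target = Path(out_dir) / image_name(page_number, xref, image["ext"])
            target.write_bytes(image["image"])
            written.append(target)
    finally:
        doc.close()
    return written
```

Because the bytes are written exactly as stored in the PDF, the images keep their original resolution and encoding.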

milonimrod answered Oct 20 '22 11:10
