Extracting entire PDF data with Python pdfminer

I am using pdfminer to extract data from PDF files with Python. I would like to extract all the data in a PDF, regardless of whether it is an image, text, or anything else. Can we do that in a single line (or two if needed, without much work)? Any help is appreciated. Thanks in advance.

sunil reddy asked Jan 14 '23 04:01


1 Answer

Can we do that in a single line (or two if needed, without much work)?

No, you cannot. Pdfminer is powerful but it's rather low-level.

Unfortunately, the documentation is not exactly exhaustive. I was able to find my way around it thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

See also this answer, where I give a little more detail.
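
To give a rough idea of what "low-level" means in practice, here is a minimal sketch of walking the layout tree yourself. It assumes the pdfminer.six fork and its `extract_pages` helper; `collect` and `extract_all` are hypothetical names of my own, not part of pdfminer or of layout_scanner.py. It only gathers text and image objects, so treat it as a starting point, not a complete extractor.

```python
def collect(element, texts, images):
    """Recursively sort a layout element into text pieces and image objects."""
    if hasattr(element, "get_text"):
        # Text boxes, text lines and single characters all expose get_text().
        texts.append(element.get_text())
    elif hasattr(element, "stream"):
        # Image objects (LTImage) carry their raw data in a .stream attribute.
        images.append(element)
    elif hasattr(element, "__iter__"):
        # Containers such as pages and figures are iterable: descend into them.
        for child in element:
            collect(child, texts, images)
    # Anything else (lines, rectangles, curves) is ignored in this sketch.


def extract_all(pdf_path):
    """Return (all_text, image_objects) for the PDF at pdf_path."""
    # Imported here so that collect() can be tried without pdfminer installed.
    from pdfminer.high_level import extract_pages  # assumes pdfminer.six

    texts, images = [], []
    for page in extract_pages(pdf_path):
        collect(page, texts, images)
    return "".join(texts), images
```

Calling `extract_all("some_file.pdf")` (the file name is a placeholder) would return the concatenated text plus a list of image layout objects. Turning those image objects into actual image files is a separate step that this sketch does not attempt.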

alexis answered Jan 21 '23 10:01