
Extract text from PDF

Tags:

python

pdf

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, all formatting is lost and the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?

Thanks.

Mridang Agarwalla asked Jun 30 '10 11:06



1 Answer

Unless a PDF contains structured content, it does not store tables as such; the text is simply drawn at positions on the page. Some tools include heuristics that try to guess the data structure and put it back together. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
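
To illustrate the kind of heuristic such tools rely on, here is a minimal sketch in plain Python. It is not taken from the linked article or from any particular library. It assumes the words have already been extracted together with their page coordinates as `(x, y, text)` tuples, with `y` increasing down the page; that input format, the function name `rebuild_rows`, and the `y_tolerance` parameter are all hypothetical choices made for this example.

```python
def rebuild_rows(words, y_tolerance=2.0):
    """Group positioned words into rows, then order each row left to right.

    `words` is a list of (x, y, text) tuples. This input format is an
    assumption for illustration, not the output of a specific PDF library.
    """
    rows = []  # each entry: [y of the row's first word, [(x, text), ...]]
    # Visit words top to bottom, then left to right.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat it as the same table row.
            rows[-1][1].append((x, text))
        else:
            # Otherwise start a new row anchored at this y position.
            rows.append([y, [(x, text)]])
    # Sort the cells of every row by x and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical words from a two-column table, in scrambled order.
sample = [
    (120, 10.5, "Qty"),
    (10, 10.0, "Item"),
    (10, 30.0, "Widget"),
    (120, 30.4, "3"),
]
for row in rebuild_rows(sample):
    print("\t".join(row))
```

Real documents usually need more than this: columns have to be inferred from the x positions, and a fixed tolerance breaks down when font sizes vary, which is part of why extracted tables so often come out jumbled.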

mark stephens answered Oct 06 '22 01:10