 

pyPdf unable to extract text from some pages in my PDF

Tags: python, pdf

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:

http://www.4shared.com/document/kmJF67E4/forms.html

If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?

from pyPdf import PdfFileReader

# Open the PDF in binary mode and print the text of each page
reader = PdfFileReader(open("forms.pdf", "rb"))
for page in reader.pages:
    print page.extractText()
DrJAKing asked Nov 17 '10 10:11


3 Answers

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Since it is the text you want, you can use the pdftotext command-line tool (part of Xpdf/Poppler, available on most Linux distributions).

To invoke that using Python, you can do this:

>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])

The text is extracted from forms.pdf and saved to a file named output.

This works in the case of your PDF file and extracts the text you want.
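If you need the text page by page, as in the loop from the question, something along these lines may work. It is only a sketch, written for Python 3 and assuming pdftotext is on your PATH. The helper names build_pdftotext_cmd and extract_page_text are made up for illustration. The -f and -l options select the first and last page to convert, and passing - as the output name sends the text to stdout instead of a file.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page):
    """Build a pdftotext command that writes one page (1-based) to stdout."""
    # -f / -l select the first and last page to convert; "-" means stdout.
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_page_text(pdf_path, page):
    """Run pdftotext for a single page and return its text (hypothetical helper)."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_page_text("forms.pdf", 1) would then return whatever pdftotext finds on the first page.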

user225312 answered Nov 15 '22 01:11


You could also try the pdfminer library (also written in Python) and see if it's better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.
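To check which pages come out empty, a rough sketch like the following might help. It assumes Python 3 and the pdfminer.six fork (the maintained successor of pdfminer), whose high-level API provides extract_pages. The function names page_texts and empty_pages are hypothetical, chosen just for this example.

```python
def page_texts(pdf_path):
    """Yield the text of each page, using pdfminer.six (assumed installed)."""
    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        # Join the text of every text container found on the page.
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )


def empty_pages(texts):
    """Return 1-based numbers of pages whose text is empty or whitespace."""
    return [n for n, text in enumerate(texts, start=1) if not text.strip()]
```

With that in place, empty_pages(page_texts("forms.pdf")) would list the pages from which no text was extracted, which makes it easy to compare against what pyPdf returns.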

Steven answered Nov 14 '22 23:11


I sometimes find it useful to convert the file to PostScript (try both pdf2ps and pdftops, as their output can differ), then back to PDF (ps2pdf). Then try your original script again.
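If you want to script that round trip, here is a minimal sketch in Python 3. It assumes pdftops (or pdf2ps) and ps2pdf are installed and on your PATH, and that each is called as "tool input output". The function names roundtrip_commands and roundtrip are invented for this example.

```python
import subprocess


def roundtrip_commands(src_pdf, ps_path, dst_pdf, to_ps="pdftops"):
    """Return the two commands for a PDF -> PostScript -> PDF round trip."""
    # to_ps may be "pdftops" or "pdf2ps"; both take input and output paths.
    return [
        [to_ps, src_pdf, ps_path],
        ["ps2pdf", ps_path, dst_pdf],
    ]


def roundtrip(src_pdf, ps_path, dst_pdf, to_ps="pdftops"):
    """Run both conversions in order, stopping at the first failure."""
    for cmd in roundtrip_commands(src_pdf, ps_path, dst_pdf, to_ps):
        subprocess.check_call(cmd)
```

For instance, roundtrip("forms.pdf", "forms.ps", "forms2.pdf", to_ps="pdf2ps") would use pdf2ps for the first step, after which you can point the original script at forms2.pdf.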

Danosaure answered Nov 14 '22 23:11