
How to unlock a "secured" (read-protected) PDF in Python?

In Python, I'm using pdfminer to read the text from a PDF with the code below. I now get an error message saying:

    File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF in Acrobat Pro, it turns out to be secured (or "read protected"). I have read, however, that a multitude of services can disable this read protection easily (for example, pdfunlock.com). Diving into the source of pdfminer, I see that the error above is raised on these lines:

    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since a multitude of services can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.
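For context, here is a rough sketch of where a flag like is_extractable typically comes from. It assumes the PDF uses the standard security handler, whose encryption dictionary carries an integer /P with one bit per permission; in the PDF specification, bit 5 (value 16) covers copying or extracting text and graphics. The function name decode_permissions and the sample value -1852 are illustrative only and are not taken from pdfminer or from the file in question.

```python
# Bit values of the /P permission entry (standard security handler).
PRINT = 4      # bit 3: print the document
MODIFY = 8     # bit 4: modify the contents
EXTRACT = 16   # bit 5: copy or extract text and graphics
ANNOTATE = 32  # bit 6: add or modify annotations


def decode_permissions(p):
    """Map a /P integer to the four basic permission flags."""
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "extract": bool(p & EXTRACT),
        "annotate": bool(p & ANNOTATE),
    }


# Illustrative value: printing allowed, everything else denied.
flags = decode_permissions(-1852)
print(flags)
```

The flag is a request to the reading software, not a lock in itself. When a file opens without a user password, any tool that can decrypt it can also choose to ignore the flag, which is presumably why unlocking services are so fast.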

Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!

================================================

Below is the code with which I currently extract the text from PDFs that are not read-protected.

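    from cStringIO import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def getTextFromPDF(rawFile):
        resourceManager = PDFResourceManager(caching=True)
        outfp = StringIO()
        device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
        interpreter = PDFPageInterpreter(resourceManager, device)

        # Wrap the raw bytes in a file-like object for pdfminer
        fileData = StringIO()
        fileData.write(rawFile)
        for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
            interpreter.process_page(page)
        fileData.close()
        device.close()

        result = outfp.getvalue()

        outfp.close()
        return result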
kramer65 asked Jan 28 '15 13:01


1 Answer

I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts PDFs to be extractable.

The code to use this is pretty straightforward:

    import pikepdf

    pdf = pikepdf.open('unextractable.pdf')
    pdf.save('extractable.pdf')
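If the PDF arrives as raw bytes, as in the question's getTextFromPDF(rawFile), the same idea should work without touching the disk. This is a minimal sketch, not a tested drop-in: it assumes Python 3 (the question uses Python 2.7), it assumes the file opens with an empty user password, and the helper name unlock_pdf_bytes is made up for illustration.

```python
import io


def unlock_pdf_bytes(raw_pdf, password=""):
    """Return an unencrypted copy of raw_pdf (bytes) using pikepdf.

    Assumes the file can be opened with the given user password, which is
    often empty for PDFs that only restrict copying or printing.
    """
    # Imported inside the function so this module still loads
    # on machines where pikepdf is not installed.
    import pikepdf

    out = io.BytesIO()
    with pikepdf.open(io.BytesIO(raw_pdf), password=password) as pdf:
        # save() drops the existing encryption unless an
        # encryption argument is passed explicitly.
        pdf.save(out)
    return out.getvalue()
```

The returned bytes can then be passed to whatever text extractor you already use. If the file does need a real user password, pikepdf.open raises pikepdf.PasswordError and you would have to supply the password.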
IanJ answered Sep 16 '22 11:09