In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>
When I open this PDF with Acrobat Pro, it turns out it is secured (or "read protected"). I have read, however, that there is a multitude of services which can disable this read protection easily (for example pdfunlock.com). When diving into the source of pdfminer, I see that the error above is generated on these lines:
if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
Since there's a multitude of services which can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.
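To make that guard concrete, here is a toy model of it. Only the `if` condition and the exception name come from the pdfminer lines quoted above; `FakeDocument` and this `get_pages` are hypothetical stand-ins written for illustration, not pdfminer's classes.

```python
class PDFTextExtractionNotAllowed(Exception):
    """Local stand-in for pdfminer's exception of the same name."""


class FakeDocument:
    """Hypothetical stand-in for a parsed PDF document."""

    def __init__(self, pages, is_extractable):
        self.pages = pages
        self.is_extractable = is_extractable


def get_pages(doc, check_extractable=True):
    # Same condition as the two pdfminer lines quoted in the question.
    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed(
            "Text extraction is not allowed: %r" % doc
        )
    return list(doc.pages)


protected = FakeDocument(pages=["page 1", "page 2"], is_extractable=False)

try:
    get_pages(protected)
except PDFTextExtractionNotAllowed:
    print("blocked by the permission check")

# With the check switched off, the guard is skipped entirely.
print(get_pages(protected, check_extractable=False))
```

In pdfminer itself the equivalent would presumably be passing check_extractable=False to PDFPage.get_pages, the same keyword the extraction code in this question sets to True. Whether the text then comes out readable depends on pdfminer being able to decrypt the file, which this toy does not model.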
Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!
================================================
Below is the code I currently use to extract the text from PDFs that are not read-protected.
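from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(rawFile):
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)
    fileData = StringIO()
    fileData.write(rawFile)
    # check_extractable=True is what raises PDFTextExtractionNotAllowed for protected files
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()
    result = outfp.getvalue()
    outfp.close()
    return result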
How to unlock a PDF to remove password security: Open the PDF in Acrobat. Use the “Unlock” tool: Choose “Tools” > “Protect” > “Encrypt” > “Remove Security.”
In Adobe Acrobat Pro X: choose the Secure drop-down menu and select Remove Security. Then choose the File drop-down menu and select Save As to save the document in a location from which you can upload your eFiling document.
Because the document is encrypted with a password, you will have to enter the correct password to make it print-ready. Then go to the security options, click the option to delete the password, and finally click Save. After that, choose the option that allows a different print level.
I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.
The code to use this is pretty straightforward:
import pikepdf

pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
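A slightly more defensive variant of the snippet above might look like the sketch below. It assumes Python 3 with pikepdf installed. The helper names `unlocked_path` and `unlock_pdf` and the ".unlocked" naming scheme are made up for illustration. It relies on pikepdf's default of not carrying encryption over when saving, and it reports files that need a real password to open instead of failing with a raw library error.

```python
from pathlib import Path


def unlocked_path(src):
    """Return a sibling path such as 'report.unlocked.pdf' (naming is an assumption)."""
    src = Path(src)
    return src.with_name(src.stem + ".unlocked" + src.suffix)


def unlock_pdf(src, password=""):
    """Save a copy of src without encryption and return the new path."""
    # Imported here so unlocked_path() works even without pikepdf installed.
    import pikepdf

    dst = unlocked_path(src)
    try:
        # An empty password is enough for PDFs that only carry permission flags.
        with pikepdf.open(src, password=password) as pdf:
            pdf.save(dst)
    except pikepdf.PasswordError:
        raise ValueError("%s needs a password to open" % src)
    return dst
```

Calling unlock_pdf('unextractable.pdf') would then write 'unextractable.unlocked.pdf' next to the original, and that copy can be handed to the pdfminer code from the question.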