How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a PDF, with the code at the bottom of this post. I now get the following error:

      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro, it turns out to be secured (or "read protected"). However, I have read that there are many services that can disable this read protection easily (for example pdfunlock.com). Diving into the pdfminer source, I see that the error above is raised by these lines:

    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since so many services can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I doubt it is as simple as setting .is_extractable to True.
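Judging from the pdfminer source quoted above, the exception is only raised when check_extractable is true, so one possible approach is to pass check_extractable=False to PDFPage.get_pages. The following is a minimal sketch, assuming Python 3 with the pdfminer.six fork installed; the function name extract_text_unchecked and its parameters are made up for illustration. It only skips pdfminer's permission check: whether any text comes out still depends on pdfminer being able to decrypt the file, which generally means the PDF opens without a password or the correct one is supplied.

```python
from io import BytesIO, StringIO


def extract_text_unchecked(raw_pdf, password=""):
    """Extract text from PDF bytes without enforcing the 'extractable' flag.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Imported lazily so this module can be loaded without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager(caching=True)
    out = StringIO()
    device = TextConverter(resource_manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        pages = PDFPage.get_pages(
            BytesIO(raw_pdf),
            password=password,
            caching=True,
            # Skip the check that raises PDFTextExtractionNotAllowed.
            check_extractable=False,
        )
        for page in pages:
            interpreter.process_page(page)
    finally:
        device.close()
    return out.getvalue()
```

Called as extract_text_unchecked(raw_bytes), this mirrors the getTextFromPDF function at the bottom of the post, with the extractability check switched off.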

Does anybody know how I can disable the read protection on a PDF using Python? All tips are welcome!

================================================

Below is the code I currently use to extract the text from PDFs that are not read-protected.

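    # Python 2 imports; the traceback above shows python2.7 and cStringIO.
    from cStringIO import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def getTextFromPDF(rawFile):
        resourceManager = PDFResourceManager(caching=True)
        outfp = StringIO()
        device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
        interpreter = PDFPageInterpreter(resourceManager, device)

        # Copy the raw PDF bytes into an in-memory file object for pdfminer.
        fileData = StringIO()
        fileData.write(rawFile)
        # check_extractable=True makes get_pages raise PDFTextExtractionNotAllowed
        # for documents that are flagged as not extractable.
        for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
            interpreter.process_page(page)
        fileData.close()
        device.close()

        result = outfp.getvalue()

        outfp.close()
        return result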

840

asked Jan 28 '15 13:01

kramer65

1 Answer

I had some trouble getting qpdf to behave when calling it from my program. I found a useful library, pikepdf, which is built on qpdf and automatically makes PDFs extractable when it opens and re-saves them.

The code to use this is pretty straightforward:

    import pikepdf

    pdf = pikepdf.open('unextractable.pdf')
    pdf.save('extractable.pdf')
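Since the question's getTextFromPDF works on raw bytes rather than on files, the same open-and-save step can probably be done in memory. The sketch below assumes Python 3 with pikepdf installed; unlock_pdf_bytes is a hypothetical helper name, not part of pikepdf. It relies on pikepdf.open accepting a stream and a password argument, and on Pdf.save writing an unencrypted file when no encryption settings are passed. A PDF that needs a real user password to open will still need that password supplied here.

```python
from io import BytesIO


def unlock_pdf_bytes(raw_pdf, password=""):
    """Return an unencrypted copy of a PDF given as bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Imported lazily so this module can be loaded without pikepdf installed.
    import pikepdf

    with pikepdf.open(BytesIO(raw_pdf), password=password) as pdf:
        out = BytesIO()
        # Saving without encryption settings drops the existing encryption.
        pdf.save(out)
    return out.getvalue()
```

The returned bytes could then be passed to the question's function, for example getTextFromPDF(unlock_pdf_bytes(rawFile)).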

123

answered Sep 16 '22 11:09

IanJ

Tags: python, pdf, pdf-scraping, pdfminer
