Pdfminer python 3.5

I have followed a few tutorials, but I can't get this code block to run. I made the necessary switch from StringIO to BytesIO (I believe).

I am unsure why 'banana' prints nothing; I think the errors might be red herrings. Does it have something to do with following a Python 2.7 tutorial and trying to translate it to Python 3?

errors:

      File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
        banana = convert("A1.pdf")
      File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
        infile = file(fname, 'rb')
    NameError: name 'file' is not defined

script

    from io import BytesIO

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage

    def convert(fname, pages=None):
        if not pages:
            pagenums = set()
        else:
            pagenums = set(pages)

        output = BytesIO()
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(manager, converter)

        infile = file(fname, 'rb')
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
        infile.close()
        converter.close()
        text = output.getvalue()
        output.close
        return text

    banana = convert("A1.pdf")
    print(banana)

The same thing happens with this variant:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import BytesIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = BytesIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = file(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos=set()

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text

    Banana = convert_pdf_to_txt("A1.pdf")
    print(Banana)

I have tried searching for this (most of the pdfminer code is from this or this) but have had no luck.

Any insight is appreciated.

Cheers

asked Oct 04 '16 14:10

gary

1 Answer

There is a solution for Python 3.5: you need pdfminer.six. On Windows 10 I could easily install it with

pip install pdfminer.six

You can check the installed version with

pdfminer.__version__

I haven't tested it extensively yet, but I could run the following code for the pdf→text and pdf→html conversions.
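
The answer's own code did not survive on this page, so what follows is only a minimal sketch of the pdf→text case, adapted from the question's convert function and assuming pdfminer.six is installed. The NameError in the question comes from file(), which is not a built-in in Python 3; open() is the replacement. The switch from BytesIO to StringIO is an assumption based on pdfminer.six's TextConverter writing str when handed a text buffer. The pdf→html case is not covered here.

```python
from io import StringIO


def convert(fname, pages=None):
    """Extract text from the PDF at fname.

    pages is an optional iterable of zero-based page numbers;
    leaving it empty processes every page.
    """
    # Imported inside the function so that importing this module does not
    # fail where pdfminer.six is not installed yet (pip install pdfminer.six).
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    pagenums = set(pages) if pages else set()

    # Assumption: with a text buffer, TextConverter writes str rather than bytes.
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # open() replaces the Python 2 built-in file(); the with block closes the file.
    with open(fname, "rb") as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue()
    output.close()  # note the parentheses: output.close alone does nothing
    return text
```

With the question's file, this would be called as print(convert("A1.pdf")).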

answered Sep 29 '22 05:09

pyano
