Extracting text from a PDF file using PDFMiner in Python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier still use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just reading the source code to see if I can figure it out.

892

asked Oct 21 '14 18:10

RattleyCooper

2 Answers

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages

    # Feed each page through the interpreter; the device writes text to retstr.
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from PDF files.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
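To extract only some pages, the `pagenos` set passed to `PDFPage.get_pages` can be filled instead of left empty. As far as I can tell, PDFMiner treats those numbers as zero-based page indices. Here is a minimal sketch of a helper that converts a human-style page spec into such a set. The name `parse_page_spec` and the "1-3,5" spec format are my own assumptions for illustration, not part of PDFMiner:

```python
def parse_page_spec(spec):
    """Turn a 1-based page spec such as "1-3,5" into a set of 0-based indices.

    Hypothetical helper (not part of PDFMiner). An empty spec yields an
    empty set, which the example above treats as "all pages".
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces, e.g. from a trailing comma
        if "-" in part:
            start, end = part.split("-", 1)
            first, last = int(start), int(end)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based page numbers to 0-based indices.
        pages.update(range(first - 1, last))
    return pages
```

The result could then replace `pagenos = set()` in the function above, for example `pagenos = parse_page_spec("1-3,5")` to read pages 1, 2, 3 and 5.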

142

answered Sep 21 '22 06:09

RattleyCooper

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
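If you are on a recent pdfminer.six release, there may be a shorter route: its `pdfminer.high_level` module provides an `extract_text` function that wraps the resource manager, converter and interpreter setup shown above. The sketch below assumes pdfminer.six is installed; the wrapper name `extract_text_simple` is hypothetical, and older PDFMiner variants may not ship this module:

```python
def extract_text_simple(path, page_numbers=None):
    """Return the text of a PDF using pdfminer.six's high-level helper.

    Assumes pdfminer.six is installed. The import is kept inside the
    function so that merely defining it does not require the package.
    """
    from pdfminer.high_level import extract_text

    # page_numbers is an optional collection of zero-based page indices;
    # None means every page is processed.
    return extract_text(path, page_numbers=page_numbers)
```

Usage would be something like `text = extract_text_simple("example.pdf")`, where `example.pdf` is a placeholder path.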

answered Sep 19 '22 06:09

juan Isaza

Tags:

python

python-3.x

python-2.7

text-extraction

pdfminer
