Extracting text from a PDF file using PDFMiner in Python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier still use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just reading the source code to see if I can figure it out.

RattleyCooper asked Oct 21 '14 18:10


2 Answers

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no limit on the number of pages
    caching = True
    pagenos = set()  # an empty set means every page is processed

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this is what should work for extracting text from PDF files with the new API.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
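
A note on the `pagenos` argument, since the function above always passes an empty set: as far as I can tell from the source, `PDFPage.get_pages` compares `pagenos` against zero-based page indices and processes every page when the set is empty. Treat that as an assumption and check it against your installed version. The sketch below uses a hypothetical helper, `parse_page_spec` (my own name, not part of PDFMiner), to turn a 1-based spec like "1-3,5" into such a set.

```python
def parse_page_spec(spec):
    """Turn a 1-based page spec such as "1-3,5" into a set of 0-based indices."""
    pagenos = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty input and stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based (human) numbering to 0-based indices.
        pagenos.update(range(first - 1, last))
    return pagenos
```

The resulting set could then be passed as the second argument to `PDFPage.get_pages` in place of the empty set. An empty spec yields an empty set, which keeps the "all pages" behaviour.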

RattleyCooper answered Sep 21 '22 06:09


Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
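
If you need the text per page rather than as one long string, one option is to split the returned string afterwards. This is only a sketch: it assumes the text converter writes a form feed character (`\f`) after each page, which matches my reading of `TextConverter` but is worth verifying on your own output. `split_pages` is a hypothetical helper name, not something PDFMiner provides.

```python
def split_pages(text):
    """Split extracted text into one string per page, using form feeds as separators."""
    pages = text.split("\f")
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    # Trim surrounding whitespace from each page's text.
    return [page.strip() for page in pages]
```

You would call it on the return value of `convert_pdf_to_txt`. If your output turns out not to contain form feeds, the whole text simply comes back as a single "page".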

juan Isaza answered Sep 19 '22 06:09