
How do I use pdfminer as a library?

I am trying to extract text from a PDF using pdfminer. I can already extract it to a .txt file with pdfminer's command-line tool, pdf2txt.py, and I then run a Python script to clean up that file. I would like to do the extraction inside the script itself and save myself a step.

I thought I was on to something when I found this link, but none of the solutions there worked for me. Perhaps the function listed there needs to be updated again, since I am using a newer version of pdfminer.

I also tried the function shown here, but it did not work either.

Another approach I tried was to call pdf2txt.py from within my script using os.system. This was also unsuccessful.
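For the call-a-script-from-a-script route, a minimal sketch using subprocess instead of os.system may be easier to debug, because it captures the exit code and error output rather than discarding them. The helper names run_and_capture and pdf_to_text_via_cli are hypothetical, and the sketch assumes that pdf2txt.py writes the extracted text to standard output when no output file is given and that the script path passed in is correct for your install.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit_code, stdout, stderr) as bytes."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err


def pdf_to_text_via_cli(pdf_path, script="pdf2txt.py"):
    # Assumption: `script` is the path to pdfminer's pdf2txt.py. It is run with
    # the same interpreter as this script so the same pdfminer install is used.
    code, out, err = run_and_capture([sys.executable, script, pdf_path])
    if code != 0:
        # Surface the tool's own error message instead of failing silently.
        raise RuntimeError("pdf2txt.py failed: %s" % err.decode("utf-8", "replace"))
    return out.decode("utf-8", "replace")
```

If the call fails, the RuntimeError message carries whatever pdf2txt.py printed to standard error, which is usually enough to tell a wrong path from a parsing problem.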

I am using Python version 2.7.1 and pdfminer version 20110227.

jmeich asked Apr 20 '11 03:04



1 Answer

Here is a new solution that works with the latest version:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO  # Python 2 only

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()  # shared resources such as fonts and images
        retstr = StringIO()             # in-memory buffer that receives the text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0      # 0 means no page limit
        caching = True
        pagenos = set()   # empty set means all pages
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        text = retstr.getvalue()  # renamed from `str` to avoid shadowing the builtin
        retstr.close()
        return text
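On Python 3, cStringIO and file() no longer exist, so the function above will not run there unchanged. A minimal sketch of the same conversion, assuming the pdfminer.six fork is installed (an assumption, the answer does not mention it), can lean on that fork's high-level extract_text helper. The name pdf_to_text and its parameters are illustrative only.

```python
try:
    # Assumption: the pdfminer.six fork is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text
except ImportError:
    extract_text = None


def pdf_to_text(path, password="", max_pages=0):
    """Return the text of the PDF at `path` as one string.

    max_pages=0 means no page limit, mirroring maxpages in the answer above.
    """
    if extract_text is None:
        raise RuntimeError("pdfminer.six is required for pdf_to_text()")
    return extract_text(path, password=password, maxpages=max_pages)
```

The returned string can then be passed straight to the existing cleanup code, with no intermediate .txt file.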
czw answered Sep 22 '22 06:09