How do I use pdfminer as a library

I am trying to get text data from a PDF using pdfminer. I can extract this data to a .txt file with the pdfminer command-line tool pdf2txt.py, and I then use a Python script to clean up the .txt file. I would like to incorporate the PDF extraction into the script and save myself a step.

I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.

I also tried the function shown here, but it also did not work.

Another approach I tried was to call the script within a script using os.system. This was also unsuccessful.
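For the script-within-a-script route, one option is `subprocess` with an argument list instead of `os.system`, so that paths containing spaces are passed through intact and a failed run raises an exception. This is only a sketch: the `script` default is a hypothetical location for pdf2txt.py, the helper names are made up for illustration, and it relies on the tool's `-o` option for naming the output file.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Return the argument list for one pdf2txt.py run.

    `script` is a hypothetical location; point it at wherever
    pdf2txt.py actually lives on your machine.
    """
    # -o names the output file; the input PDF comes last.
    return [sys.executable, script, "-o", txt_path, pdf_path]


def run_pdf2txt(pdf_path, txt_path, script="pdf2txt.py"):
    """Run pdf2txt.py and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_pdf2txt_command(pdf_path, txt_path, script))
```

Running the tool through `sys.executable` uses the same interpreter as the calling script, which avoids depending on file associations or the shebang line.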

I am using Python version 2.7.1 and pdfminer version 20110227.

asked Apr 20 '11 03:04

jmeich

1 Answer

Here is a new solution that works with the latest version:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text
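To fold the extraction into the clean-up script, the string returned by `convert_pdf_to_txt` can go straight to the cleaning code with no intermediate .txt file. A rough sketch follows. The question does not say what the clean-up involves, so `clean_text` and its rules (strip trailing whitespace, collapse runs of blank lines) are placeholders, and `pdf_to_clean_text` is a hypothetical helper name. The extractor is passed in as an argument so the cleaning logic can be tried without a PDF at hand.

```python
import re


def clean_text(raw):
    """Placeholder clean-up: swap in whatever your script really does."""
    # Drop trailing whitespace from every line.
    lines = [line.rstrip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def pdf_to_clean_text(path, extract):
    """Extract text with `extract(path)` and clean it in memory."""
    return clean_text(extract(path))
```

With the function above in scope, a call might look like `pdf_to_clean_text("report.pdf", convert_pdf_to_txt)`, where `report.pdf` is a made-up file name.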

answered Sep 22 '22 06:09

czw

Tags: python, pdf, pdfminer
