Python module for converting PDF to text [closed]

2 Answers

Try PDFMiner. It can extract text from PDF files in HTML, SGML, or "Tagged PDF" format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
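As a rough illustration of the tag-stripping step, here is a minimal sketch using only the standard library. The sample markup is made up for the example and is not real PDFMiner output, and `strip_tags` is a hypothetical helper name. A regular expression is good enough for simple, well-formed tags, but it is not a full XML parser.

```python
import re
from html import unescape


def strip_tags(markup):
    """Drop anything that looks like an XML/HTML tag and tidy the whitespace."""
    # Replace each tag with a space so neighbouring words do not run together.
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    # Turn entities such as &amp; back into plain characters.
    text = unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample markup, only for demonstration.
sample = '<page id="1"><text font="Arial">Hello</text> <text>PDF &amp; world</text></page>'
print(strip_tags(sample))
```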

A Python 3 version is available at:

https://github.com/pdfminer/pdfminer.six
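With pdfminer.six, plain-text extraction can be quite short. This is a minimal sketch, assuming pdfminer.six is installed and that your release provides the `pdfminer.high_level.extract_text` helper; `pdf_to_text` is a name chosen for this example.

```python
def pdf_to_text(pdf_path):
    """Return the text of a PDF file as a single string (assumes pdfminer.six)."""
    # Imported inside the function so this module still loads
    # on systems where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path))
```

Calling `pdf_to_text("report.pdf")` (a hypothetical file name) should return the extracted text, or raise an error if the file or the library is missing.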


answered Oct 14 '22 05:10

David Crow

The PDFMiner package has changed since codeape posted.

EDIT (again):

PDFMiner has been updated again, in version 20100213.

You can check the version you have installed with the following:

>>> import pdfminer
>>> pdfminer.__version__
'20100213'
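On Python 3.8 or later, the installed version of a package can also be read from its packaging metadata without importing it. This is a small sketch; `installed_version` is a hypothetical helper, and the distribution name `pdfminer.six` assumes you installed the Python 3 fork.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string, or None when pdfminer.six is not installed.
print(installed_version("pdfminer.six"))
```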

Here's the updated version (with comments on what I changed/added):

def pdf_to_csv(filename):
    from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
    from pdfminer.converter import LTTextItem, TextConverter
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTTextItem):
                    (_,_,x,y) = child.bbox                   #<-- changed
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)  #<-- changed

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8")  #<-- changed
        # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       #<-- changed
    parser.set_document(doc)     #<-- added
    doc.set_parser(parser)       #<-- added
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
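The core of `end_page` is independent of PDFMiner: items are bucketed into rows by their truncated, negated y coordinate (so the top of the page sorts first), and each row is then ordered by x and joined with ";". The sketch below isolates that logic in Python 3. The `(x, y, text)` tuples and the name `items_to_rows` are assumptions made for illustration, standing in for the bounding-box values the converter reads.

```python
from collections import defaultdict


def items_to_rows(items, separator=";"):
    """Group (x, y, text) items into rows, top to bottom, left to right."""
    lines = defaultdict(dict)
    for x, y, text in items:
        # Negating y makes larger y values (higher on the page) sort first;
        # int() truncates, so items with nearly equal y share one row.
        lines[int(-y)][x] = text

    rows = []
    for y_key in sorted(lines):
        cells = lines[y_key]
        rows.append(separator.join(cells[x] for x in sorted(cells)))
    return rows


# Made-up coordinates: two rows of two cells each, given out of order.
items = [
    (300.0, 700.2, "Price"),
    (72.0, 700.9, "Item"),
    (72.0, 680.5, "Widget"),
    (300.0, 680.1, "9.99"),
]
for row in items_to_rows(items):
    print(row)
```

Because rows are keyed on the truncated coordinate, two items on the same visual line can still land in different rows if their y values straddle an integer boundary. Rounding to a coarser grid is one possible adjustment.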

Edit (yet again):

Here is an update for the latest version on PyPI, 20100619p1. In short, I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.

def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter    #<-- changed
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTChar):               #<-- changed
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  #<-- changed
        # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

EDIT (one more time):

Updated for version 20110515 (thanks to Oeufcoque Penteano!):

def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item._objs:                #<-- changed
                if isinstance(child, LTChar):
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child._text.encode(self.codec) #<-- changed

            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
        # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
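The code above targets Python 2 (note `cStringIO`) and PDFMiner releases from 2010 and 2011, so it is unlikely to run unchanged today. For the Python 3 fork, pdfminer.six, a rough equivalent of the page loop might look like the sketch below. It assumes pdfminer.six is installed and that `pdfminer.high_level.extract_pages` and `pdfminer.layout.LTTextContainer` exist in your release. `pdf_pages_to_text` is a hypothetical name, and the sketch emits plain text per page rather than the ";"-joined rows.

```python
def pdf_pages_to_text(filename):
    """Return the text of each page wrapped in START/END PAGE markers."""
    # Imported inside the function so this module still loads
    # on systems where pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    chunks = []
    for i, page_layout in enumerate(extract_pages(filename)):
        chunks.append("START PAGE %d\n" % i)
        for element in page_layout:
            # Only text boxes and lines carry text; skip figures, curves, etc.
            if isinstance(element, LTTextContainer):
                chunks.append(element.get_text())
        chunks.append("END PAGE %d\n" % i)
    return "".join(chunks)
```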

answered Oct 14 '22 04:10

tgray


Tags: python, pdf, text-extraction, pdf-scraping

cnu
