Finding on which page a search string is located in a PDF document using Python

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?

More precisely: I have several PDF documents, and I would like to extract the pages that lie between a string “Begin” and a string “End”.

asked Sep 24 '12 19:09

user1043144

1 Answer

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

(1) a function to locate the string

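def fnPDF_FindText(xFile, xString):
    # xFile   : the PDF file in which to look
    # xString : the string to look for (passed to re.search, so it is
    #           treated as a regular expression; matched case-insensitively)
    # Returns the zero-based index of the first matching page, or -1.
    import pyPdf, re
    PageFound = -1
    with open(xFile, "rb") as pdfFile:
        pdfDoc = pyPdf.PdfFileReader(pdfFile)
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            # Drop non-ASCII characters before searching
            content1 = content.encode('ascii', 'ignore').decode('ascii')
            # IGNORECASE so that a mixed-case xString such as "Begin" can match
            if re.search(xString, content1, re.IGNORECASE) is not None:
                PageFound = i
                break
    return PageFound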
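
The search step does not depend on the PDF library, so it can be tried on plain strings first. The sketch below is only an illustration: `find_first_page`, its `literal` flag, and the sample page texts are made-up names, not part of pyPdf. It assumes the page texts have already been extracted. Because the function above hands the search string to `re.search`, characters such as `.` or `(` act as regular-expression syntax; `re.escape` is one way to search for them literally.

```python
import re

def find_first_page(page_texts, search_string, literal=True):
    """Return the zero-based index of the first page whose text contains
    search_string (case-insensitive), or -1 if no page matches."""
    # re.escape makes characters such as "." or "(" match literally
    pattern = re.escape(search_string) if literal else search_string
    regex = re.compile(pattern, re.IGNORECASE)
    for index, text in enumerate(page_texts):
        if regex.search(text):
            return index
    return -1

# Hypothetical page texts standing in for the output of extractText()
pages = ["Title page", "Section 1.0 Begin here", "Middle", "The End"]
print(find_first_page(pages, "begin"))    # 1
print(find_first_page(pages, "missing"))  # -1
```

Returning -1 for "not found" mirrors the convention of the function above, so the two can be swapped without changing the calling code.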

(2) a function to extract the pages of interest

def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (zero-based, xPageEnd is exclusive)
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    with open(xFileNameOriginal, "rb") as inputStream:
        pdfOne = PdfFileReader(inputStream)
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        # Write once, after all pages are added and while the source is still open
        with open(xFileNameOutput, "wb") as outputStream:
            output.write(outputStream)
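
To connect the two steps for the “Begin”/“End” case from the question, the page numbers have to be turned into a range. Since `range(xPageStart, xPageEnd)` excludes `xPageEnd`, the stop value needs to be one past the page that contains “End” if that page should be included. A minimal sketch, assuming the page texts are already available as a list of strings (`page_range_between` is a hypothetical helper, not a pyPdf function):

```python
def page_range_between(page_texts, begin_marker, end_marker):
    """Return (start, stop) so that range(start, stop) covers the first page
    containing begin_marker through the first page at or after it that
    contains end_marker. Return None if either marker is not found."""
    begin_lower = begin_marker.lower()
    end_lower = end_marker.lower()
    start = None
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        if start is None and begin_lower in lowered:
            start = index
        if start is not None and end_lower in lowered:
            # +1 because the stop value of a range is exclusive
            return start, index + 1
    return None

# Hypothetical page texts standing in for extracted PDF text
pages = ["Cover", "Begin chapter", "Body", "The End", "References"]
print(page_range_between(pages, "Begin", "End"))  # (1, 4)
```

The resulting pair can be passed as `xPageStart` and `xPageEnd` to `fnPDF_ExtractPages`. Note that this is plain case-insensitive substring matching at page granularity, so a marker like “End” would also match inside a longer word such as “appendix”; a more distinctive marker or a word-boundary regular expression may be needed in practice.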

I hope this will be helpful to somebody else.

answered Nov 05 '22 14:11

user1043144


Tags: python, pdf, pypdf2, pypdf
