Search and replace for text within a pdf, in Python

Tags:

pdf

I am writing mailmerge software as part of a Python web app.

I have a template called letter.pdf which was generated from a MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.

What I want to do is to read in letter.pdf do a search for "{name}" and replace it with the resident's name (for each resident) then write the result to another pdf. I then want to gather all these pdfs togetherinot a big pdf (one page per letter) which my web app's users will print out to create their letters.

Are there any Python libraries that will do this? I've looked and pdfrw and pdfminer but I couldn't see where they would be able to do it.

(NB: I also have the MS Word file, so if there was another way of using that not going through a pdf, that would also do the job.)

418

asked Jan 20 '17 17:01

Phil Hunt

2 Answers

This can be done with PyPDF2 package. The implementation may depend on the original PDF template structure. But if the template is stable enough and isn't changed very often the replacement code shouldn't be generic but rather simple.

I did a small sketch on how you could replace the text inside a PDF file. It replaces all occurrences of PDF tokens to DOC.

Click to copy

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

The results are

Original PDF Replaced PDF

UPDATE 2021-03-21:

Updated the code example to handle DecodedStreamObject and EncodedStreamObject which actually contian data stream with text to update.

171

answered Sep 22 '22 00:09

Dmytro

If @Dmytrio solution do not alter final PDF

Dymitrio's updated code example to handle DecodedStreamObject and EncodedStreamObject which actually contain data stream with text to update could run fine, but with a file different from example, was not able to alter pdf text content.

According to EDIT 3, from How to replace text in a PDF using Python?:

By inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), we force pyPDF2 to update content of the page object.

This way I was able to overcome this problem and replace text from pdf file.

Final code should look like this:

Click to copy

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        # Force content replacement
        page[NameObject("/Contents")] = contents.decodedSelf
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

Important: from PyPDF2.generic import NameObject

answered Sep 19 '22 00:09

Vladimir Simoes da Luz Junior

Related questions
                            
                                Pandas count consecutive date observations within groupby object
                            
                                What's the point of @staticmethod in Python?
                            
                                Debugging TensorFlow tests: pdb or gdb?
                            
                                How to use a python library that is constantly changing in a docker image or new container?
                            
                                Redefining python built-in function
                            
                                pandas - Selecting pair of consecutive rows matching criteria
                            
                                Fill na values by adding x to previous row pandas
                            
                                how to fill missing values with a tuple
                            
                                Check if tuple contains at least one of multiple values [duplicate]
                            
                                Array of Enum in Postgres with SQLAlchemy
                            
                                Resize Vertical Header of QTableView in PyQt4?
                            
                                Scroll in Selenium Webdriver (Python)
                            
                                Pandas - aggregate, sort and nlargest inside groupby
                            
                                Python multithreading list append gives unexpected results
                            
                                Bjoern v/s Gunicorn POST requests
                            
                                Highlighting multiple cells in different colors with Pandas
                            
                                How to reproduce UnicodeEncodeError?
                            
                                Python subprocess running in background before returning output
                            
                                Interactive boxplot with pandas and Jupyter notebook
                            
                                Converting time to epoch (Python) [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With