Watermark Removal on PDF with PyPDF2

Tags: python, pdf, watermark, pypdf2

# This Section imports the necessary classes from the PyPDF2 library

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import ContentStream, NameObject, TextStringObject
from PyPDF2.utils import b_

# The watermark reads SAMPLE, so I've tried different
# capitalizations of the search text
wm_text = "Sample"
replace_with = ""
# I'm hoping to replace the SAMPLE watermark with nothing
# (an empty string); a single space would also do

# Load PDF into pyPDF
reader = PdfFileReader("input.pdf")
writer = PdfFileWriter()

for page in reader.pages:
    # Get the current page's contents
    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, reader)

    # Loop over all pdf elements
    for operands, operator in content.operations:

        # Was told to adapt this part dependent on my PDF file
        if operator == b_("TJ"):
            text = operands[0][0]
            if isinstance(text, TextStringObject) and text.startswith(
                wm_text
            ):
                operands[0] = TextStringObject(replace_with)

    # Set the modified content as content object on the page
    page.__setitem__(NameObject("/Contents"), content)

    # Add the page to the output
    writer.addPage(page)

# Write the stream
with open("output.pdf", "wb") as fh:
    writer.write(fh)

asked Jun 10 '16 16:06

Shane G.

1 Answer

Using the code from the question, here is a function that works in Python 3. It uses PyPDF4 and checks the Tj operator instead of TJ.

from PyPDF4 import PdfFileReader, PdfFileWriter
from PyPDF4.pdf import ContentStream
from PyPDF4.generic import TextStringObject, NameObject
from PyPDF4.utils import b_


def remove_watermark(wm_text, inputFile, outputFile):
    # Keep the input file open until the output has been written
    with open(inputFile, "rb") as f:
        # The second positional argument is `strict`, not a file mode
        source = PdfFileReader(f)
        output = PdfFileWriter()

        for page_number in range(source.getNumPages()):
            page = source.getPage(page_number)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            for operands, operator in content.operations:
                # Tj shows a single string; adapt the operator to your PDF
                if operator == b_("Tj"):
                    text = operands[0]

                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            # Set the modified content as the page's content object
            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)


wm_text = 'wm_text'  # replace with the text your watermark starts with
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)
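
The question checks the TJ operator and this answer checks Tj, and which one draws the watermark depends on the PDF. Tj takes a single string, while TJ takes one array that mixes strings with numeric kerning adjustments, so a watermark can be split into several pieces. Below is a minimal sketch of matching both cases. It is not part of PyPDF2 or PyPDF4: the helper names shown_text and blank_watermark_ops are made up for illustration, the case-insensitive match is my own choice, and plain Python lists and bytes stand in for content.operations so the matching logic can be tried without a PDF.

```python
# Sketch only: plain lists and bytes stand in for PyPDF's ContentStream.operations.
# shown_text and blank_watermark_ops are hypothetical helpers, not library functions.


def shown_text(operands, operator):
    """Return the text drawn by a Tj or TJ operation, or None for anything else."""
    if operator == b"Tj":
        # Tj takes a single string operand.
        return operands[0] if isinstance(operands[0], str) else None
    if operator == b"TJ":
        # TJ takes one array mixing strings with numeric kerning adjustments.
        return "".join(item for item in operands[0] if isinstance(item, str))
    return None


def blank_watermark_ops(operations, wm_text):
    """Blank every Tj/TJ operation whose text starts with wm_text (case-insensitive).

    Mutates the operand lists in place and returns the number of operations changed.
    """
    changed = 0
    for operands, operator in operations:
        text = shown_text(operands, operator)
        if text is not None and text.lower().startswith(wm_text.lower()):
            # Keep the operand type each operator expects:
            # a string for Tj, an array for TJ.
            operands[0] = "" if operator == b"Tj" else [""]
            changed += 1
    return changed


operations = [
    (["Invoice 42"], b"Tj"),
    ([["SAM", -20, "PLE"]], b"TJ"),  # watermark split by a kerning adjustment
    (["Sample"], b"Tj"),
    ([1, 0, 0, 1, 0, 0], b"cm"),     # not a text operator, left alone
]
print(blank_watermark_ops(operations, "sample"))  # 2
```

To use this idea inside remove_watermark, the replacement values would presumably need to be PyPDF objects rather than plain Python values (for example TextStringObject('') for Tj and an array object wrapping it for TJ). Strings stored as byte strings are not matched by this sketch.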

answered Oct 25 '22 17:10

faysou
