
How to reduce size of modified PDF using pymupdf

Tags: python, pymupdf

I'm editing a PDF with PyMuPDF by redacting certain words and writing different words on top of the redacted areas.

The code works, but it produces a very large single-page PDF (9 MB). I assume this is because it draws many shapes and redactions, but I can't figure out how to refactor it.

I know from this post that I shouldn't call page.apply_redactions() more than once, but if I don't, either the text is not displayed correctly on top of the redacted rectangle or the code raises ValueError: fill rect must be finite and not empty.

Any help in refactoring for a smaller output pdf would be much appreciated.

    doc = fitz.open(self.path) 
    # get pdf background colour
    col = fitz.utils.getColor("py_color")
    # iterating through pages 
    for page in doc: 

        page.wrap_contents()
        # get the rectangles containing text that matches the regex
        sensitive = self.get_sensitive_data(page.getText("text") 
                                            .split('\n')) 
        for data in sensitive: 
            areas = page.searchFor(data) 
            for area in areas:
                text_page = page.get_textpage(clip=area)
                text_page = text_page.extractDICT(area)
                # text_page = area
                max_length = fitz.getTextlength(str(max(column, key=len)), fontsize=fontsize)+14
                area = format_border(page, area, data, fontsize, align=align, max_length=max_length)
                area.y1 = add_yrect_line(column, area.y1, area.y1-area.y0)
                col = fitz.utils.getColor("white")
                redaction = page.addRedactAnnot(new_area, fill=col, text=" ") #flags not available
                page.apply_redactions()  # page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE) to circumvent transparent image issues
                writer = fitz.TextWriter(page.rect, color=color)
                # align to top of box if align right:
                writer.fill_textbox(new_area, variable, fontsize=fontsize, warn=True, align=align, font=font)
                writer.write_text(page)
                # To show what happened, draw the rectangles, etc.
                shape = page.newShape()
                shape.drawRect(new_area)  # the rect within which we had to stay
                shape.finish(stroke_opacity=0)  # stroke_opacity=0 makes the outline invisible
                shape.commit()

                shape = page.newShape()
                shape.drawRect(writer.text_rect)  # the generated TextWriter rectangle
                shape.drawCircle(writer.last_point, 2)  # coordinates of end of text
                shape.finish(stroke_opacity=0)  # stroke_opacity=0 makes the outline invisible
                shape.commit()
                writer = fitz.TextWriter(area, color=color)
polymath asked Oct 28 '25 08:10



1 Answer

It is a little hard to tell without knowing more about the PDF page you are dealing with. Inserting text or drawings does not add much data, however, so I presume that applying redactions is causing the issue. If your page contains images that overlap any of your redaction rectangles, apply_redactions() (with no arguments!) will modify the overlapping image parts and blank them out. This happens for every image and each of its overlaps, and the result is a new, uncompressed PNG version of each image. So you should try one of the following:

  • do not touch any images: use page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
  • remove every image with at least one overlap (may be undesirable): page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
  • or, at least, use garbage=3, deflate=True when saving the file to compress modified images.

In fact, you should always use garbage collection and compression after these types of operations.
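
The first and third options can be combined roughly as follows. This is only a minimal sketch, not the asker's actual code: the function names `queue_redactions` and `redact_and_save`, the `words` parameter, and the white fill are assumptions made for illustration. It uses the snake_case method names (`search_for`, `add_redact_annot`) of newer PyMuPDF releases, where the question uses the older camelCase names. All redaction annotations on a page are queued first, and `apply_redactions()` is then called once with images left untouched.

```python
# Save options from the answer: drop unreferenced objects and compress streams.
SAVE_OPTIONS = {"garbage": 3, "deflate": True}


def queue_redactions(page, rects, fill=(1, 1, 1)):
    """Add one redaction annotation per rectangle without applying any yet."""
    for rect in rects:
        page.add_redact_annot(rect, fill=fill)
    return len(rects)


def redact_and_save(src_path, dst_path, words):
    """Redact every occurrence of each word, leaving images untouched."""
    # PyMuPDF is imported here so the helper above can be used without it.
    import fitz

    doc = fitz.open(src_path)
    for page in doc:
        # Collect every hit rectangle on this page before touching the page.
        rects = [rect for word in words for rect in page.search_for(word)]
        if not queue_redactions(page, rects):
            continue  # nothing to redact on this page
        # One call per page, after all annotations are queued; images are
        # not rewritten, which avoids the uncompressed copies.
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
    # Garbage collection and compression when writing the result.
    doc.save(dst_path, **SAVE_OPTIONS)
    doc.close()
```

Replacement text (for example via a TextWriter, as in the question) would be written after the single `apply_redactions()` call on each page, so that the redaction does not remove it again.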

Jorj McKie answered Oct 30 '25 23:10



