Remove all images from docx files

Question

I've searched the documentation for python-docx and other packages, as well as stack-overflow, but could not find how to remove all images from docx files with python.

My exact use-case: I need to convert hundreds of word documents to "draft" format to be viewed by clients. Those drafts should be identical the original documents but all the images must be deleted / redacted from them.

Sorry for not including an example of things I tried, what I have tried is hours of research that didn't give any info. I found this question on how to extract images from word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python

From there and other sources I've found out that docx files could be read as simple zip files, I don't know if that means that it's possible to "re-zip" without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but prevents python-docx from continuing to work with this file because of missing references to images), but thought this might be a path to a solution.

Any ideas?

mata · Accepted Answer

If your goal is to redact images maybe this code I used for a similar usecase could be useful:

import sys
import zipfile
from PIL import Image, ImageFilter
import io

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                print(info)
                content = inzip.read(info)
                if name.endswith((".png", ".jpeg", ".gif")):
                        fmt = name.split(".")[-1]
                        img = Image.open(io.BytesIO(content))
                        img = img.convert().filter(blur)
                        outb = io.BytesIO()
                        img.save(outb, fmt)
                        content = outb.getvalue()
                        info.file_size = len(content)
                        info.CRC = zipfile.crc32(content)
                outzip.writestr(info, content)

Here I used PIL to blur images in some files, but instead of the blur filter any other suitable operation could be used. This worked quite nicely for my usecase.

Remove all images from docx files

Tags:

python

python-docx

docx

Ofer Sadan

1 Answers

mata

Recent Activity

Donate For Us

Remove all images from docx files

Tags:

python

python-docx

docx

Ofer Sadan

1 Answers

mata

Related questions

Recent Activity

Donate For Us