Removing personal information from the comments in a Word file using Python

I want to remove all the personal information from the comments inside a Word file.

Removing the author's name is fine; I did that with the following:

from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
core_properties.author = ""
document.save('new-filename.docx')

But this is not what I need. I want to remove the name of every person who commented inside that Word file.

The way we do it manually is by going into Preferences -> Security -> Remove personal information from this file on save.

sunil pawar asked Jan 06 '23 20:01


2 Answers

To remove personal information from the comments in a .docx file, you have to work on the internals of the file itself.

A .docx file is just a .zip archive containing Word-specific files. We need to overwrite some of those internal files, and the easiest way I could find is to read all of them into memory, change what needs changing, and write everything to a new archive.
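
Before rewriting anything, it can help to confirm where the names actually live. The following is a minimal sketch, not a definitive implementation: `comment_authors` is a hypothetical helper that opens a .docx (a path or a binary file-like object) as a zip archive and collects the `w:author` values found in `word/comments.xml`, assuming the standard WordprocessingML namespace.

```python
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# Standard WordprocessingML namespace used by word/comments.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def comment_authors(docx):
    """Return the sorted, distinct w:author values of all comments.

    `docx` is a path or a binary file-like object (hypothetical helper).
    """
    with ZipFile(docx) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []  # the document has no comments part
        root = ET.fromstring(archive.read("word/comments.xml"))
    authors = {
        comment.get(f"{{{W_NS}}}author", "")
        for comment in root.iter(f"{{{W_NS}}}comment")
    }
    return sorted(authors)
```

Calling `comment_authors('/path/to/your/document.docx')` before and after the rewrite is a quick way to check that no names remain. The script for the copy-and-rewrite approach described above follows.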

import os
import re
from zipfile import ZIP_DEFLATED, ZipFile

docx_file_name = '/path/to/your/document.docx'
comments_part = "word/comments.xml"

# Read every internal file into memory as raw bytes.
files = dict()
with ZipFile(docx_file_name, 'r') as document_as_zip:
    for internal_file in document_as_zip.infolist():
        files[internal_file.filename] = document_as_zip.read(internal_file.filename)

# If there are any comments.
if comments_part in files:
    # Only the comments file is edited. Everything else (images, fonts, ...)
    # stays as untouched bytes, so binary parts are not corrupted by decoding.
    # Change every author to "Unknown Author".
    files[comments_part] = re.sub(
        rb'w:author="[^"]*"', b'w:author="Unknown Author"', files[comments_part]
    )

# Write everything to a temporary archive first, so the original
# document is not lost if something fails halfway through.
temp_file_name = docx_file_name + ".tmp"
with ZipFile(temp_file_name, 'w', ZIP_DEFLATED) as document_as_zip:
    for internal_file_name, data in files.items():
        document_as_zip.writestr(internal_file_name, data)

# Replace the old .docx file with the new one.
os.replace(temp_file_name, docx_file_name)
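
Comment elements written by Word usually carry a `w:initials` attribute next to `w:author`, and it identifies the commenter just as well. As a sketch only (the function name `scrub_comment_identity` and its default replacement values are assumptions, not part of the script above), the same regex approach can be applied in memory to both attributes. It works on raw bytes, so images and other binary parts pass through untouched, and it assumes double-quoted attributes as Word writes them.

```python
import io
import re
from zipfile import ZIP_DEFLATED, ZipFile

COMMENTS_PART = "word/comments.xml"


def scrub_comment_identity(docx_bytes, author="Unknown Author", initials=""):
    """Return a copy of the .docx bytes with comment authors and initials replaced.

    Hypothetical helper: the name and default values are illustrative choices.
    """
    # The replacement values are inserted verbatim, so they must not
    # contain characters that need XML escaping.
    replacements = {
        b"author": author.encode("utf-8"),
        b"initials": initials.encode("utf-8"),
    }

    def replace_value(match):
        name = match.group(1)
        return b'w:' + name + b'="' + replacements[name] + b'"'

    output = io.BytesIO()
    with ZipFile(io.BytesIO(docx_bytes)) as source, \
            ZipFile(output, "w", ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = source.read(info.filename)
            if info.filename == COMMENTS_PART:
                data = re.sub(rb'w:(author|initials)="[^"]*"', replace_value, data)
            target.writestr(info.filename, data)
    return output.getvalue()
```

Because the function takes bytes and returns bytes, the cleaned copy can be written to a separate path, leaving the source document untouched until the result has been checked.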

Jezor answered Jan 08 '23 11:01


The core properties recognised by the CoreProperties class are listed in the official documentation: http://python-docx.readthedocs.io/en/latest/api/document.html#coreproperties-objects

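To overwrite the text properties, set them to an empty string, as you did for the author metadata. Note that created expects a datetime and revision a positive integer, so those two cannot be set to an empty string and need placeholder values instead. last_modified_by is included here because it also holds a person's name:

from datetime import datetime
from docx import Document
document = Document('sampleFile.docx')
core_properties = document.core_properties
meta_fields = ["author", "category", "comments", "content_status", "identifier", "keywords", "language", "last_modified_by", "subject", "title", "version"]
for meta_field in meta_fields:
    setattr(core_properties, meta_field, "")
# Placeholder values: any datetime and any positive integer will do.
core_properties.created = datetime(1970, 1, 1)
core_properties.revision = 1
document.save('new-filename.docx')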

marcanuy answered Jan 08 '23 11:01