I want to remove all the personal information from the comments inside a Word file.
Removing the author's name is fine; I did that using the following:
from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
core_properties.author = ""
document.save('new-filename.docx')
But this is not what I need: I want to remove the name of every person who commented inside that Word file.
The way we do it manually is by going into Preferences -> Security -> Remove personal information from this file on save.
If you want to remove personal information from the comments in a .docx file, you'll have to dive deep into the file itself.
A .docx is just a .zip archive with Word-specific files. We need to overwrite some of its internal files, and the easiest way I could find is to copy all the files into memory, change whatever needs changing, and write everything to a new file.
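To see the archive structure for yourself, a small sketch like the one below lists the parts stored inside a .docx. The helper name `list_docx_parts` is made up for illustration; it relies only on the standard-library `zipfile` module.

```python
from zipfile import ZipFile


def list_docx_parts(docx):
    """Return the names of the files stored inside a .docx archive.

    `docx` may be a file path or a binary file-like object.
    """
    with ZipFile(docx, 'r') as archive:
        return archive.namelist()
```

Calling `list_docx_parts('/path/to/your/document.docx')` on a document that contains comments should return a list that includes `word/comments.xml`, alongside parts such as `word/document.xml` and `docProps/core.xml`.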
import re
import os
from zipfile import ZipFile

docx_file_name = '/path/to/your/document.docx'
files = dict()

# We read all of the files and store them in the "files" dictionary.
# Contents are kept as raw bytes, so binary parts such as images survive untouched.
with ZipFile(docx_file_name, 'r') as document_as_zip:
    for internal_file in document_as_zip.infolist():
        files[internal_file.filename] = document_as_zip.read(internal_file.filename)

# If there are any comments.
if "word/comments.xml" in files:
    # Change every author to "Unknown Author".
    # The pattern and the replacement are bytes because the contents are bytes.
    files["word/comments.xml"] = re.sub(
        rb'w:author="[^"]*"',
        b'w:author="Unknown Author"',
        files["word/comments.xml"],
    )

# Remove the old .docx file.
os.remove(docx_file_name)

# Now we write all the files, edited or not, to the new archive.
with ZipFile(docx_file_name, 'w') as document_as_zip:
    for internal_file_name, data in files.items():
        document_as_zip.writestr(internal_file_name, data)
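Each `w:comment` element can also carry a `w:initials` attribute next to `w:author`, so the commenter's initials may remain after the author name is replaced. The sketch below handles both attributes in one pass over the raw bytes of `word/comments.xml`. The function name `anonymize_comments` and its default replacement values are assumptions for illustration, not part of any library.

```python
import re

# Attributes on <w:comment> that identify the commenter.
_PERSONAL_ATTRS = re.compile(rb'(w:(?:author|initials))="[^"]*"')


def anonymize_comments(xml_bytes, author=b"Unknown Author", initials=b""):
    """Return comments.xml content with author names and initials replaced."""

    def replace(match):
        name = match.group(1)
        value = author if name == b"w:author" else initials
        return name + b'="' + value + b'"'

    return _PERSONAL_ATTRS.sub(replace, xml_bytes)
```

In the script above, this could stand in for the `re.sub` call, for example `files["word/comments.xml"] = anonymize_comments(files["word/comments.xml"])`.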
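The core properties recognised by the CoreProperties class are listed in the official documentation: http://python-docx.readthedocs.io/en/latest/api/document.html#coreproperties-objects
To overwrite the text properties you can set them to an empty string, like the one you used to overwrite the author metadata. The created property expects a datetime and revision a positive integer, so python-docx raises a ValueError if either is set to an empty string; give them neutral values instead:
import datetime
from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
# Text properties; last_modified_by holds a person's name as well.
meta_fields = ["author", "category", "comments", "content_status", "identifier", "keywords", "language", "last_modified_by", "subject", "title", "version"]
for meta_field in meta_fields:
    setattr(core_properties, meta_field, "")
# Typed properties need a value of the right type; these are arbitrary placeholders.
core_properties.created = datetime.datetime(1970, 1, 1)
core_properties.revision = 1
document.save('new-filename.docx')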
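To check the result without opening Word, one option is to read `docProps/core.xml` straight from the saved archive. This is only a sketch using the standard library: the helper name `read_core_names` is hypothetical, and it assumes the document has a core properties part (a `KeyError` is raised otherwise).

```python
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# Namespaces used by the core properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_names(docx):
    """Return (creator, last_modified_by) as stored in docProps/core.xml."""
    with ZipFile(docx, 'r') as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    creator = root.findtext("dc:creator", default="", namespaces=NS)
    modifier = root.findtext("cp:lastModifiedBy", default="", namespaces=NS)
    return creator, modifier
```

After the properties have been blanked, `read_core_names('new-filename.docx')` should return two empty strings.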