Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find and replace text in .docx file - Python

I've been doing a lot of searching for a method to find and replace text in a docx file with little luck. I've tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. "The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.").

import zipfile

replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()

My question is what's wrong with this method? I'm pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?

.

Here is a walkthrough of my thought process for everyone else trying to learn this stuff:

Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}).

Step 2) Open the template docx file using the zipfile module.

Step 3) Open a new new docx file with the append access mode.

Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.

Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.

Step 6) Write the xml text string to a new temporary xml file.

Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.

Step 8) Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.

Step 9) Close your template and new docx archives.

Step 10) Open your new docx document and enjoy your replaced text!

--Edit-- Missing closing parentheses ')' on lines 7 and 11

like image 844
cvdellen Avatar asked May 31 '13 23:05

cvdellen


People also ask

How do I find and replace in DOCX?

Go to Home > Replace. Enter the word or phrase you want to replace in Find what.

How do you replace text in text in Python?

Python String | replace() replace() is an inbuilt function in the Python programming language that returns a copy of the string where all occurrences of a substring are replaced with another substring. Parameters : old – old substring you want to replace. new – new substring which would replace the old substring.


1 Answers

Sometimes, Word does strange things. You should try to remove the text and rewrite it in one stroke, eg without editing the text in the middle

Your document is saved in a xml file (usually in word/document.xml for docx, afer unzipping). Sometimes it is possible that your text won't be in one stroke: it is possible that somewhere in the document, they is XXXCLIENT and somewhere else they is NAMEXXX.

Something like this:

<w:t> XXXCLIENT </w:t> ... <w:t> NAMEXXX </w:t>

This happens quite often because of language support : word splits words when he thinks that one word is of one specific language, and may do so between words, that will split the words into multiple tags.

Only problem with your solution is that you have to write everything in one stroke, which isn't the most user-friendly.

I have created a JS Library that uses mustache like tags: {clientName} https://github.com/edi9999/docxgenjs

It works globally the same as your algorithm but won't crash if the content is not in one stroke (when you write {clientName} in Word, the text will usually be splitted: {, clientName, } in the document.

like image 75
edi9999 Avatar answered Oct 01 '22 00:10

edi9999