I need to remove headers and footers in many docx files. I was currently trying using python-docx library, but it doesn't support header and footer in docx document at this time (work in progress).
Is there any way to achieve that in Python?
As I understand, docx is a xml-based format, but I don't know how to use it.
P.S.I have an idea to use lxml or BeautifulSoup to parse xml and replace some parts, but it looks dirty
UPD. Thanks to Shawn, for a good start point. I was made some changes to script. This is my final version (it's usefull for me, because I need to edit many .docx files. I'm using BeautifulSoup, because standart xml parser can't get a valid xml-tree. Also, my docx documents doesn't have header and footer in xml. They just placed the header's and footer's images in a top of page. Also, for more speed you can use lxml instead of Soup.
import zipfile
import shutil as su
import os
import tempfile
from bs4 import BeautifulSoup
def get_xml_from_docx(docx_filename):
"""
Return content of document.xml file inside docx document
"""
with zipfile.ZipFile(docx_filename) as zf:
xml_info = zf.read('word/document.xml')
return xml_info
def write_and_close_docx(self, edited_xml, output_filename):
""" Create a temp directory, expand the original docx zip.
Write the modified xml to word/document.xml
Zip it up as the new docx
"""
tmp_dir = tempfile.mkdtemp()
with zipfile.ZipFile(self) as zf:
zf.extractall(tmp_dir)
with open(os.path.join(tmp_dir, 'word/document.xml'), 'w') as f:
f.write(str(edited_xml))
# Get a list of all the files in the original docx zipfile
filenames = zf.namelist()
# Now, create the new zip file and add all the filex into the archive
zip_copy_filename = output_filename
docx = zipfile.ZipFile(zip_copy_filename, "w")
for filename in filenames:
docx.write(os.path.join(tmp_dir, filename), filename)
# Clean up the temp dir
su.rmtree(tmp_dir)
if __name__ == '__main__':
directory = 'your_directory/'
files = os.listdir(directory)
for file in files:
if file.endswith('.docx'):
word_doc = directory + file
new_word_doc = 'edited/' + file.rstrip('.docx') + '-edited.docx'
tree = get_xml_from_docx(word_doc)
soup = BeautifulSoup(tree, 'xml')
shapes = soup.find_all('shape')
for shape in shapes:
if 'margin-left:0pt' in shape.get('style'):
shape.parent.decompose()
write_and_close_docx(word_doc, soup, new_word_doc)
So, that's it :) I know, the code isn't clean, sorry for that.
Well, I've never thought about it, but I just created a test.docx with a header and a footer. Once you have that docx, you can unzip
it to get the constituent XML files. For my simple test case this yielded:
word/
_rels footer1.xml styles.xml
document.xml footnotes.xml stylesWithEffects.xml
endnotes.xml header1.xml theme
fontTable.xml settings.xml webSettings.xml
Opening up the word/documents.xml
gives you the main problem area. You can see that there are elements in there with header and footer involved. In my simple case I got:
<w:headerReference w:type="default" r:id="rId7"/>
<w:footerReference w:type="default" r:id="rId8"/>
and
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
All of the doc is actually small, so
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
<w:body>
<w:p w:rsidR="009E6E8F" w:rsidRDefault="009E6E8F"/>
<w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/>
<w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/><w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA">
<w:r>
<w:t>MY BODY</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00B53FFA" w:rsidSect="009E6E8F">
<w:headerReference w:type="default" r:id="rId7"/>
<w:footerReference w:type="default" r:id="rId8"/>
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>"""
So XML manipulation is not going to be a problem, either in function or in performance for something this size. Here is some code that should get your doc into python, parsed as an xml tree, and saved out back as a docx. I have to go out now so this isn't your complete solution, but I think that this should get you well down the path. If you are still having trouble I will return later and see where you are with it.
import zipfile
import shutil as su
import os
import tempfile
import xml.etree.cElementTree
def get_word_xml(docx_filename):
with open(docx_filename, mode='rt') as f:
zip = zipfile.ZipFile(f)
xml_content = zip.read('word/document.xml')
return xml_content
def write_and_close_docx (self, xml_content, output_filename):
""" Create a temp directory, expand the original docx zip.
Write the modified xml to word/document.xml
Zip it up as the new docx
"""
tmp_dir = tempfile.mkdtemp()
self.zipfile.extractall(tmp_dir)
with open(os.path.join(tmp_dir,'word/document.xml'), 'w') as f:
xmlstr = tree.tostring(xml_content, pretty_print=True)
f.write(xmlstr)
# Get a list of all the files in the original docx zipfile
filenames = self.zipfile.namelist()
# Now, create the new zip file and add all the filex into the archive
zip_copy_filename = output_filename
with zipfile.ZipFile(zip_copy_filename, "w") as docx:
for filename in filenames:
docx.write(os.path.join(tmp_dir,filename), filename)
# Clean up the temp dir
su.rmtree(tmp_dir)
def get_xml_tree(f):
return xml.etree.ElementTree.parse(f)
word_doc = 'TEXT.docx'
new_word_doc = 'SLIM.docx'
doc = get_word_xml(word_doc)
tree = get_xml_tree(doc)
write_and_close_docx(word_doc, tree, new_word_doc)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With