
Python - Remove header and footer from docx file

I need to remove headers and footers from many docx files. I'm currently trying the python-docx library, but it doesn't support headers and footers in docx documents at this time (work in progress).

Is there any way to achieve that in Python?

As I understand it, docx is an XML-based format, but I don't know how to work with it.

P.S. I have an idea to use lxml or BeautifulSoup to parse the XML and replace some parts, but it looks dirty.

UPD. Thanks to Shawn for a good starting point. I made some changes to the script. This is my final version (it's useful for me, because I need to edit many .docx files). I'm using BeautifulSoup because the standard XML parser couldn't produce a valid XML tree for my files. Also, my docx documents don't have the header and footer in the XML; the header and footer images were just placed at the top of the page. For more speed you can use lxml instead of Soup.

import zipfile
import shutil as su
import os
import tempfile
from bs4 import BeautifulSoup


def get_xml_from_docx(docx_filename):
    """
        Return content of document.xml file inside docx document
    """
    with zipfile.ZipFile(docx_filename) as zf:
        xml_info = zf.read('word/document.xml')
    return xml_info


def write_and_close_docx(docx_filename, edited_xml, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()

    with zipfile.ZipFile(docx_filename) as zf:
        zf.extractall(tmp_dir)
        # Get a list of all the files in the original docx zipfile
        filenames = zf.namelist()

    # docx parts are UTF-8, so write the edited XML with an explicit encoding
    with open(os.path.join(tmp_dir, 'word', 'document.xml'), 'w', encoding='utf-8') as f:
        f.write(str(edited_xml))

    # Now, create the new zip file and add all the files into the archive;
    # the with-block closes the archive so it is written out completely
    with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
        for filename in filenames:
            docx.write(os.path.join(tmp_dir, filename), filename)

    # Clean up the temp dir
    su.rmtree(tmp_dir)


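if __name__ == '__main__':
    directory = 'your_directory/'
    output_directory = 'edited/'
    os.makedirs(output_directory, exist_ok=True)
    for file in os.listdir(directory):
        if file.endswith('.docx'):
            word_doc = os.path.join(directory, file)
            # splitext drops only the extension; rstrip('.docx') would also eat
            # trailing 'd', 'o', 'c', 'x' characters of the name itself
            base_name = os.path.splitext(file)[0]
            new_word_doc = os.path.join(output_directory, base_name + '-edited.docx')
            tree = get_xml_from_docx(word_doc)
            soup = BeautifulSoup(tree, 'xml')
            # In my documents the header/footer images are shapes with margin-left:0pt
            for shape in soup.find_all('shape'):
                # Some shapes have no style attribute, so fall back to ''
                if 'margin-left:0pt' in (shape.get('style') or ''):
                    shape.parent.decompose()
            write_and_close_docx(word_doc, soup, new_word_doc)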

So, that's it :) I know, the code isn't clean, sorry for that.

asked Aug 22 '15 17:08 by drjackild


1 Answer

Well, I've never thought about it, but I just created a test.docx with a header and a footer. Once you have that docx, you can unzip it to get the constituent XML files. For my simple test case this yielded:

word/
_rels           footer1.xml     styles.xml
document.xml        footnotes.xml       stylesWithEffects.xml
endnotes.xml        header1.xml     theme
fontTable.xml       settings.xml        webSettings.xml
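
If you would rather not unzip by hand, the same listing can be had from Python. This is only a minimal sketch: the helper names `list_docx_parts` and `header_footer_parts` are made up for illustration, and the tiny in-memory archive is a stand-in so the snippet runs without a real docx.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the sorted member names of a docx (zip) archive."""
    with zipfile.ZipFile(docx_file) as zf:
        return sorted(zf.namelist())


def header_footer_parts(names):
    """Pick out members named like word/headerN.xml or word/footerN.xml."""
    return [n for n in names
            if n.startswith(("word/header", "word/footer")) and n.endswith(".xml")]


# Stand-in archive built in memory (hypothetical content, not a valid docx)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("word/document.xml", "word/header1.xml",
                 "word/footer1.xml", "word/styles.xml"):
        zf.writestr(name, "<x/>")
buf.seek(0)

parts = list_docx_parts(buf)
hf_parts = header_footer_parts(parts)
```

With a real file you would pass its path instead of `buf`.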

Opening up word/document.xml shows you the main problem area. You can see that there are elements in there that involve the header and footer. In my simple case I got:

<w:headerReference w:type="default" r:id="rId7"/>
<w:footerReference w:type="default" r:id="rId8"/>

and

<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
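
The two reference elements are what point the section at header1.xml and footer1.xml, so dropping them from each w:sectPr should leave the section without a header or footer. Here is a rough sketch of that step with the standard library; the function name `strip_header_footer_refs` and the cut-down sample XML are assumptions for illustration, not something taken from your files.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def strip_header_footer_refs(root):
    """Remove w:headerReference / w:footerReference from every w:sectPr.

    Returns the number of elements removed.
    """
    targets = {f"{{{W_NS}}}headerReference", f"{{{W_NS}}}footerReference"}
    removed = 0
    # Materialize the iterator first because the tree is mutated in the loop
    for sect_pr in list(root.iter(f"{{{W_NS}}}sectPr")):
        for child in list(sect_pr):
            if child.tag in targets:
                sect_pr.remove(child)
                removed += 1
    return removed


# Cut-down stand-in for word/document.xml (hypothetical sample)
SAMPLE = f"""<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">
  <w:body>
    <w:p><w:r><w:t>MY BODY</w:t></w:r></w:p>
    <w:sectPr>
      <w:headerReference w:type="default" r:id="rId7"/>
      <w:footerReference w:type="default" r:id="rId8"/>
      <w:pgSz w:w="12240" w:h="15840"/>
    </w:sectPr>
  </w:body>
</w:document>"""

root = ET.fromstring(SAMPLE)
removed_count = strip_header_footer_refs(root)
sect_pr = root.find(f".//{{{W_NS}}}sectPr")
remaining = [child.tag.rsplit("}", 1)[1] for child in sect_pr]
```

One caveat: when ElementTree serializes the tree again it renames namespace prefixes (ns0, ns1, ...) unless they are registered, so for real documents a parser that keeps the original prefixes, such as lxml, may be the safer choice.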

The whole document is actually small, so here it is:

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
<w:body>
  <w:p w:rsidR="009E6E8F" w:rsidRDefault="009E6E8F"/>
  <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/>
  <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/><w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA">
  <w:r>
  <w:t>MY BODY</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="00B53FFA" w:rsidSect="009E6E8F">
  <w:headerReference w:type="default" r:id="rId7"/> 
  <w:footerReference w:type="default" r:id="rId8"/>
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>

So XML manipulation is not going to be a problem, either in function or in performance, for something this size. Here is some code that should get your doc into Python, parsed as an XML tree, and saved back out as a docx. I have to go out now, so this isn't a complete solution, but I think it should get you well down the path. If you are still having trouble I will return later and see where you are with it.

import os
import shutil as su
import tempfile
import zipfile
import xml.etree.ElementTree as ET


def get_word_xml(docx_filename):
    # A docx is a zip archive; ZipFile opens it in binary mode by itself
    with zipfile.ZipFile(docx_filename) as zf:
        xml_content = zf.read('word/document.xml')
    return xml_content


def get_xml_tree(xml_content):
    # xml_content is bytes, so parse it with fromstring() rather than parse()
    return ET.ElementTree(ET.fromstring(xml_content))


def write_and_close_docx(docx_filename, tree, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()

    with zipfile.ZipFile(docx_filename) as zf:
        zf.extractall(tmp_dir)
        # Get a list of all the files in the original docx zipfile
        filenames = zf.namelist()

    # Caveat: ElementTree rewrites namespace prefixes (ns0, ns1, ...) on output;
    # lxml keeps the original ones
    tree.write(os.path.join(tmp_dir, 'word', 'document.xml'),
               encoding='UTF-8', xml_declaration=True)

    # Now, create the new zip file and add all the files into the archive
    with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
        for filename in filenames:
            docx.write(os.path.join(tmp_dir, filename), filename)

    # Clean up the temp dir
    su.rmtree(tmp_dir)


word_doc = 'TEXT.docx'
new_word_doc = 'SLIM.docx'
doc = get_word_xml(word_doc)
tree = get_xml_tree(doc)
write_and_close_docx(word_doc, tree, new_word_doc)
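
If the temp directory feels heavy, one possible alternative is to copy the archive member by member and swap in the edited XML on the way. This is a sketch under assumptions, not a tested replacement for the code above: `replace_docx_member` is a made-up helper name, and the in-memory archives are stand-ins so the snippet runs without real files.

```python
import io
import zipfile


def replace_docx_member(src, dst, member, new_bytes):
    """Copy every member of zip `src` into `dst`, using new_bytes for `member`."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.filename == member:
                data = new_bytes
            else:
                data = zin.read(info.filename)
            # Reusing the ZipInfo keeps the member name and timestamp
            zout.writestr(info, data)


# Stand-in source archive (hypothetical content, not a valid docx)
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as zf:
    zf.writestr("word/document.xml", "<old/>")
    zf.writestr("word/styles.xml", "<styles/>")

dst = io.BytesIO()
replace_docx_member(src, dst, "word/document.xml", b"<new/>")

with zipfile.ZipFile(dst) as zf:
    result = {name: zf.read(name) for name in zf.namelist()}
```

With real documents, `src` and `dst` would be file paths and `new_bytes` the serialized, edited document.xml.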
answered Sep 21 '22 18:09 by Shawn Mehan