Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I view the XML produced by the python-docx package

For the purposes of unit testing, I want to check that the XML produced for a Word paragraph is what I expect when I parse an HTML paragraph.

How do I extract the XML itself instead of writing to a file, unzipping the file, and re-reading the word/document.xml file it contains?

e.g.

from docx import Document
import bs4

def add_parsed_html_to_paragraph(p, s):
    soup = bs4.BeautifulSoup(s)
    para = soup.find('p')
    for e in para.children:
        if type(e) == bs4.element.NavigableString:
            r = p.add_run(str(e))
        else:
            r = p.add_run(e.text)
        if e.name == 'sub':
            r.font.subscript = True
        elif e.name == 'sup':
            r.font.superscript = True


title = 'A formula: H<sub>2</sub>O.'

document = Document()
p = document.add_paragraph()
add_parsed_html_to_paragraph(p, title)

# ... Now I want to check p or document for the correct XML
like image 368
xnx Avatar asked May 29 '19 09:05

xnx


1 Answers

Each so-called oxml element object in python-docx has an .xml property for precisely this use case. It's used for the internal unit tests.

All you need is access to the internal variable used for the XML element, which is generally available by clicking the [source] link next to that object in the docs, like here: https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects

Clicking through that link, you can find that for a paragraph, the underlying XML element is available on ._p. Usually it's the tagname of the element without the namespace prefix, although sometimes its the generic ._element. This latter one is a good one to try in a pinch if you need to guess.

So using it is as simple as:

>>> paragraph._p.xml
<w:p>
  <w:pPr>
    <w:jc w:val="right"/>
  </w:pPr>
  <w:r>
    <w:t>Right-aligned</w:t>
  </w:r>
</w:p>

There is a companion domain-specific language (DSL) in the unit-test utilities called CXML (compact XML) which allows you to take care of namespacing, which is otherwise a big pain. It looks something like this:

expected_xml = cxml.xml('w:p(w:pPr/w:jc{w:val=right},w:r/w:t"Right-aligned")')

You can see examples throughout the unit tests like here: https://github.com/python-openxml/python-docx/blob/master/tests/text/test_paragraph.py#L113 and ask more specific questions here with the "python-docx" tag if you need help.

like image 136
scanny Avatar answered Sep 21 '22 06:09

scanny