For the purposes of unit testing, I want to check that the XML produced for a Word paragraph is what I expect when I parse an HTML paragraph.
How do I extract the XML itself instead of writing to a file, unzipping the file, and re-reading the word/document.xml file it contains?
e.g.
from docx import Document
import bs4

def add_parsed_html_to_paragraph(p, s):
    soup = bs4.BeautifulSoup(s)
    para = soup.find('p')
    for e in para.children:
        if type(e) == bs4.element.NavigableString:
            r = p.add_run(str(e))
        else:
            r = p.add_run(e.text)
            if e.name == 'sub':
                r.font.subscript = True
            elif e.name == 'sup':
                r.font.superscript = True

title = 'A formula: H<sub>2</sub>O.'
document = Document()
p = document.add_paragraph()
add_parsed_html_to_paragraph(p, title)
# ... Now I want to check p or document for the correct XML
Each so-called oxml element object in python-docx has an .xml property for precisely this use case. It's used for the internal unit tests.
All you need is access to the internal variable that holds the XML element. You can generally find its name by clicking the [source] link next to that object in the docs, like here: https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects
Clicking through that link, you can see that for a paragraph, the underlying XML element is available on ._p. Usually the attribute is named after the element's tag name without the namespace prefix, although sometimes it's the generic ._element. The latter is a good one to try in a pinch if you need to guess.
So using it is as simple as:
>>> paragraph._p.xml
<w:p>
  <w:pPr>
    <w:jc w:val="right"/>
  </w:pPr>
  <w:r>
    <w:t>Right-aligned</w:t>
  </w:r>
</w:p>
There is a companion domain-specific language (DSL) in the unit-test utilities called CXML (compact XML) which allows you to take care of namespacing, which is otherwise a big pain. It looks something like this:
expected_xml = cxml.xml('w:p(w:pPr/w:jc{w:val=right},w:r/w:t"Right-aligned")')
You can see examples throughout the unit tests like here: https://github.com/python-openxml/python-docx/blob/master/tests/text/test_paragraph.py#L113 and ask more specific questions here with the "python-docx" tag if you need help.
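CXML lives in the python-docx test suite rather than in the installed package, so it may not be importable from your own project. If that is the case, one possible fallback (a sketch, not something python-docx provides) is to write the expected XML by hand with the namespace declared and compare canonical forms using the standard library. The helper name canonical and the sample strings below are made up for illustration; actual stands in for what paragraph._p.xml would return.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def canonical(xml_text):
    """Return the C14N 2.0 form of xml_text with surrounding whitespace stripped.

    Caveat: strip_text=True also strips leading/trailing spaces inside
    <w:t> text, so do not rely on it where that whitespace matters.
    """
    return ET.canonicalize(xml_text, strip_text=True)

# Stand-in for the pretty-printed string returned by paragraph._p.xml.
actual = (
    f'<w:p xmlns:w="{W_NS}">\n'
    "  <w:r>\n"
    "    <w:t>Right-aligned</w:t>\n"
    "  </w:r>\n"
    "</w:p>\n"
)

# Hand-written expectation, with no indentation at all.
expected = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>Right-aligned</w:t></w:r></w:p>'

same = canonical(actual) == canonical(expected)
```

Comparing canonical forms means differences in indentation and line breaks do not cause false failures, while a missing or extra element still does.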