I'm trying to pull apart a Word document that looks like this:
1.0 List item
1.1 List item
1.2 List item
2.0 List item
It is stored as a .docx file, and I'm using python-docx to try to parse it. Unfortunately, the parsed text loses all the numbering at the start of each item. I'm trying to identify the start of each ordered list item.
The python-docx library also allows me to access styles, but I cannot figure out how to determine whether the style is a list style or not.
So far I've been messing around with a function and checking its output, but the general shape is something like:
    for p in doc.paragraphs:
        # walk up the style inheritance chain, printing each style name
        s = p.style
        while s.base_style is not None:
            print(s.name)
            s = s.base_style
        print(s.name)
I've been using this to search up through the custom styles, but they all end at "Normal," as opposed to "ListNumber."
I've tried searching styles under the document, the paragraphs, and the runs without luck. I've also tried searching p.text, but as previously mentioned the numbering does not persist.
List items can be implemented in the XML in a variety of ways. Unfortunately, the most common way, adding list items using the toolbar (as opposed to using styles), is also probably the most complex.
Your best bet is to start by using opc-diag to look at the XML inside document.xml, and then formulate a strategy from there.
The list-handling API for python-docx hasn't really been implemented yet, so you'll need to operate at the lxml level if you want to get this done with today's version.
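As a rough illustration of what working at the XML level can look like, here is a minimal sketch that uses only the standard library (zipfile and ElementTree) instead of python-docx and lxml. It assumes the list items were created with the toolbar, so each paragraph carries its own w:numPr element inside w:pPr; paragraphs that get their numbering only through a style will not be found this way. The function names numbered_paragraphs and numbered_paragraphs_in_docx are made up for this example, and the sketch only reports which paragraphs are list items (with their numId and level); it does not rebuild the "1.0", "1.1" labels.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element):
    """Return the w:val attribute of an element, or None if the element is missing."""
    return element.get("{%s}val" % W) if element is not None else None


def numbered_paragraphs(document_xml):
    """Return (numId, ilvl, text) for each paragraph with a direct w:numPr.

    Hypothetical helper: document_xml is the content of word/document.xml.
    """
    root = ET.fromstring(document_xml)
    items = []
    for p in root.iter("{%s}p" % W):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # not a directly numbered paragraph
        # concatenate the text of all runs in the paragraph
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        items.append((
            _val(num_pr.find("w:numId", NS)),
            _val(num_pr.find("w:ilvl", NS)),
            text,
        ))
    return items


def numbered_paragraphs_in_docx(path):
    """Hypothetical convenience wrapper: read document.xml out of a .docx file."""
    with zipfile.ZipFile(path) as z:
        return numbered_paragraphs(z.read("word/document.xml"))
```

Calling numbered_paragraphs_in_docx on your file should give one tuple per list paragraph; the ilvl value is the nesting level, which you could use to tell "1.0"-style items from "1.1"-style items once you have confirmed in opc-diag that your document is structured this way.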