
How to find a list in docx using python?

I'm trying to pull apart a Word document that looks like this:

1.0 List item
1.1 List item
1.2 List item
2.0 List item

It is stored as a .docx file, and I'm using python-docx to parse it. Unfortunately, the parsed text loses all the numbering at the start of each item. I'm trying to identify the start of each ordered list item.

The python-docx library also allows me to access styles, but I cannot figure out how to determine whether the style is a list style or not.

So far I've been experimenting with a function and checking its output; the basic pattern is something like:

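    from docx import Document

    doc = Document("example.docx")  # placeholder path
    for p in doc.paragraphs:
        # Walk up the style inheritance chain, printing each style name.
        s = p.style
        while s is not None:
            print(s.name)
            s = s.base_style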

I've been using this to walk up through the custom styles, but they all end at "Normal" rather than "ListNumber".

I've tried searching styles under the document, the paragraphs, and the runs without luck. I've also tried searching p.text, but as previously mentioned the numbering does not persist.

SeanVDH asked Jun 01 '15 20:06


1 Answer

List items can be implemented in the XML in a variety of ways. Unfortunately, the most common way, adding list items using the toolbar (as opposed to using styles), is also probably the most complex.

Your best bet is to start by using opc-diag to look at the XML inside document.xml, then formulate a strategy from there.
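
If opc-diag isn't at hand, the same XML can be inspected with the standard library alone, since a .docx file is a zip archive. This is only a rough sketch of that alternative, not how opc-diag works; the helper name `dump_document_xml` is made up for illustration.

```python
import zipfile
import xml.dom.minidom


def dump_document_xml(docx_path):
    """Return the main document part of a .docx as indented XML text."""
    # A .docx file is a zip archive; the document body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as package:
        raw = package.read("word/document.xml")
    # Re-indent the XML so the paragraph properties are easier to scan by eye.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

Printing the result for the document in question and searching for `w:numPr` or `w:pStyle` inside the `w:pPr` elements should show which mechanism the list items use.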

The list-handling API for python-docx hasn't really been implemented yet, so you'll need to operate at the lxml level if you want to get this done with today's version.
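
As a rough illustration of working at the XML level, the sketch below looks for paragraphs that carry a direct `w:numPr` element, which is what toolbar-applied numbering typically produces. It deliberately uses the standard library's `xml.etree.ElementTree` on the raw document.xml rather than python-docx or lxml, so it has no dependencies; the function name `numbered_paragraphs` is hypothetical. It assumes the numbering is set on the paragraph itself, so paragraphs that get their numbering only through a style are not detected.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, shared by all w: elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def numbered_paragraphs(document_xml):
    """Yield (numId, ilvl, text) for paragraphs with direct numbering."""
    root = ET.fromstring(document_xml)
    for p in root.iter("{%s}p" % W):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # no direct numbering on this paragraph
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        yield (
            num_id.get("{%s}val" % W) if num_id is not None else None,
            ilvl.get("{%s}val" % W) if ilvl is not None else None,
            text,
        )
```

The function takes the contents of word/document.xml, for example as read from the .docx with `zipfile`. The `numId` and `ilvl` values only say which list definition and which indent level a paragraph belongs to; the visible label such as "1.1" is not stored in the paragraph and would have to be reconstructed from the definitions in numbering.xml.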

scanny answered Oct 09 '22 19:10