Is there any way to read a .docx file, including auto numbering, using python-docx?

Problem statement: Extract sections from a .docx file, including the autonumbering.

I tried python-docx to extract text from the .docx file, but it excludes the autonumbering.

from docx import Document

document = Document("wadali.docx")


# Style-name prefixes of the paragraphs to keep.
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')


def iter_items(paragraphs):
    """Yield each paragraph whose style name starts with a wanted prefix."""
    for paragraph in paragraphs:
        # str.startswith accepts a tuple, so one test covers every prefix
        # and each matching paragraph is yielded only once.
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph


for item in iter_items(document.paragraphs):
    print(item.text)

asked Aug 30 '18 09:08

wadali

2 Answers

It appears that python-docx v0.8 does not currently fully support numbering, so you need to do some hacking.

First, for the demo, to iterate over all the document paragraphs (including those nested in tables, which document.paragraphs skips), you need to write your own iterator. Here is something functional:

import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph


def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, descending
    into the cells of nested tables when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph

You can use it to find all document paragraphs including paragraphs in table cells.

For instance:

import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)

To access the numbering property, you need to search in the "protected" members paragraph._p.pPr.numPr, which is a docx.oxml.numbering.CT_NumPr object:

for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no <w:pPr> element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
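The numPr element itself only carries two references: the numbering instance (numId) and the list level (ilvl). As a minimal sketch, the helper below pulls both out of a paragraph. The name numbering_ref is hypothetical, it relies on the same "protected" members as above, and it assumes that a missing ilvl means level 0 and that numId 0 means "numbering removed". Numbering inherited from the paragraph style is not looked up here.

```python
def numbering_ref(paragraph):
    """Return (num_id, ilvl) for a directly numbered paragraph, else None.

    Hypothetical helper: only numbering set on the paragraph itself is
    detected, not numbering inherited from its style.
    """
    p_pr = paragraph._p.pPr
    if p_pr is None or p_pr.numPr is None:
        return None
    num_pr = p_pr.numPr
    if num_pr.numId is None or num_pr.numId.val == 0:
        # Assumption: numId 0 marks a paragraph whose numbering was removed.
        return None
    # Assumption: a missing <w:ilvl> is treated as the top level (0).
    ilvl = num_pr.ilvl.val if num_pr.ilvl is not None else 0
    return num_pr.numId.val, ilvl
```

Calling numbering_ref(paragraph) on each result of iter_paragraphs(document) gives either None or a pair such as (3, 1), which can then be matched against the definitions in numbering.xml.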

Note that the numPr element only references (through its numId and ilvl values) a numbering definition stored in the numbering.xml file (inside the docx), if it exists.

To access those definitions, you need to read your docx file like a package. For instance:

import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber
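Word does not store the visible numbers anywhere; they have to be recomputed by counting. The sketch below shows one way to do that counting from a sequence of (num_id, ilvl) pairs, with None standing for an unnumbered paragraph. The function name number_labels is hypothetical, and the sketch assumes plain decimal numbering that starts at 1 and is joined with dots. It ignores the number format, level text, start values and overrides defined in numbering.xml, and it does not handle several numId values sharing one abstract definition.

```python
def number_labels(refs):
    """Yield a label such as '2.1' for each (num_id, ilvl) pair in *refs*.

    Hypothetical helper: assumes decimal numbering starting at 1.
    A None entry (unnumbered paragraph) yields an empty string.
    """
    state = {}  # num_id -> list of counters, one per level
    for ref in refs:
        if ref is None:
            yield ""
            continue
        num_id, ilvl = ref
        counters = state.setdefault(num_id, [])
        del counters[ilvl + 1:]        # deeper levels restart
        while len(counters) <= ilvl:   # open any level not seen yet
            counters.append(0)
        counters[ilvl] += 1
        # Levels that were skipped are shown as 1 in this sketch.
        yield ".".join(str(max(count, 1)) for count in counters)
```

For the sequence (1, 0), (1, 1), (1, 1), None, (1, 0), (2, 0) this yields "1", "1.1", "1.2", "", "2" and "1". To produce the labels Word really displays, the counters would still need to be formatted with the level definitions found in numbering.xml.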

More information is available in the Office Open XML documentation.

answered Sep 28 '22 11:09

Laurent LAPORTE

There is a package, docx2python, which does this in a much simpler fashion: pypi.org/project/docx2python/

The following code:

from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)

produces a nested list which contains the contents, including bullet lists, in a nice parseable fashion.
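Because the result is nested, a small flattening step is often handy before further parsing. The sketch below walks any depth of nested lists and yields the non-empty strings; iter_text is a hypothetical helper name, and the assumption (to be checked against the docx2python documentation) is that document.body is made of lists nested by table, row, cell and paragraph.

```python
def iter_text(nested):
    """Yield every non-blank string found in arbitrarily nested lists."""
    for item in nested:
        if isinstance(item, str):
            if item.strip():  # skip empty or whitespace-only paragraphs
                yield item
        else:
            yield from iter_text(item)
```

With the document object from the code above, looping over iter_text(document.body) and printing each item would give one line per non-empty paragraph.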

answered Sep 28 '22 11:09

Elliott
