
Processing objects in order in docx

I want to process objects in the order they appear in a Word document. The objects I have encountered are paragraphs, text in paragraphs, runs in paragraphs, text in runs, tables, and paragraphs in a table's cells. So far I have two useful programs. The first goes through the document's paragraphs and collects the text of each paragraph, stored in a list indexed by [paragraph #]. It can also gather the text from runs, stored in a 2D list indexed by [paragraph #][run #], but I have not found the runs more useful than the whole text of the paragraph. My second program goes through the whole document and finds tables. When it finds a table, it goes through it by row, then cell, then the paragraphs in each cell.

These seem like great building blocks for my goal. I would like to gather text in order, as if a blinking text cursor were being moved by a person holding down the right arrow key. As the cursor moves over objects, it stores them under several indexes that record each object's number and type.

Say I have the subfunctions paragraph_read and table_read. Say the document has this order of objects: paragraph, paragraph, table, paragraph. I'd like to go through these and call my subfunctions in this order: paragraph_read, paragraph_read, table_read, paragraph_read.

I would like to know if my program can move through a document object by object like a cursor swiping right.

Help is greatly appreciated. Thanks.

-Chris

Asked by Chris, Jun 10 '26 08:06



1 Answer

UPDATE

There are some new methods in python-docx that take care of much of the detail here:

Document.iter_inner_content() - provides access to the Paragraph and Table objects in a document, in document order:

for block_item in document.iter_inner_content():
    if isinstance(block_item, Paragraph):
        ...  # process paragraph
    elif isinstance(block_item, Table):
        ...  # process table

A table cell is also a block-item container and has the same method. This allows recursing into tables if you want that.
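If you do want to recurse into tables, something like the following might work. It is only a minimal sketch and not part of python-docx: `walk_blocks` is a name made up for illustration, and it assumes that tables are the only blocks with a `rows` attribute, so it checks for that attribute instead of using isinstance().

```python
def walk_blocks(container, depth=0):
    """Yield (depth, block) for every block under *container*, in document order.

    *container* is anything with an ``iter_inner_content()`` method, such as a
    document or a table cell. A block that has a ``rows`` attribute is treated
    as a table, and the contents of its cells are walked recursively.
    """
    for block in container.iter_inner_content():
        yield depth, block
        if hasattr(block, "rows"):  # assumption: only tables expose .rows
            for row in block.rows:
                for cell in row.cells:
                    # A cell is itself a block-item container, so recurse.
                    yield from walk_blocks(cell, depth + 1)


# Usage, assuming `document` is an open python-docx document:
# for depth, block in walk_blocks(document):
#     print("  " * depth + type(block).__name__)
```

One caveat: with merged cells, `row.cells` can return the same cell more than once, so the content of a merged cell may be visited repeatedly unless you de-duplicate.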

Header and Footer objects are also block-item containers and have this method.

A Section is not a block-item container per se, but it does have this method for when you want to iterate through the document section by section.
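Going section by section could look roughly like this. Again a sketch under assumptions: `blocks_by_section` is a hypothetical helper name, and it relies only on `document.sections` and the `iter_inner_content()` method of each section described above.

```python
def blocks_by_section(document):
    """Return [(section_index, block), ...] in document order.

    Each block is tagged with the zero-based index of the section it
    belongs to, so the caller can tell where one section ends and the
    next begins.
    """
    records = []
    for section_index, section in enumerate(document.sections):
        for block in section.iter_inner_content():
            records.append((section_index, block))
    return records
```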


If your version of python-docx does not have these methods, you need to add this function to your code somewhere convenient:

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document
    order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError(f"unsupported parent type: {type(parent).__name__}")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

Then you use it like this:

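import docx

# docx.Document() opens the file; the Document class imported above is only for isinstance().
document = docx.Document('my_document.docx')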

for block_item in iter_block_items(document):
    if isinstance(block_item, Paragraph):
        do_paragraph_thing(paragraph=block_item)
    elif isinstance(block_item, Table):
        do_table_thing(table=block_item)
    else:
        # Only reached on an unforeseen error. Raise here, or replace this
        # with `pass` if you would rather ignore unexpected items.
        raise TypeError(f"unexpected block item: {type(block_item).__name__}")
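To get the numbered, typed records the question asks for, the loop above could feed something like the following. This is a sketch, not a definitive implementation: `read_in_order` is a hypothetical name, the record layout (kind, number, content) is my own choice, and it tells tables from paragraphs by the `rows` attribute rather than isinstance() so that it needs no imports.

```python
def read_in_order(blocks):
    """Return a list of (kind, number, content) records in document order.

    *blocks* is any iterable of paragraph and table objects, for example
    ``iter_block_items(document)``. ``number`` counts per kind, so the first
    table is ("table", 0, ...) no matter how many paragraphs precede it.
    """
    counts = {"paragraph": 0, "table": 0}
    records = []
    for block in blocks:
        if hasattr(block, "rows"):  # assumption: only tables expose .rows
            kind = "table"
            # One list of cell texts per row.
            content = [[cell.text for cell in row.cells] for row in block.rows]
        else:
            kind = "paragraph"
            content = block.text
        records.append((kind, counts[kind], content))
        counts[kind] += 1
    return records


# Usage: records = read_in_order(iter_block_items(document))
```

The two branches play the roles of table_read and paragraph_read from the question, and the position of each record in the list preserves the overall document order.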
Answered by scanny, Jun 11 '26 23:06
