
How to iterate over everything in a python-docx document?

I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...

for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?

...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?

Thanks!

thebitguru asked Aug 05 '14 03:08



2 Answers

There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.

python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you: https://github.com/python-openxml/python-docx/issues/40

There is no exact counterpart that I know of for inline items, but I expect you could get pretty far with paragraph.runs, since all inline content lives within a paragraph. If you get most of the way there and are only stuck on pictures or the like, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide some similar code to get you what you need.

This requirement comes up from time to time so we'll definitely want to add it at some point.
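As a rough illustration of the "go down to the XML level" idea for inline items, here is a minimal sketch that walks one paragraph element and reports text and picture references in order. It is not a python-docx feature: the function name iter_inline_items and the SAMPLE fragment are made up for this example, and the sample drawing is heavily trimmed compared with what Word actually writes. It only relies on element iteration that both the standard library's ElementTree and lxml support.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML, DrawingML and relationships.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

RUN = f"{{{W}}}r"
TEXT = f"{{{W}}}t"
DRAWING = f"{{{W}}}drawing"
BLIP = f"{{{A}}}blip"
EMBED = f"{{{R_NS}}}embed"


def iter_inline_items(p):
    """Yield ("text", str) or ("image", relationship_id) in paragraph order.

    Hypothetical helper: `p` is a <w:p> element. Runs nested inside
    wrappers such as <w:hyperlink> are included because iter() walks
    all descendants.
    """
    for run in p.iter(RUN):
        for child in run:
            if child.tag == TEXT:
                yield "text", child.text or ""
            elif child.tag == DRAWING:
                # The picture itself lives in a separate part; the blip
                # only carries the relationship id that points to it.
                for blip in child.iter(BLIP):
                    yield "image", blip.get(EMBED)


# Trimmed, hand-written sample; real drawings have more nesting levels.
SAMPLE = f"""<w:p xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R_NS}">
  <w:r><w:t>Hello </w:t></w:r>
  <w:r><w:drawing><a:graphic><a:graphicData>
    <a:blip r:embed="rId5"/>
  </a:graphicData></a:graphic></w:drawing></w:r>
  <w:hyperlink><w:r><w:t>world</w:t></w:r></w:hyperlink>
</w:p>"""

items = list(iter_inline_items(ET.fromstring(SAMPLE)))
print(items)
```

With python-docx, the same function should accept the lxml element behind a paragraph (the private attribute paragraph._p). The relationship id can then probably be resolved to the image bytes through doc.part.related_parts, but treat that as something to verify against the version you have installed.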

Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
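The recursion described above can be sketched at the XML level as follows. This is only an illustration under assumptions, not the code from the linked issue: iter_blocks and the SAMPLE body are invented names, the sample is trimmed (a real cell would end with a paragraph), and anything other than paragraphs and tables is simply skipped.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, TBL, TR, TC = (f"{{{W}}}{name}" for name in ("p", "tbl", "tr", "tc"))


def iter_blocks(container, depth=0):
    """Yield (depth, kind, element) for block items in document order.

    Hypothetical helper: `container` is a <w:body> or <w:tc> element.
    Tables are yielded first, then the contents of each of their cells,
    so nested tables are reached by recursion.
    """
    for child in container:
        if child.tag == P:
            yield depth, "paragraph", child
        elif child.tag == TBL:
            yield depth, "table", child
            for row in child.iterfind(TR):
                for cell in row.iterfind(TC):
                    yield from iter_blocks(cell, depth + 1)
        # Other children (section properties and so on) are ignored here.


# Hand-written sample: paragraph, table with a nested table, paragraph.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc>
    <w:p><w:r><w:t>Cell</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>Nested</w:t></w:r></w:p>
    </w:tc></w:tr></w:tbl>
  </w:tc></w:tr></w:tbl>
  <w:p><w:r><w:t>Outro</w:t></w:r></w:p>
</w:body>"""

outline = [(depth, kind) for depth, kind, _ in iter_blocks(ET.fromstring(SAMPLE))]
for depth, kind in outline:
    print("  " * depth + kind)
```

With python-docx, passing doc.element.body as the container should give the same walk over a real document. The elements it yields are raw XML elements rather than Paragraph and Table objects, so wrapping them is left to the caller.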

scanny answered Nov 05 '22 21:11



Assuming doc is of type Document, then what you want to do is have 3 separate iterations:

  • One for the paragraphs, as you have in your code
  • One for the tables, via doc.tables
  • One for the shapes, via doc.inline_shapes

The reason your code wasn't working is that paragraphs and runs don't expose references to the tables or inline shapes in the document; those are accessed through the Document object.
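A minimal sketch of the three separate passes might look like this. The function name collect_separately is made up for illustration; it only touches attributes that python-docx documents (doc.paragraphs, doc.tables, doc.inline_shapes, table.rows, row.cells, cell.text, and the width and height of an inline shape), and it takes the document as an argument so the snippet itself has no import of its own.

```python
def collect_separately(doc):
    """Gather paragraphs, tables and inline shapes in three passes.

    Hypothetical helper: `doc` is expected to behave like a python-docx
    Document. Each collection is read independently of the others.
    """
    # Pass 1: text of each paragraph in the document body.
    paragraphs = [para.text for para in doc.paragraphs]

    # Pass 2: each table as a list of rows, each row a list of cell texts.
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in doc.tables
    ]

    # Pass 3: size of each inline shape (for example, an inline picture).
    shapes = [(shape.width, shape.height) for shape in doc.inline_shapes]

    return paragraphs, tables, shapes


# Usage, assuming python-docx is installed and "example.docx" exists
# (the file name is a placeholder):
#
#     from docx import Document
#     paragraphs, tables, shapes = collect_separately(Document("example.docx"))
```

One thing to keep in mind with this approach: the three lists are independent, so they do not record where a table or picture sits relative to the surrounding paragraphs. If the HTML output has to preserve document order, that ordering needs to come from somewhere else.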

Here is the documentation for more info: python-docx

mleyfman answered Nov 05 '22 19:11
