Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python docx paragraph in textbox

Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?

I tried to find a keyword in all paragraphs in a document by iteration:

doc = Document('test.docx')

for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)

It is found if placed in normal text, but not inside a textbox.

like image 517
Stefan Avatar asked Apr 27 '16 11:04

Stefan


People also ask

How do you create a paragraph in DOCX Python?

You can directly use add_paragraph() method to add paragraph but if you want to set a new font style of the text you have to use add_run() as all the block-level formatting is done by using add_paragraph() method while all the character-level formatting is done by using add_run().

How do you add a line break in DOCX Python?

Looking into this some more, if you press Shift + Enter in Word it adds a manual line break (not a paragraph) via appending Chr(11) . In Open XML, this translates to a Break.

How do you make a paragraph in Python?

Use linebreak command \n several times as you wish and I suggest you use triple quotes as follows: print(""" First paragraph \n\n Second paragraph \n\n ... Last paragraph. """)


2 Answers

A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.

doc = Document('test.docx')

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                   print('found date: ', paragraph.text)
like image 95
Stefan Avatar answered Oct 06 '22 22:10

Stefan


Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:

body = doc._body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')

I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.

opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:

$ opc browse test.docx document.xml

http://opc-diag.readthedocs.org/en/latest/index.html

like image 35
scanny Avatar answered Oct 07 '22 00:10

scanny