Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Page number python-docx

I am trying to create a program in python that can find a specific word in a .docx file and return page number that it occurred on. So far, in looking through the python-docx documentation I have been unable to find how do access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just python? Or if not, what would be the best way to do this?

like image 916
lehast22 Avatar asked Mar 24 '16 04:03

lehast22


2 Answers

Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.

However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.

I don't know which do this (although I expect Word itself does) and how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something to iterate through to a specific page, but just thinking it through a bit it's going to be some detailed development there to get that working.

If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.

If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.

If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.

like image 169
scanny Avatar answered Sep 22 '22 07:09

scanny


Using Python-docx: identify a page break in paragraph

from docx import Document
fn='1.doc'
document = Document(fn)
pn=1    
import re
for p in document.paragraphs:
    r=re.match('Chapter \d+',p.text)
    if r:
        print(r.group(),pn)
    for run in p.runs:
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn+=1
            print('!!','='*50,pn)
like image 27
Smart Manoj Avatar answered Sep 21 '22 07:09

Smart Manoj