Is there a feature in the python-docx library to compute the number of pages in a document?
Not at the moment. Unlike a way to tell where the page breaks fall in the content, though, such a feature could be developed, at least if you were satisfied with whatever page count Word reported the last time it saved the document.
This statistic is saved in the app.xml properties "part" by Word on each save. So if you were confident the document you were inspecting had last been saved by Word (or LibreOffice, which I expect would work too), then that method should be pretty reliable. If the document were generated by, say, python-docx, that statistic would be unreliable.
If this is a feature you're interested in, feel free to add it to the GitHub issues list: https://github.com/python-openxml/python-docx/issues
I came up with this. Works for both pptx and docx files:
import re
import zipfile

# A .docx or .pptx file is a zip archive; the saved counts live in docProps/app.xml.
with zipfile.ZipFile("myDocxOrPptxFile.docx", "r") as archive:
    app_xml = archive.read("docProps/app.xml").decode("utf-8")

# Word writes <Pages>, PowerPoint writes <Slides>; \1 makes the closing tag match the opening one.
match = re.search(r"<(Pages|Slides)>(\d+)</\1>", app_xml)
# Fall back to 0 when neither element is present.
page_count = int(match.group(2)) if match else 0
print(page_count)
Office formats are just zip files with XML contents inside. You can read the contents of those files and parse them as you please.
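If you would rather parse the XML than run a regex over it, here is a minimal sketch using only the standard library. The function name `read_saved_count` is my own, not part of python-docx or any other library. It also assumes the file uses the usual (transitional) extended-properties namespace; a document saved in Strict Open XML mode uses a different namespace URI and would return None here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
# Assumption: transitional OOXML; Strict documents use another URI.
EP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def read_saved_count(path):
    """Return the page (docx) or slide (pptx) count stored at last save, or None."""
    with zipfile.ZipFile(path) as archive:
        try:
            app_xml = archive.read("docProps/app.xml")
        except KeyError:
            # The app.xml part is missing from this archive.
            return None
    root = ET.fromstring(app_xml)
    # Word writes <Pages>, PowerPoint writes <Slides>.
    for tag in ("Pages", "Slides"):
        element = root.find(EP_NS + tag)
        if element is not None and element.text and element.text.strip().isdigit():
            return int(element.text)
    return None
```

Calling `read_saved_count("myDocxOrPptxFile.docx")` gives the number as an int, or None when the statistic is absent. The same caveat applies as above: it is only whatever the last application to save the file wrote there.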