
Number of pages in a Word document

Tags:

python-docx

Is there a feature in the python-docx library to compute the number of pages in a document?

omri_saadon asked Oct 23 '25 14:10



2 Answers

Not at the moment. Unlike a way to tell where the page breaks fall in the content, though, such a feature could be developed, at least if you were satisfied with the page count Word recorded the last time it saved the document.

Word writes this statistic to the app.xml properties "part" (docProps/app.xml in the package) on each save. So if you are confident the document you are inspecting was last saved by Word (LibreOffice would probably work too, I expect), that method should be pretty reliable. If the document was generated by, say, python-docx, the statistic would be unreliable.
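As a rough sketch of reading that saved statistic with only the standard library (this is not a python-docx feature, and the function name `read_saved_page_count` is made up for illustration), something like the following could work. It assumes the file is a valid zip package and returns None when the properties part or the Pages element is missing:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the extended-properties part (docProps/app.xml).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def read_saved_page_count(path):
    """Return the <Pages> value stored at the last save, or None if absent.

    Hypothetical helper: the number is only as good as the application
    that last saved the file.
    """
    with zipfile.ZipFile(path) as archive:
        try:
            app_xml = archive.read("docProps/app.xml")
        except KeyError:
            return None  # the package has no app.xml part

    pages = ET.fromstring(app_xml).find(f"{{{EP_NS}}}Pages")
    if pages is None or not (pages.text or "").strip().isdigit():
        return None  # no usable <Pages> element
    return int(pages.text)
```

Returning None rather than 0 keeps "no statistic recorded" distinguishable from a real count, which matters for files that Word never saved.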

If this is a feature you're interested in, feel free to add it to the GitHub issues list: https://github.com/python-openxml/python-docx/issues

scanny answered Oct 25 '25 05:10



I came up with this. Works for both pptx and docx files:

import re
import zipfile

# A .docx or .pptx file is a zip archive; docProps/app.xml holds the
# statistics the application wrote at the last save.
with zipfile.ZipFile("myDocxOrPptxFile.docx") as archive:
    app_xml = archive.read("docProps/app.xml").decode("utf-8")

# Word stores the count in <Pages>, PowerPoint in <Slides>.
# \1 makes the closing tag match the opening one.
match = re.search(r"<(Pages|Slides)>(\d+)</\1>", app_xml)
page_count = int(match.group(2)) if match else 0  # 0 when no count is stored

print(page_count)

Office formats are just zip files with XML contents inside. You can read the contents of those files and parse them as you please.
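To see what is inside a given file before parsing it, a small sketch like this lists the XML members of the archive. The helper name `list_xml_parts` is invented for illustration, and the exact set of parts varies from file to file:

```python
import zipfile


def list_xml_parts(path):
    """Return the sorted names of the .xml members in an Office file.

    Hypothetical helper for exploring a package; it only reads the
    zip directory and does not parse any XML.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(".xml"))
```

For a typical .docx the result would be expected to include names such as docProps/app.xml, docProps/core.xml and word/document.xml.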

Richard answered Oct 25 '25 04:10
