I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.
I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format.
If I parse the document I could look for <w:br w:type="page"/>
occurrences in the xml tree but unfortunately these do not show non-forced page breaks.
I even considered converting files to PDF first and use something like PDFminer to parse the document page-wise.
Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like
[('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]
Tip: You can also use the keyboard shortcuts Ctrl + f (Windows, Linux, and ChromeOS) or + f (Mac) to find a word or phrase quickly. You can find more info about a specific word or phrase on a page. Highlight a word or phrase. On a PC: Right-click the highlighted content.
Press Ctrl+F, and then type your search words.
Search in your documentIn the search box at the top of the Navigation pane, type the text that you want to find. Click a result to see it in your document, or browse through all the results by clicking the Next Search Result and Previous Search Result arrows.
It seems that the biggest challenge in your question is how to be able to parse a document page by page. This answer of a word document is not always the same and it depends on the margins, the paper sheet settings, the application you use to open it etc. A good reasoning on the accuracy of any script for this purpose can be found at google group.
However, if you can be satisfied with a almost 100% accurate, you start to find a solution as suggested in this google group:
I found that I can unzip the .docx file and extract
docProps/app.xml
, then parse the XML with ElementTree to get the<Pages></Pages>
element. I found that most of the time that number is accurate, but I've seen a few instances where the number in that element is not correct.
Another approach could be to use win32com.client
to open the file, paginate it, make your search and then return the results in the format you want it.
You can find an example of the syntax in this answer:
from win32com.client import Dispatch
#open Word
word = Dispatch('Word.Application')
word.Visible = False
word = word.Documents.Open(doc_path)
#get number of sheets
word.Repaginate()
num_of_sheets = word.ComputeStatistics(2)
You can also have a look to this answer regarding find and replace in a word document using win32com.client.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With