Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding word on page(s) in document

I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.

I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format. If I parse the document I could look for <w:br w:type="page"/> occurrences in the xml tree but unfortunately these do not show non-forced page breaks.

I even considered converting files to PDF first and use something like PDFminer to parse the document page-wise.

Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like

[('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]
like image 848
birgit Avatar asked Sep 05 '15 22:09

birgit


People also ask

How do you find certain words on a page?

Tip: You can also use the keyboard shortcuts Ctrl + f (Windows, Linux, and ChromeOS) or + f (Mac) to find a word or phrase quickly. You can find more info about a specific word or phrase on a page. Highlight a word or phrase. On a PC: Right-click the highlighted content.

What is the key function used to find a word on a page?

Press Ctrl+F, and then type your search words.

How do you use the navigation pane to find particular words in a document?

Search in your documentIn the search box at the top of the Navigation pane, type the text that you want to find. Click a result to see it in your document, or browse through all the results by clicking the Next Search Result and Previous Search Result arrows.


1 Answers

Parse the xml files composing the docx

It seems that the biggest challenge in your question is how to be able to parse a document page by page. This answer of a word document is not always the same and it depends on the margins, the paper sheet settings, the application you use to open it etc. A good reasoning on the accuracy of any script for this purpose can be found at google group.

However, if you can be satisfied with a almost 100% accurate, you start to find a solution as suggested in this google group: 

I found that I can unzip the .docx file and extract docProps/app.xml, then parse the XML with ElementTree to get the <Pages></Pages> element. I found that most of the time that number is accurate, but I've seen a few instances where the number in that element is not correct.  

Use Win32com.Client

Another approach could be to use win32com.client to open the file, paginate it, make your search and then return the results in the format you want it.

You can find an example of the syntax in this answer:

from win32com.client import Dispatch
#open Word
word = Dispatch('Word.Application')
word.Visible = False
word = word.Documents.Open(doc_path)

#get number of sheets
word.Repaginate()
num_of_sheets = word.ComputeStatistics(2)

You can also have a look to this answer regarding find and replace in a word document using win32com.client.

like image 176
mabe02 Avatar answered Sep 27 '22 18:09

mabe02