Extract headings from an MS Word document in Python

Tags:

python

ms-word

I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for win32, but I don't know which method to use; the Python for Windows help document does not seem to list the functions of the Word object. Take the following code as an example:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

How can I find out all the functions of the Word object? I didn't find anything useful in the help document.

asked Jan 09 '13 14:01

zdd

2 Answers

The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:

for word in doc.Words:
    print(word)

And you would get all of the words. Each of those word items would be a Word object (reference here), so you could access those properties during iteration. In your case, here is how you would get the style:

for word in doc.Words:
    print(word.Style)

On a sample doc with a single Heading 1 and normal text, this prints:

Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
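Word applies a heading style to a whole paragraph, so looping over paragraphs instead of words avoids the repetition shown above. The following is a minimal sketch, assuming that `doc.Paragraphs` yields objects with `Style` and `Range.Text` members as described in the Word object model. `extract_headings` and `headings_from_word` are hypothetical helper names, and the "Heading" prefix assumes an English-language Word installation. The stand-in objects at the end only mimic that shape so the filter can be tried without Word:

```python
from types import SimpleNamespace


def extract_headings(paragraphs, prefix="Heading"):
    """Return (style_name, text) pairs for paragraphs whose style name starts with prefix.

    `paragraphs` is any iterable of objects exposing `.Style` and `.Range.Text`,
    which is what Word's `doc.Paragraphs` collection is assumed to yield.
    """
    headings = []
    for para in paragraphs:
        style_name = str(para.Style)  # compare on the string, not the COM object
        if style_name.startswith(prefix):
            # Paragraph text ends with '\r', so strip trailing whitespace.
            headings.append((style_name, str(para.Range.Text).strip()))
    return headings


def headings_from_word(path):
    """Open `path` in Word via COM (Windows only) and return its headings."""
    import win32com.client as win32  # imported lazily so extract_headings works without pywin32

    word = win32.Dispatch("Word.Application")
    word.Visible = 0
    doc = word.Documents.Open(path)  # an absolute path is usually the safest choice
    try:
        return extract_headings(doc.Paragraphs)
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()


# Stand-in paragraphs (not real COM objects) mirroring the sample document.
sample = [
    SimpleNamespace(Style="Heading 1", Range=SimpleNamespace(Text="Here is some text\r")),
    SimpleNamespace(Style="Normal", Range=SimpleNamespace(Text="No header\r")),
]
print(extract_headings(sample))  # [('Heading 1', 'Here is some text')]
```

Keeping the filter separate from the COM call means the matching logic can be checked on any machine, while `headings_from_word` remains the only Windows-specific part.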

To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:

from itertools import groupby
import win32com.client as win32

# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument

# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str(). 
# There was some other interesting behavior, but I have zero 
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(word) for word in grp_wrds))

This outputs:

Heading 1 Here is some text

Normal 
No header

If you replace the join with a list comprehension, you get the below (where you can see the newlines):

Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
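Why grouping on the Style itself fails can be illustrated without Word at all. `groupby` merges adjacent items only while their keys compare equal, and objects that define no equality of their own compare by identity. The sketch below uses two hypothetical stand-in classes, `FakeStyle` and `FakeWord`; they are an analogy for what the COM wrappers appear to do, not the actual win32com implementation:

```python
from itertools import groupby


class FakeStyle:
    """Hypothetical stand-in for a style wrapper: no __eq__, so instances
    compare by identity even when they describe the same style."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class FakeWord:
    """Hypothetical stand-in for a word; .Style builds a new wrapper on each access."""

    def __init__(self, text, style_name):
        self.text = text
        self._style_name = style_name

    @property
    def Style(self):
        return FakeStyle(self._style_name)


words = [
    FakeWord("Here ", "Heading 1"),
    FakeWord("is ", "Heading 1"),
    FakeWord("No ", "Normal"),
]

# Grouping on the object: every key is a distinct instance, so nothing merges.
by_object = [len(list(grp)) for _, grp in groupby(words, key=lambda w: w.Style)]

# Grouping on str(): equal strings compare equal, so adjacent words merge.
by_name = [(name, len(list(grp))) for name, grp in groupby(words, key=lambda w: str(w.Style))]

print(by_object)  # [1, 1, 1]
print(by_name)    # [('Heading 1', 2), ('Normal', 1)]
```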

answered Oct 11 '22 11:10

RocketDonkey

Convert the Word document to .docx and use the python-docx module:

from docx import Document

file = 'test.docx'
document = Document(file)

for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
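The loop above only matches 'Heading 1'. To collect every heading level, one option is to match on the style-name prefix instead. This is a minimal sketch, assuming the built-in English style names ("Heading 1", "Heading 2", and so on). `heading_texts` and `headings_from_docx` are hypothetical helper names, and the stand-in paragraphs only imitate the `style.name` and `text` attributes that python-docx paragraphs provide:

```python
from types import SimpleNamespace


def heading_texts(paragraphs, prefix="Heading"):
    """Return (style_name, text) for every paragraph whose style name starts with prefix."""
    return [
        (p.style.name, p.text)
        for p in paragraphs
        if p.style.name.startswith(prefix)
    ]


def headings_from_docx(path):
    """Read `path` with python-docx (pip install python-docx) and return its headings."""
    from docx import Document  # imported lazily so heading_texts works without python-docx

    return heading_texts(Document(path).paragraphs)


# Stand-in paragraphs so the filter can be tried without a .docx file.
sample = [
    SimpleNamespace(style=SimpleNamespace(name="Heading 1"), text="Intro"),
    SimpleNamespace(style=SimpleNamespace(name="Normal"), text="Body text"),
    SimpleNamespace(style=SimpleNamespace(name="Heading 2"), text="Details"),
]
print(heading_texts(sample))  # [('Heading 1', 'Intro'), ('Heading 2', 'Details')]
```

With a real file, `headings_from_docx('test.docx')` would return the same kind of list. Documents that use custom or localized style names would need a different prefix.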

answered Oct 11 '22 11:10

Pankaj Singh
