
Extract headings from a MS Word document in Python

Tags: python, ms-word

I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for win32, but I don't know which method to use; the Python for Windows help document does not seem to list the functions of the Word object. Take the following code as an example:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

How can I find out all the functions of the Word object? I didn't find anything useful in the help document.

zdd asked Jan 09 '13 14:01




2 Answers

The Word object model is documented in Microsoft's Word object model reference. Your doc object exposes the properties listed there, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:

for w in doc.Words:  # "w" avoids shadowing the Word application object
    print(w)

And you would get all of the words. Each item in doc.Words is a Range object representing one word, so you can access its properties during iteration. In your case, here is how you would get the style:

for w in doc.Words:
    print(w.Style)  # printing the Style object shows its name

On a sample doc with a single Heading 1 and normal text, this prints:

Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
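
Heading styles are paragraph styles, so a per-paragraph pass may be simpler than a per-word one. What follows is only a minimal sketch, not tested against any particular Word version. It assumes English style names that start with "Heading", it relies on str() of the Style giving the style name (as the output above suggests), and the helper names heading_paragraphs and open_word_document are made up for illustration.

```python
import os


def heading_paragraphs(paragraphs, prefix="Heading"):
    """Return (style name, text) pairs for paragraphs whose style name starts with prefix.

    `paragraphs` is any iterable of objects with a .Style and a .Range.Text,
    such as doc.Paragraphs from Word.
    """
    found = []
    for para in paragraphs:
        style_name = str(para.Style)  # assumption: str() of the Style is its name
        if style_name.startswith(prefix):
            # Word ends each paragraph with '\r', so trim surrounding whitespace
            found.append((style_name, para.Range.Text.strip()))
    return found


def open_word_document(path):
    """Hypothetical helper: open `path` in a hidden Word instance (Windows only)."""
    import win32com.client as win32  # imported lazily; needs pywin32 and Word
    app = win32.Dispatch("Word.Application")
    app.Visible = 0
    # Word generally expects a full path rather than a relative one
    return app, app.Documents.Open(os.path.abspath(path))


# Usage (on Windows, with Word installed):
#   app, doc = open_word_document("testdoc.doc")
#   for style_name, text in heading_paragraphs(doc.Paragraphs):
#       print(style_name, text)
#   app.Quit()
```

Keeping the filtering separate from the COM calls means the filter can be tried on plain Python objects without Word being installed.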

To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:

from itertools import groupby
import win32com.client as win32

# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument

# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str(). 
# There was some other interesting behavior, but I have zero 
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(w) for w in grp_wrds))

This outputs:

Heading 1 Here is some text

Normal 
No header

If you replace the join with a list comprehension, you get the below (where you can see the newlines):

Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
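
To illustrate why the grouping key has to be the str() of the style, here is a small Word-free sketch. FakeStyle and FakeWord are hypothetical stand-ins invented for this example. They only mimic the behaviour described above, where each access to Style gives a separate instance that does not compare equal to the others; whether real COM objects behave this way for exactly this reason is an assumption.

```python
from itertools import groupby


class FakeStyle:
    """Stand-in for a COM Style object: no __eq__, so instances compare by identity."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class FakeWord:
    """Stand-in for one item of doc.Words; builds a new style object on every access."""

    def __init__(self, text, style_name):
        self.text = text
        self._style_name = style_name

    @property
    def Style(self):
        return FakeStyle(self._style_name)

    def __str__(self):
        return self.text


words = [FakeWord("Here ", "Heading 1"), FakeWord("is ", "Heading 1"),
         FakeWord("No ", "Normal"), FakeWord("header", "Normal")]

# Grouping on the object itself: every key is a distinct object, so no two are equal
by_object = [len(list(grp)) for _, grp in groupby(words, key=lambda w: w.Style)]

# Grouping on the string: consecutive words with the same style name form one run
by_name = [(name, "".join(str(w) for w in grp))
           for name, grp in groupby(words, key=lambda w: str(w.Style))]

print(by_object)  # [1, 1, 1, 1]
print(by_name)    # [('Heading 1', 'Here is '), ('Normal', 'No header')]
```

Note that groupby only merges consecutive items with equal keys, which is what you want here because the words arrive in document order.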
RocketDonkey answered Oct 11 '22 11:10


Convert the Word document to .docx and use the python-docx module:

from docx import Document

file = 'test.docx'
document = Document(file)

for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
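
If the document uses more than one heading level, the same idea can be loosened to match any style whose name starts with "Heading". This is a hedged sketch rather than part of the original answer: it assumes the default English style names, and iter_headings and headings_from_docx are hypothetical helper names.

```python
def iter_headings(paragraphs, prefix="Heading"):
    """Yield (style name, text) for each paragraph whose style name starts with prefix."""
    for paragraph in paragraphs:
        name = paragraph.style.name or ""  # guard against a missing style name
        if name.startswith(prefix):
            yield name, paragraph.text


def headings_from_docx(path):
    """Hypothetical helper: read `path` with python-docx and list its headings."""
    from docx import Document  # lazy import; requires `pip install python-docx`
    return list(iter_headings(Document(path).paragraphs))


# Usage:
#   for name, text in headings_from_docx('test.docx'):
#       print(name, text)
```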
Pankaj Singh answered Oct 11 '22 11:10