How do I extract data from a doc/docx file using Python

Tags:

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and save it in an XML file. Reading up on python-docx did not help, as it only seems to allow one to write into word documents, rather than read. To present my task exactly (or how i chose to approach my task): I would like to search for a key word or phrase in the document (the document contains tables) and extract text data from the table where the key word/phrase is found. Anybody have any ideas?

399

asked Mar 31 '14 07:03

Stefan Urziceanu

2 Answers

The docx is a zip file containing an XML of the document. You can open the zip, read the document and parse data using ElementTree.

The advantage of this technique is that you don't need any extra python libraries installed.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print ''.join(node.text for node in cell.iter(TEXT))

See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.

In answer to a comment below, Images are not as clear cut to extract. I have created an empty docx and inserted one image into it. I then open the docx file as a zip archive (using 7zip) and looked at the document.xml. All the image information is stored as attributes in the XML not the CDATA like the text is. So you need to find the tag you are interested in and pull out the information that you are looking for.

For example adding to the script above:

IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'

for image in tree.iter(IMAGE):
    print image.attrib

outputs:

{'id': '1', 'name': 'Picture 1'}

I'm no expert at the openxml format but I hope this helps.

I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.

answered Sep 19 '22 07:09

Mike Robins

To search in a document with python-docx

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

You also have a function to get the text of a document:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)

Using https://github.com/mikemaccana/python-docx

answered Sep 22 '22 07:09

edi9999

Related questions
                            
                                Python count all possible combinations for a table
                            
                                How do you override BaseHTTPRequestHandler log_message() method to log to a file rather than to console (sys.stderr)?
                            
                                connecting to mysql db on amazon rds
                            
                                Cython: Should I use np.float_t rather than double for typed memory views
                            
                                Is it possible to access current object while doing list/dict comprehension in Python?
                            
                                Why does mixing types in Python struct.pack uses more space than needed?
                            
                                Pandas, how to filter a df to get unique entries?
                            
                                Pythonic syntax to concatenate the keys and values of a dictionary
                            
                                Fit data to all possible distributions and return the best fit [closed]
                            
                                Current value of generator
                            
                                'Zip' dictionary of lists in Python
                            
                                Django: How to allow a Suspicious File Operation / copy a file
                            
                                Matplotlib / python clickable points
                            
                                How to make a window with buttons in python
                            
                                cv2.imshow and cv2.imwrite
                            
                                ConfigParser.MissingSectionHeaderError when parsing rsyncd config file with global options
                            
                                Multi-index dataframe from sequence of dataframes [duplicate]
                            
                                Python: I'm trying to find the maximum difference between two elements in a list
                            
                                Creating new binary columns from single string column in pandas
                            
                                Python print unicode list

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do I extract data from a doc/docx file using Python

Tags:

python

ms-word

docx

doc

Stefan Urziceanu

People also ask

2 Answers

Mike Robins

edi9999

Recent Activity

Donate For Us