
Parsing of table from .docx file [closed]

I want to parse a table from a .docx file using Python and python-docx into some useful data structure.

The .docx file contains only a single table in my case. I've uploaded it so you can have a look. Here's a screenshot:


Sreedhar asked Jan 09 '15 13:01


1 Answer

You can use the snippet below to parse your document into a list where each row is a dictionary mapping the table header value to the column value.

from docx import Document

# Load the first table from your document. In your example file,
# there is only one table, so I just grab the first one.
document = Document('Books.docx')
table = document.tables[0]

# Data will be a list of rows represented as dictionaries
# containing each row's data.
data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # Establish the mapping based on the first row
    # headers; these will become the keys of our dictionary
    if i == 0:
        keys = tuple(text)
        # The header row holds no data, so move on to the next row
        continue

    # Construct a dictionary for this row, mapping
    # keys to values for this row
    row_data = dict(zip(keys, text))
    data.append(row_data)

This will give you:

data = [
  {u'Pub.': u'Penguin Books',
   u'Auther': u'Edward de BONO',
   u'Sr. No.': u'1',
   u'Name of Book': u'Six Thinking Hats'},
  ...
]

If you just want a tuple for each row instead of a dictionary, set row_data to the tuple value of text. In the loop, replace the dict construction with:

# Construct a tuple for this row
row_data = tuple(text)

Now, data would hold something like this instead:

data = [
  (u'1',
   u'Six Thinking Hats',
   u'Edward de BONO',
   u'Penguin Books'),
  ...
]

In that case you don't need to build keys at all, but you still have to skip the first (header) row.

vicvicvic answered Oct 09 '22 19:10
