Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to create a dataframe from a table in a word document (.docx) file using pandas

I have a word file (.docx) with table of data, I am trying to create a pandas data frame using that table, I have used docx and pandas module. But I could not create a data frame.

from docx import Document
document = Document('req.docx')
for table in document.tables:
    for row in table.rows:       
        for cell in row.cells:        
            print (cell.text)

and also tried to read table as df pd.read_table("path of the file")

I can read the data cell by cell but I want to read the entire table or any particular column. Thanks in advance

like image 958
Vicky Avatar asked Dec 26 '17 10:12

Vicky


People also ask

How do I read a .docx file with pandas?

Process the table data to pandas dataframe. Using the in-built attributes of python-docx library, read each rows of the table and retrieve the text from each cells and create python list of list containing each row. Then convert that python data structure to pandas DataFrame.


1 Answers

docx always reads data from Word tables as text (strings).

If we want to parse data with correct dtypes we can do one of the following:

  • manually specify dtype for all columns (not flexible)
  • write our own code to guess correct dtypes (too difficult and , Pandas IO methods do it well)
  • convert data into CSV format and let pd.read_csv() guess/infer correct dtypes (I've chosen this way)

Many thanks to @Anton vBR for improving the function!


import pandas as pd
import io
import csv
from docx import Document

def read_docx_tables(filename, tab_id=None, **kwargs):
    """
    parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)

    Parameters:
        filename:   file name of a Word Document

        tab_id:     parse a single table with the index: [tab_id] (counting from 0).
                    When [None] - return a list of DataFrames (parse all tables)

        kwargs:     arguments to pass to `pd.read_csv()` function

    Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
    """
    def read_docx_tab(tab, **kwargs):
        vf = io.StringIO()
        writer = csv.writer(vf)
        for row in tab.rows:
            writer.writerow(cell.text for cell in row.cells)
        vf.seek(0)
        return pd.read_csv(vf, **kwargs)

    doc = Document(filename)
    if tab_id is None:
        return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
    else:
        try:
            return read_docx_tab(doc.tables[tab_id], **kwargs)
        except IndexError:
            print('Error: specified [tab_id]: {}  does not exist.'.format(tab_id))
            raise

NOTE: you may want to add more checks and exception catching...

Examples:

In [209]: dfs = read_docx_tables(fn)

In [210]: dfs[0]
Out[210]:
   A   B               C,X
0  1  B1                C1
1  2  B2                C2
2  3  B3  val1, val2, val3

In [211]: dfs[0].dtypes
Out[211]:
A       int64
B      object
C,X    object
dtype: object

In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')

In [213]: dfs[1]
Out[213]:
   C1  C2          C3    Text column
0  11  21         NaN  Test "quotes"
1  12  23  2017-12-31            NaN

In [214]: dfs[1].dtypes
Out[214]:
C1              int64
C2              int64
C3             object
Text column    object
dtype: object

In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')

parsing dates:

In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])

In [217]: df
Out[217]:
   C1  C2         C3    Text column
0  11  21        NaT  Test "quotes"
1  12  23 2017-12-31            NaN

In [218]: df.dtypes
Out[218]:
C1                      int64
C2                      int64
C3             datetime64[ns]
Text column            object
dtype: object
like image 151
MaxU - stop WAR against UA Avatar answered Sep 28 '22 08:09

MaxU - stop WAR against UA