how to create a dataframe from a table in a word document (.docx) file using pandas

Tags:

I have a word file (.docx) with table of data, I am trying to create a pandas data frame using that table, I have used docx and pandas module. But I could not create a data frame.

from docx import Document
document = Document('req.docx')
for table in document.tables:
    for row in table.rows:       
        for cell in row.cells:        
            print (cell.text)

and also tried to read table as df pd.read_table("path of the file")

I can read the data cell by cell but I want to read the entire table or any particular column. Thanks in advance

958

asked Dec 26 '17 10:12

Vicky

1 Answers

docx always reads data from Word tables as text (strings).

If we want to parse data with correct dtypes we can do one of the following:

manually specify dtype for all columns (not flexible)
write our own code to guess correct dtypes (too difficult and , Pandas IO methods do it well)
convert data into CSV format and let pd.read_csv() guess/infer correct dtypes (I've chosen this way)

Many thanks to @Anton vBR for improving the function!

import pandas as pd
import io
import csv
from docx import Document

def read_docx_tables(filename, tab_id=None, **kwargs):
    """
    parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)

    Parameters:
        filename:   file name of a Word Document

        tab_id:     parse a single table with the index: [tab_id] (counting from 0).
                    When [None] - return a list of DataFrames (parse all tables)

        kwargs:     arguments to pass to `pd.read_csv()` function

    Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
    """
    def read_docx_tab(tab, **kwargs):
        vf = io.StringIO()
        writer = csv.writer(vf)
        for row in tab.rows:
            writer.writerow(cell.text for cell in row.cells)
        vf.seek(0)
        return pd.read_csv(vf, **kwargs)

    doc = Document(filename)
    if tab_id is None:
        return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
    else:
        try:
            return read_docx_tab(doc.tables[tab_id], **kwargs)
        except IndexError:
            print('Error: specified [tab_id]: {}  does not exist.'.format(tab_id))
            raise

NOTE: you may want to add more checks and exception catching...

Examples:

In [209]: dfs = read_docx_tables(fn)

In [210]: dfs[0]
Out[210]:
   A   B               C,X
0  1  B1                C1
1  2  B2                C2
2  3  B3  val1, val2, val3

In [211]: dfs[0].dtypes
Out[211]:
A       int64
B      object
C,X    object
dtype: object

In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')

In [213]: dfs[1]
Out[213]:
   C1  C2          C3    Text column
0  11  21         NaN  Test "quotes"
1  12  23  2017-12-31            NaN

In [214]: dfs[1].dtypes
Out[214]:
C1              int64
C2              int64
C3             object
Text column    object
dtype: object

In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')

parsing dates:

In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])

In [217]: df
Out[217]:
   C1  C2         C3    Text column
0  11  21        NaT  Test "quotes"
1  12  23 2017-12-31            NaN

In [218]: df.dtypes
Out[218]:
C1                      int64
C2                      int64
C3             datetime64[ns]
Text column            object
dtype: object

151

answered Sep 28 '22 08:09

MaxU - stop WAR against UA

Related questions
                            
                                Convert datetime to time in python
                            
                                Find column name in pandas that matches an array
                            
                                prevent duplicate celery logging
                            
                                properly mock celery task that is being called inside another celery task
                            
                                python -m: Error while finding module specification
                            
                                zsh: command not found: flake8 but flake8 is installed
                            
                                Combine pandas string columns with missing values
                            
                                How to wrap text for an entire column using pandas?
                            
                                Neither DSN nor SERVER keyword supplied
                            
                                Alias for column in pandas
                            
                                How to print the first ten elements from Counter in python
                            
                                Display list result horizontally in python 3 rather than vertically
                            
                                Finding longest perfect match between two strings
                            
                                Remove empty entries during string split
                            
                                Python find numbers between range in list or array
                            
                                batch_input_shape tuple on Keras LSTM
                            
                                How to read the text from the alert box using Python + Selenium
                            
                                Forcing dict keys to be used as argument specifiers with str.format
                            
                                Why do scipy and numpy fft plots look different?
                            
                                Python Logging: Change "WARN" to "INFO"

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to create a dataframe from a table in a word document (.docx) file using pandas

Tags:

python

pandas

dataframe

python-docx

docx

Vicky

People also ask

1 Answers

MaxU - stop WAR against UA

Recent Activity

Donate For Us