Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Reading Excel with merged cells

I have Excel files with multiple sheets, each of which looks a little like this (but much longer):

        Sample  CD4     CD8 Day 1   8311    17.3    6.44         8312    13.6    3.50         8321    19.8    5.88         8322    13.5    4.09 Day 2   8311    16.0    4.92         8312    5.67    2.28         8321    13.0    4.34         8322    10.6    1.95 

The first column is actually four cells merged vertically.

When I read this using pandas.read_excel, I get a DataFrame that looks like this:

       Sample    CD4   CD8 Day 1    8311  17.30  6.44 NaN      8312  13.60  3.50 NaN      8321  19.80  5.88 NaN      8322  13.50  4.09 Day 2    8311  16.00  4.92 NaN      8312   5.67  2.28 NaN      8321  13.00  4.34 NaN      8322  10.60  1.95 

How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)

like image 522
iayork Avatar asked Apr 08 '14 12:04

iayork


People also ask

How do you read merged Excel cells with Nan into pandas DataFrame?

To fill all the NaNs in a DataFrame row-wise, you could use df = df. fillna(method='ffill', axis=1) . To fill only selected rows, use df. loc or df.

How do I read merged cells in Excel using Openpyxl?

cell = sheet. cell(row=15, column=14) if type(cell). __name__ == 'MergedCell': print("Oh no, the cell is merged!") else: print("This cell is not merged.") Openpyxl seems to have changed sense this answer was posted.


2 Answers

You could use the Series.fillna method to forword-fill in the NaN values:

df.index = pd.Series(df.index).fillna(method='ffill') 

For example,

In [42]: df Out[42]:         Sample    CD4   CD8 Day 1    8311  17.30  6.44 NaN      8312  13.60  3.50 NaN      8321  19.80  5.88 NaN      8322  13.50  4.09 Day 2    8311  16.00  4.92 NaN      8312   5.67  2.28 NaN      8321  13.00  4.34 NaN      8322  10.60  1.95  [8 rows x 3 columns]  In [43]: df.index = pd.Series(df.index).fillna(method='ffill')  In [44]: df Out[44]:         Sample    CD4   CD8 Day 1    8311  17.30  6.44 Day 1    8312  13.60  3.50 Day 1    8321  19.80  5.88 Day 1    8322  13.50  4.09 Day 2    8311  16.00  4.92 Day 2    8312   5.67  2.28 Day 2    8321  13.00  4.34 Day 2    8322  10.60  1.95  [8 rows x 3 columns] 
like image 126
unutbu Avatar answered Sep 21 '22 15:09

unutbu


df = df.fillna(method='ffill', axis=0)  # resolved updating the missing row entries 
like image 43
Muth Avatar answered Sep 20 '22 15:09

Muth