Fix DataFrame columns when reading an Excel file with a header with merged cells

I want to read with Python Pandas an Excel file which looks like this:

Excel file screenshot https://www.dropbox.com/s/1usfr3fxfy2qlpp/header_with_merged_cells.xlsx?dl=0

We can see that this Excel file has a header with merged cells.

I did

import pandas as pd

df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3)

print(df)
print(df.dtypes)
print(df.columns)

it returns a DataFrame like:

        ColA ColB ColC  Unnamed: 3           Unnamed: 4 ColD
0        NaT  NaN    1         2.0                    3  NaN
1 2010-01-01    A    A         2.1  2010-02-01 00:00:00    E
2 2010-01-02    B    C         2.2  2010-02-02 00:00:00    F

dtypes like:

ColA          datetime64[ns]
ColB                  object
ColC                  object
Unnamed: 3           float64
Unnamed: 4            object
ColD                  object

columns like:

Index(['ColA', 'ColB', 'ColC', 'Unnamed: 3', 'Unnamed: 4', 'ColD'], dtype='object')

Is there a way to fix the columns to get either ColA, ColB, ColC.1, ColC.2, ColC.3, ColD or MultiIndex columns?

One issue is that cell D5 is read as a float (instead of an int or str). Another issue is that column E should be read as datetime64[ns].

The header parameter of read_excel can help:

df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])

but we get a DataFrame like:

ColA                     ColB ColC                               ColD
           Unnamed: 0_level_1    1    2          3 Unnamed: 4_level_1
2010-01-01                  A    A  2.1 2010-02-01                  E
2010-01-02                  B    C  2.2 2010-02-02                  F

dtypes like:

ColA
ColB  Unnamed: 0_level_1            object
ColC  1                             object
      2                            float64
      3                     datetime64[ns]
ColD  Unnamed: 4_level_1            object
dtype: object

columns like:

MultiIndex(levels=[['ColB', 'ColC', 'ColD'], [1, 2, 3, 'Unnamed: 0_level_1', 'Unnamed: 4_level_1']],
           labels=[[0, 1, 1, 1, 2], [3, 0, 1, 2, 4]],
           names=['ColA', None])

It's odd to see column names such as Unnamed: 0_level_1 and Unnamed: 4_level_1. Is there a way to fix this?

asked Feb 09 '17 09:02 by scls



1 Answer

It is not easy.

First pass the header parameter to create a MultiIndex, then rename the Unnamed column labels to empty strings.

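import pandas as pd

# Read both header rows so the columns become a MultiIndex
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])
# In the output shown in the question ColA ends up as the index, so move it back into the columns
df = df.reset_index()
# Blank out the auto-generated "Unnamed: ..." labels
df = df.rename(columns=lambda x: x if 'Unnamed' not in str(x) else '')
df = df.rename(columns={'index':'ColA'})
df.columns.names = (None, None)
print(df)

which prints:

        ColA ColB ColC                 ColD
                     1    2          3     
0 2010-01-01    A    A  2.1 2010-02-01    E
1 2010-01-02    B    C  2.2 2010-02-02    F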
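
If flat names such as ColC.1 are preferred over a MultiIndex (the other option the question asks about), the two header levels can be joined afterwards. The following is only a minimal sketch: the frame is built by hand to mimic what read_excel with header=[0,1] might return, so the exact "Unnamed: ..." labels and the helper name flatten are assumptions made for illustration, not something taken from the workbook.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by read_excel(..., header=[0, 1]),
# so the example runs without the workbook. The "Unnamed: ..." labels are
# assumed placeholders for header cells that have no sub-label.
columns = pd.MultiIndex.from_tuples([
    ("ColA", "Unnamed: 0_level_1"),
    ("ColB", "Unnamed: 1_level_1"),
    ("ColC", 1),
    ("ColC", 2),
    ("ColC", 3),
    ("ColD", "Unnamed: 5_level_1"),
])
df = pd.DataFrame(
    [
        [pd.Timestamp("2010-01-01"), "A", "A", 2.1, pd.Timestamp("2010-02-01"), "E"],
        [pd.Timestamp("2010-01-02"), "B", "C", 2.2, pd.Timestamp("2010-02-02"), "F"],
    ],
    columns=columns,
)


def flatten(col):
    """Join a (top, sub) column tuple into a single flat label (hypothetical helper)."""
    top, sub = col
    # A placeholder sub-label means the top-level cell was not split: keep it alone
    if str(sub).startswith("Unnamed"):
        return str(top)
    return f"{top}.{sub}"


flat = df.copy()
flat.columns = [flatten(col) for col in df.columns]
print(list(flat.columns))
# ['ColA', 'ColB', 'ColC.1', 'ColC.2', 'ColC.3', 'ColD']
```

Because only the labels are rewritten, the column data and dtypes are left as they were read.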
answered Sep 19 '22 10:09 by jezrael