Fix DataFrame columns when reading an Excel file with a header with merged cells

Tags: python, pandas, excel

I want to read with Python Pandas an Excel file which looks like this:

Excel file screenshot https://www.dropbox.com/s/1usfr3fxfy2qlpp/header_with_merged_cells.xlsx?dl=0

We can see that this Excel file has a header with merged cells.

I tried:

import pandas as pd

df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3)

print(df)
print(df.dtypes)
print(df.columns)

It returns a DataFrame like:

        ColA ColB ColC  Unnamed: 3           Unnamed: 4 ColD
0        NaT  NaN    1         2.0                    3  NaN
1 2010-01-01    A    A         2.1  2010-02-01 00:00:00    E
2 2010-01-02    B    C         2.2  2010-02-02 00:00:00    F

dtypes like:

ColA          datetime64[ns]
ColB                  object
ColC                  object
Unnamed: 3           float64
Unnamed: 4            object
ColD                  object

columns like:

Index(['ColA', 'ColB', 'ColC', 'Unnamed: 3', 'Unnamed: 4', 'ColD'], dtype='object')

Is there a way to fix the columns to get ColA, ColB, ColC.1, ColC.2, ColC.3, ColD, or MultiIndex columns?

One issue is that cell D5 is read as a float (instead of int or str). Another issue is that column E should be read as datetime64[ns].

The header parameter of `read_excel` can help:

df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])

but we get a DataFrame like:

ColA                     ColB ColC                               ColD
           Unnamed: 0_level_1    1    2          3 Unnamed: 4_level_1
2010-01-01                  A    A  2.1 2010-02-01                  E
2010-01-02                  B    C  2.2 2010-02-02                  F

dtypes like:

ColA
ColB  Unnamed: 0_level_1            object
ColC  1                             object
      2                            float64
      3                     datetime64[ns]
ColD  Unnamed: 4_level_1            object
dtype: object

columns like:

MultiIndex(levels=[['ColB', 'ColC', 'ColD'], [1, 2, 3, 'Unnamed: 0_level_1', 'Unnamed: 4_level_1']],
           labels=[[0, 1, 1, 1, 2], [3, 0, 1, 2, 4]],
           names=['ColA', None])

It is odd to see columns such as Unnamed: 0_level_1 and Unnamed: 4_level_1. Is there a way to fix this?

asked Feb 09 '17 09:02

scls

1 Answer

It is not easy.

First pass the header parameter to read_excel to create MultiIndex columns, then rename the Unnamed column labels to empty strings.

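import pandas as pd

# Read both header rows so the merged cells become MultiIndex columns
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])
# ColA was read as the index here; move it back into a regular column
df = df.reset_index()
# Replace the placeholder "Unnamed: ..." labels with empty strings
df = df.rename(columns=lambda x: '' if 'Unnamed' in str(x) else x)
df = df.rename(columns={'index':'ColA'})
df.columns.names = (None, None)
print(df)

This returns a DataFrame like:

        ColA ColB ColC                 ColD
                     1    2          3     
0 2010-01-01    A    A  2.1 2010-02-01    E
1 2010-01-02    B    C  2.2 2010-02-02    F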
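
If flat names such as ColC.1, ColC.2, ColC.3 are preferred over MultiIndex columns, the two levels can be joined afterwards. The following is only a sketch under stated assumptions: the DataFrame is built in memory as a stand-in for the result above (so it runs without the Excel file), and `flatten_columns` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for the cleaned result above, built in memory (assumption:
# same two-level column layout, with '' where the second level is empty).
columns = pd.MultiIndex.from_tuples(
    [("ColA", ""), ("ColB", ""), ("ColC", 1), ("ColC", 2), ("ColC", 3), ("ColD", "")]
)
df = pd.DataFrame(
    [
        [pd.Timestamp("2010-01-01"), "A", "A", 2.1, pd.Timestamp("2010-02-01"), "E"],
        [pd.Timestamp("2010-01-02"), "B", "C", 2.2, pd.Timestamp("2010-02-02"), "F"],
    ],
    columns=columns,
)


def flatten_columns(frame, sep="."):
    """Hypothetical helper: join two-level column labels into flat names."""
    flat = [
        # Keep only the top label when the second level is empty
        f"{top}{sep}{sub}" if str(sub) != "" else str(top)
        for top, sub in frame.columns
    ]
    out = frame.copy()
    out.columns = flat
    return out


flat_df = flatten_columns(df)
print(list(flat_df.columns))
```

Joining with a separator keeps the parent header visible in each name, which matches the ColC.1 style asked for in the question; the separator is a free choice.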

answered Sep 19 '22 10:09

jezrael
