I want to read an Excel file with pandas. The file looks like this:
https://www.dropbox.com/s/1usfr3fxfy2qlpp/header_with_merged_cells.xlsx?dl=0
We can see that this Excel file has a header with merged cells.
I tried:
import pandas as pd
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3)
print(df)
print(df.dtypes)
print(df.columns)
It returns a DataFrame like this:
        ColA ColB ColC  Unnamed: 3           Unnamed: 4 ColD
0        NaT  NaN    1         2.0                    3  NaN
1 2010-01-01    A    A         2.1  2010-02-01 00:00:00    E
2 2010-01-02    B    C         2.2  2010-02-02 00:00:00    F
dtypes like:
ColA          datetime64[ns]
ColB                  object
ColC                  object
Unnamed: 3           float64
Unnamed: 4            object
ColD                  object
columns like:
Index(['ColA', 'ColB', 'ColC', 'Unnamed: 3', 'Unnamed: 4', 'ColD'], dtype='object')
Is there a way to fix the columns to get ColA, ColB, ColC.1, ColC.2, ColC.3, ColD, or MultiIndex columns?
One issue is that cell D5 is read as a float (instead of int or str).
Another issue is that column E should be read as datetime64[ns].
The header parameter of read_excel can help:
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])
but we get a DataFrame like:
ColA                     ColB ColC                               ColD
           Unnamed: 0_level_1    1    2          3 Unnamed: 4_level_1
2010-01-01                  A    A  2.1 2010-02-01                  E
2010-01-02                  B    C  2.2 2010-02-02                  F
dtypes like:
ColA
ColB  Unnamed: 0_level_1            object
ColC  1                             object
      2                            float64
      3                     datetime64[ns]
ColD  Unnamed: 4_level_1            object
dtype: object
columns like:
MultiIndex(levels=[['ColB', 'ColC', 'ColD'], [1, 2, 3, 'Unnamed: 0_level_1', 'Unnamed: 4_level_1']],
           labels=[[0, 1, 1, 1, 2], [3, 0, 1, 2, 4]],
           names=['ColA', None])
It's odd to see columns such as Unnamed: 0_level_1 and Unnamed: 4_level_1.
Is there a way to fix that?
It is not easy.
First pass the header parameter to create a MultiIndex, then rename the Unnamed column labels to empty strings.
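import pandas as pd

# Read both header rows as a two-level MultiIndex
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])
# In the output above the ColA dates became the index; move them back into a column
df = df.reset_index()
# Blank out the auto-generated 'Unnamed: ...' labels
df = df.rename(columns=lambda x: '' if 'Unnamed' in str(x) else x)
df = df.rename(columns={'index':'ColA'})
# Drop the leftover column-level names
df.columns.names = (None, None)
print(df)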
        ColA ColB ColC                 ColD
                     1    2          3     
0 2010-01-01    A    A  2.1 2010-02-01    E
1 2010-01-02    B    C  2.2 2010-02-02    F
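If flat names such as ColC.1, ColC.2, ColC.3 are preferred over a MultiIndex, the two header levels can be joined afterwards. The following is only a sketch: the DataFrame is rebuilt in memory as a stand-in for the result printed above (so it runs without the Excel file), and flatten_header is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for the DataFrame printed above (assumption: same two-level
# columns, with '' where the sub-header cell was empty).
columns = pd.MultiIndex.from_tuples([
    ("ColA", ""), ("ColB", ""),
    ("ColC", 1), ("ColC", 2), ("ColC", 3),
    ("ColD", ""),
])
df = pd.DataFrame(
    [
        [pd.Timestamp("2010-01-01"), "A", "A", 2.1, pd.Timestamp("2010-02-01"), "E"],
        [pd.Timestamp("2010-01-02"), "B", "C", 2.2, pd.Timestamp("2010-02-02"), "F"],
    ],
    columns=columns,
)


def flatten_header(top, sub):
    """Hypothetical helper: join both levels with a dot, or keep only
    the top label when the sub-label is empty."""
    return str(top) if str(sub) == "" else f"{top}.{sub}"


# Replace the MultiIndex with a flat list of string labels
flat = df.copy()
flat.columns = [flatten_header(top, sub) for top, sub in df.columns]

print(flat.columns.tolist())
# ['ColA', 'ColB', 'ColC.1', 'ColC.2', 'ColC.3', 'ColD']
```

The data and dtypes are untouched; only the column labels change, so the result can still be selected with flat["ColC.2"].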