I want to use Python Pandas to read an Excel file that looks like this:
https://www.dropbox.com/s/1usfr3fxfy2qlpp/header_with_merged_cells.xlsx?dl=0
We can see that this Excel file has a header with merged cells.
I tried:
import pandas as pd
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3)
print(df)
print(df.dtypes)
print(df.columns)
It returns a DataFrame like:
        ColA ColB ColC Unnamed: 3          Unnamed: 4 ColD
0        NaT  NaN    1        2.0                   3  NaN
1 2010-01-01    A    A        2.1 2010-02-01 00:00:00    E
2 2010-01-02    B    C        2.2 2010-02-02 00:00:00    F
The dtypes look like:
ColA          datetime64[ns]
ColB                  object
ColC                  object
Unnamed: 3           float64
Unnamed: 4            object
ColD                  object
The columns look like:
Index(['ColA', 'ColB', 'ColC', 'Unnamed: 3', 'Unnamed: 4', 'ColD'], dtype='object')
Is there a way to fix the columns to get ColA, ColB, ColC.1, ColC.2, ColC.3, ColD, or MultiIndex columns?
One issue is that the D5 cell is read as a float (instead of int or str). Another issue is that column E should be read as datetime64[ns].
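Both symptoms can be traced in the output above: the second header row (1, 2, 3) is read as the first data row, so the 2 in D5 is shown as 2.0 because it shares a column with 2.1 and 2.2, and column E mixes an integer with dates. Here is a rough sketch of repairing such a frame after the fact. It assumes the layout shown above; the frame is rebuilt in memory as a stand-in for the real file, and the names raw, names and fixed are only illustrative.

```python
import pandas as pd

# In-memory stand-in for the frame printed above (assumption: same layout as the real file).
raw = pd.DataFrame(
    {
        "ColA": [pd.NaT, pd.Timestamp("2010-01-01"), pd.Timestamp("2010-01-02")],
        "ColB": [None, "A", "B"],
        "ColC": [1, "A", "C"],
        "Unnamed: 3": [2.0, 2.1, 2.2],
        "Unnamed: 4": [3, pd.Timestamp("2010-02-01"), pd.Timestamp("2010-02-02")],
        "ColD": [None, "E", "F"],
    }
)

sub_header = raw.iloc[0]  # the second header row, which was read as data

names = []
parent = None
for col, value in zip(raw.columns, sub_header):
    # A real (non-placeholder) label starts a new group of merged cells.
    if not str(col).startswith("Unnamed"):
        parent = col
    if pd.isna(value):
        # No sub-label under this header cell: keep the top-level name.
        names.append(parent)
    else:
        # Whole-number floats such as 2.0 are written back as 2.
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        names.append(f"{parent}.{value}")

# Drop the sub-header row, apply the combined names, then repair the dtypes.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = names
fixed["ColC.3"] = pd.to_datetime(fixed["ColC.3"])
fixed = fixed.infer_objects()

print(list(fixed.columns))
print(fixed.dtypes)
```

With the sub-header row removed, the former column E holds only dates, so pd.to_datetime can convert it.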
The header parameter of read_excel can help:
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0,1])
But we get a DataFrame like:
ColA                     ColB ColC                              ColD
           Unnamed: 0_level_1    1   2          3 Unnamed: 4_level_1
2010-01-01                  A    A 2.1 2010-02-01                  E
2010-01-02                  B    C 2.2 2010-02-02                  F
The dtypes look like:
ColA
ColB  Unnamed: 0_level_1            object
ColC  1                             object
      2                            float64
      3                     datetime64[ns]
ColD  Unnamed: 4_level_1            object
dtype: object
The columns look like:
MultiIndex(levels=[['ColB', 'ColC', 'ColD'], [1, 2, 3, 'Unnamed: 0_level_1', 'Unnamed: 4_level_1']],
           labels=[[0, 1, 1, 1, 2], [3, 0, 1, 2, 4]],
           names=['ColA', None])
It's odd to see columns such as Unnamed: 0_level_1 and Unnamed: 4_level_1.
Isn't there a way to fix this?
It is not easy.
First pass the header parameter to create a MultiIndex, then rename the Unnamed column names to empty strings.
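import pandas as pd

# Read both header rows as a two-level MultiIndex
df = pd.read_excel("header_with_merged_cells.xlsx", skiprows=3, header=[0, 1])
# In the output above ColA ended up as the index; move it back into the columns
df = df.reset_index()
# Blank out the placeholder labels generated for empty header cells
df = df.rename(columns=lambda x: '' if 'Unnamed' in str(x) else x)
df = df.rename(columns={'index': 'ColA'})
df.columns.names = (None, None)
print(df)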
        ColA ColB ColC                ColD
                     1   2          3
0 2010-01-01    A    A 2.1 2010-02-01    E
1 2010-01-02    B    C 2.2 2010-02-02    F
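If you would rather have the flat names from the question (ColA, ColB, ColC.1, ColC.2, ColC.3, ColD) than a MultiIndex, the two levels can be joined afterwards. This is a minimal sketch, assuming the frame has the shape printed above; it is rebuilt in memory here so the example runs without the Excel file, and flatten_label is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Stand-in for the frame produced above (assumption: same columns and values).
columns = pd.MultiIndex.from_tuples(
    [("ColA", ""), ("ColB", ""), ("ColC", 1), ("ColC", 2), ("ColC", 3), ("ColD", "")]
)
df = pd.DataFrame(
    [
        [pd.Timestamp("2010-01-01"), "A", "A", 2.1, pd.Timestamp("2010-02-01"), "E"],
        [pd.Timestamp("2010-01-02"), "B", "C", 2.2, pd.Timestamp("2010-02-02"), "F"],
    ],
    columns=columns,
)


def flatten_label(label):
    """Join the non-empty levels of one column label with a dot."""
    return ".".join(str(part) for part in label if str(part) != "")


# Iterating a MultiIndex yields one tuple per column, e.g. ("ColC", 1).
flat = df.copy()
flat.columns = [flatten_label(label) for label in df.columns]

print(list(flat.columns))
```

Working on a copy leaves the MultiIndex version available, in case you still want to select everything under ColC with df["ColC"].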