Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read excel sheet with multiple header when first column is empty

I have an excel sheet like this:

Excel sheet

I want to read it with pandas read_excel and I tried this:

df = pd.read_excel("test.xlsx", header=[0,1])

but it throws me this error:

ParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Any suggestions?

like image 586
Alexandra Espichán Avatar asked May 22 '18 16:05

Alexandra Espichán


2 Answers

If you don't mind massaging the DataFrame after reading the Excel you can try the below two ways:

>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
  header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0    col1       col2       col3    col4       col5
1       a          0          x       3          d
2       b          1          y       4          e
3       c          2          z       5          f

In above, you'd have to fix the first level of the MultiIndex since header1 and header2 are merged cells

>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols = "B:F", 
skiprows=[0])
        header1      header2
header1    col1 col2    col3 col4
a             0    x       3    d
b             1    y       4    e
c             2    z       5    f

In above, it got pretty close by skipping the empty row and parsing only columns (B:F) with data. If you notice, the columns got shifted though...

Note Not a clean solution but just wanted to share samples with you in a post rather than a comment

-- Edit based on discussion with OP --

Based on documentation for pandas read_excel, header[1,2] is creating a MultiIndex for your columns. Looks like it determines the labels for the DataFrame depending on what is populated in Column A. Since there's nothing there... the index has a bunch of Nan like so

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
    header1           header2
       col1 col2 col3    col4 col5
NaN       a    0    x       3    d
NaN       b    1    y       4    e
NaN       c    2    z       5    f

Again if you're okay with cleaning up columns and if the first column of the xlsx is always blank... you can drop it like below. Hopefully this is what you're looking for.

>>> pd.read_excel("/tmp/sample.xlsx", header[1,2]).reset_index().drop(['index'], level=0, axis=1)
  header1           header2
     col1 col2 col3    col4 col5
0       a    0    x       3    d
1       b    1    y       4    e
2       c    2    z       5    f
like image 131
Orenshi Avatar answered Sep 28 '22 09:09

Orenshi


Here is the documentation on the header parameter:

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

I think the following should work:

df = pd.read_excel("test.xlsx", skiprows=2, usecols='B:F', header=0)
like image 31
BallpointBen Avatar answered Sep 28 '22 10:09

BallpointBen