Read Excel sheet with multiple headers using pandas

I have an Excel sheet with multiple header rows, like this:

_________________________________________________________________________
____|_____|        Header1    |        Header2     |        Header3      |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK|
1   | ds  | 5  | 6  |9   |10  | .......................................
2   | dh  |  ..........................................................
3   | ge  |  ..........................................................
4   | ew  |  ..........................................................
5   | er  |  ..........................................................

Here the first two columns have no top-level header (those cells are blank), while the other columns fall under Header1, Header2 and Header3. I want to read this sheet and merge it with other sheets that have a similar structure.

I want to merge on the first column, 'ColX'. Right now I am doing this:

import pandas as pd

totalMergedSheet = pd.DataFrame([1,2,3,4,5], columns=['ColX'])
file = pd.ExcelFile('ExcelFile.xlsx')
for i in range(1, len(file.sheet_names)):
    df1 = file.parse(file.sheet_names[i-1])
    df2 = file.parse(file.sheet_names[i])
    newMergedSheet = pd.merge(df1, df2, on='ColX')
    totalMergedSheet = pd.merge(totalMergedSheet, newMergedSheet, on='ColX')

But it does not read the columns correctly, and I don't think it will return the result I want. The resulting frame should look like this:

________________________________________________________________________________________________________
____|_____|        Header1    |        Header2     |        Header3      |        Header4     |        Header5      |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK| ColL|ColM|ColN|ColO||ColP|ColQ|ColR|ColS|
1   | ds  | 5  | 6  |9   |10  | ..................................................................................
2   | dh  |  ...................................................................................
3   | ge  |  ....................................................................................
4   | ew  |  ...................................................................................
5   | er  |  ......................................................................................

Any suggestions, please? Thanks.

asked Nov 11 '16 18:11 by muazfaiz



1 Answer

[See comments for updates and corrections]

Pandas already has a function that will read in an entire Excel spreadsheet for you, so you don't need to manually parse and merge each sheet. Take a look at pandas.read_excel(). It not only lets you read in an Excel file in a single line, it also provides options that help with the problem you're having.

Since you have subcolumns, what you're looking for is a MultiIndex. By default, pandas reads the top row as the sole header row. You can pass a header argument to pandas.read_excel() that lists which rows are to be used as headers. In your case you'd want header=[0, 1], meaning the first two rows. You might also have multiple sheets, so you can pass sheet_name=None as well (this tells it to read all sheets; older pandas releases spelled this argument sheetname). The command would be:

import pandas
df_dict = pandas.read_excel('ExcelFile.xlsx', header=[0, 1], sheet_name=None)

This returns a dictionary where the keys are the sheet names and the values are the DataFrames for each sheet. If you want to collapse it all into one DataFrame, you can use pandas.concat, which with axis=0 stacks the sheets' rows on top of one another:

df = pandas.concat(df_dict.values(), axis=0)
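
The layout the question asks for puts the header groups next to each other rather than stacking rows, so here is a rough sketch of one way that might be done. It is not taken from the original answer. It builds two small in-memory frames with two-level columns as stand-ins for the sheets (the helper make_sheet, the sample values, and the Header4 sheet are all hypothetical), and it assumes each sheet has already been indexed by ColX, for example through the index_col argument of read_excel, which is untested against this particular layout. It also shows how the two header levels can be used for selection.

```python
import pandas as pd


def make_sheet(header, subcolumns, rows):
    """Build a stand-in for one sheet as read with header=[0, 1] (hypothetical helper)."""
    columns = pd.MultiIndex.from_product([[header], subcolumns])
    frame = pd.DataFrame(rows, columns=columns)
    # Assumption: ColX has been made the index, so it is not a data column.
    frame.index = pd.Index([1, 2, 3], name="ColX")
    return frame


# Hypothetical data standing in for the dictionary returned by read_excel.
df_dict = {
    "Sheet1": make_sheet("Header1", ["ColA", "ColB"], [[5, 6], [7, 8], [9, 10]]),
    "Sheet2": make_sheet("Header4", ["ColL", "ColM"], [[1, 2], [3, 4], [5, 6]]),
}

# The outer header level selects a whole block of subcolumns...
header1_block = df_dict["Sheet1"]["Header1"]
# ...and a (header, subcolumn) tuple selects a single column.
col_a = df_dict["Sheet1"][("Header1", "ColA")]

# axis=1 places the sheets side by side, aligning rows on the ColX index;
# join="inner" keeps only the ColX values present in every sheet.
wide = pd.concat(list(df_dict.values()), axis=1, join="inner")
print(wide)
```

With real data, df_dict would come from the read_excel call, and join="outer" could be used instead if rows missing from some sheets should be kept.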

answered Oct 04 '22 15:10 by beeftendon