
Python Loop through Excel sheets, place into one df

I have an Excel file foo.xlsx with about 40 sheets (sh1, sh2, etc.). Each sheet has the format:

area      cnt   name\nparty1   name\nparty2
blah      9         5               5
word      3         7               5

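In each sheet I want to rename the columns whose names have the form name\nparty so that only the party remains as the label, and then combine all the sheets into one dataframe with a column recording the source sheet. Example output: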

area      cnt    party1    party2     sheet
bacon     9         5         5        sh1
spam      3         7         5        sh1
eggs      2         18        4        sh2

I am reading in the file with:

book = pd.ExcelFile(path)  

And then wondering if I need to do:

for f in filelist:
    df = pd.ExcelFile.parse(book,sheetname=??)
    'more operations here'
    # only change column names 2 and 3
    for i, col in enumerate(df):
        if i>=2 and i<=3:
            new_col_name = col.split("\n")[-1]
            df[new_col_name] = 

Or something like that?

Yolo_chicken asked Jun 14 '17 15:06



1 Answer

The pandas read_excel function lets you read all sheets in at once if you set the keyword parameter sheet_name=None (in some older versions of pandas this was called sheetname). This returns a dictionary: the keys are the sheet names, and the values are the sheets as dataframes.

Using this, we can simply loop through the dictionary and:

  1. Add an extra column to each dataframe containing the name of its sheet.
  2. Use the rename method to rename our columns. The lambda splits each column name on newlines and keeps the final piece; a name with no newline is left unchanged.
  3. Append each dataframe to a list, to be combined at the end.
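To see step 2 in isolation, here is a minimal sketch of what the lambda does to each column label. The helper name strip_prefix is made up for illustration; the labels are the ones from the question.

```python
# Hypothetical helper mirroring the lambda passed to rename() in step 2.
def strip_prefix(label):
    # Keep only the text after the last newline; a label without a
    # newline comes back unchanged.
    return label.split('\n')[-1]

columns = ['area', 'cnt', 'name\nparty1', 'name\nparty2']
renamed = [strip_prefix(c) for c in columns]
print(renamed)  # ['area', 'cnt', 'party1', 'party2']
```

This assumes every column label is a string; a numeric header would have no split method and would need converting with str() first.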

Once this is done, we combine all the sheets into one with pd.concat. Then we reset the index and all should be well. Note: if you have parties present on one sheet but not others, this will still work but will fill any missing columns for each sheet with NaN.

import pandas as pd

sheets_dict = pd.read_excel('Book1.xlsx', sheet_name=None)

all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)

print(full_table)

Prints:

    area  cnt  party1  party2   sheet
0  bacon    9       5       5  Sheet1
1   spam    3       7       5  Sheet1
2   eggs    2      18       4  Sheet2
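The note about parties that appear on only some sheets can be checked without an Excel file. The sketch below is only an illustration: it builds a small dictionary of dataframes by hand as a stand-in for what read_excel returns with sheet_name=None (the sheet names, values, and the party3 column are invented), and it passes ignore_index=True to pd.concat in place of the separate reset_index call, which should give the same 0..n-1 index.

```python
import pandas as pd

# Stand-in for the dict returned by read_excel(..., sheet_name=None).
# Sheet names and values are invented; sh2 has party3 instead of party2.
sheets_dict = {
    'sh1': pd.DataFrame({'area': ['bacon', 'spam'], 'cnt': [9, 3],
                         'name\nparty1': [5, 7], 'name\nparty2': [5, 5]}),
    'sh2': pd.DataFrame({'area': ['eggs'], 'cnt': [2],
                         'name\nparty1': [18], 'name\nparty3': [4]}),
}

all_sheets = []
for name, sheet in sheets_dict.items():
    # Strip everything up to the last newline in each column label.
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    # Record which sheet each row came from.
    sheet['sheet'] = name
    all_sheets.append(sheet)

# ignore_index=True renumbers the rows 0..n-1 during the concat.
full_table = pd.concat(all_sheets, ignore_index=True)
print(full_table)
```

In the result, party2 is NaN for the sh2 row and party3 is NaN for both sh1 rows. Columns containing NaN are shown as floats, so cast them back if integer counts are needed.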
asongtoruin answered Oct 03 '22 12:10