
Columns with 'None' header when importing from xlsx to pandas

Importing a heavily formatted Excel worksheet into pandas produces some columns that are entirely blank and show up as 'None' in df.columns. I need to remove these columns, but the output I get when inspecting them makes it hard to work out how to drop them.

Edited for clarity:

The Excel worksheet is heavily formatted and must be reshaped before the data can be used in analysis. In essence, col A is a list of questions, col B is an explanation of each question, and col C is the response to each question. The desired result is that col A becomes the header of a tabular dataset, col B is dropped, and col C becomes the first row. This then needs to be saved in such a way that col C of another copy of the Excel worksheet (filled out for another client) can be appended to the tabular dataset.
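A minimal sketch of that reshape on made-up data, not the real workbook: the frame `raw`, its column names, and the values are hypothetical stand-ins for cols A to C. It also shows how a blank spacer row in the questions column ends up as a missing column header.

```python
import pandas as pd

# Hypothetical stand-in for the worksheet: question, explanation, response.
# The middle row imitates a blank formatting row in the sheet.
raw = pd.DataFrame(
    {
        "question": ["Client code", None, "Sex"],
        "explanation": ["Unique ID", None, "As recorded"],
        "response": ["D10A0021", None, "F"],
    }
)

# Questions become the header, responses become a single row;
# the explanation column is simply not carried over.
wide = pd.DataFrame(
    [raw["response"].tolist()],
    columns=raw["question"].tolist(),
)

# The blank spacer row is now a column with a missing header.
print(wide.columns.isna())
```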

I have been able to import the worksheet into Python and pandas, transpose the data, and do some minimal reshaping and cleaning.

Example code:

import os
import pandas as pd
import xlwings as xw


dir_path = "C:\\Users\\user.name\\directory\\project\\data\\january"
file_path = "C:\\Users\\user.name\\directory\\project\\data\\january\\D10A0021_10.01.20.xlsx"


os.chdir(dir_path)  # set the working directory
wb = xw.Book(file_path, password='mypassword')  # have Python open the workbook
demographics = wb.sheets[0]  # select the demographics sheet


df = demographics['B2:D33'].options(pd.DataFrame, index=False, header=True).value  # import all the used cells into pandas
df.columns = [0, 1, 2]  # add column names that I can track
df = df.T  # transpose the data
df.columns = df.loc[0]  # turn the question items into the column headers
df = df.loc[2:]  # remove the unneeded first and second rows from the set


for num, col in enumerate(df.columns):
    print(f'{num}: {col}')  # This loop fixed one of the issues. Suggested by Datanovice.



Output: 
0: Client code
1: Client's date of birth
2: Sex
3: Previous symptom recurrence                               
4: None
5: Has the client attended Primary Care Psychology in the past? 

6: None
7: Ethnicity
8: None
9: Did the parent/ guardian/ carer require help completing the scales due to literacy difficulties?
10: Did the parent/ guardian/ carer require help completing the scales due to perceived complexity of questionnaires?
11: Did the client require help completing the scales due to literacy difficulties?
12: Did the client require help completing the scales due to perceived complexity of questionnaires?
13: Accommodation status  
14: None
15: Relationship with main carer
16: None
17: Any long term stressors
18: Referral source
19: Referral date
20: Referral reason
21: Actual presenting difficulty (post formulation) 
22: Date first seen
23: Discharge date
24: Reason for terminating treatment
25: None
26: Type of intervention
27: Total number of sessions offered (including DNA’s CNA’s)
28: No. of sessions: attended (by type of intervention)
29: No. of sessions: did not attend (by type of intervention)
30: No. of sessions: could not attend (by type of intervention)
31




I need to be able to remove any column that has 'None' in the header before re-exporting the data to another Excel worksheet, which can then be updated with new data as new client records are submitted.

Any advice would be much appreciated.

KevOMalley743 asked Nov 07 '22 09:11


1 Answer

So your Excel sheet has some cells without data, and xlwings reads cells without data as NaN/None by default. After the transpose, the empty question cells end up as column names.

You can keep only the columns whose name is not None:

cols = [x for x in df.columns if x is not None]
df = df[cols]

df then contains only the relevant columns.
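As a variation, a boolean mask built with `Index.notna()` does the same job and also catches headers that arrive as NaN rather than None, which an `is not None` check would let through. This is a sketch on a hypothetical one-row frame, not the actual workbook; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical one-row frame mimicking the transposed sheet,
# with two blank (None) column headers.
df = pd.DataFrame(
    [["D10A0021", None, "F", None]],
    columns=["Client code", None, "Sex", None],
)

# True where the header is present, False where it is None/NaN.
keep = df.columns.notna()

# Select columns by position with the mask, so the duplicate
# missing labels never have to be looked up by name.
cleaned = df.loc[:, keep]

print(list(cleaned.columns))
```

Because the selection is by mask rather than by label, it behaves the same however many blank headers there are.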

villoro answered Nov 14 '22 22:11