Importing a heavily formatted Excel worksheet into pandas results in some columns that are entirely blank and show up as 'None' when viewing df.columns. I need to remove these columns, but I'm getting some strange output that makes it hard for me to figure out exactly how to drop them.
****Editing for clarity****
The Excel worksheet is heavily formatted and must be reshaped before the data can be used in analysis. In essence, col A is a list of questions, col B is an explanation of each question, and col C is the response to each question. The desired result is that col A becomes the header of a tabular dataset, col B is dropped, and col C becomes the first row. This then needs to be saved in such a way that col C of another copy of the Excel worksheet (which would be filled out for another client) can be appended to the tabular dataset.
I have been able to import the worksheet into Python and pandas, transpose the data, and do some minimal reshaping and cleaning.
example code:
import os
import pandas as pd
import xlwings as xw
dir_path = "C:\\Users\\user.name\\directory\\project\\data\\january"
file_path = "C:\\Users\\user.name\\directory\\project\\data\\january\\D10A0021_10.01.20.xlsx"
os.chdir(dir_path)  # set the working directory
wb = xw.Book(file_path, password='mypassword')  # have Python open the workbook
demographics = wb.sheets[0]  # select the demographics sheet
df = demographics['B2:D33'].options(pd.DataFrame, index=False, header=True).value  # import all the used cells into pandas
df.columns = [0, 1, 2]  # add column names that I can track
df = df.T  # transpose the data
df.columns = df.loc[0]  # turn the question items into the column headers
df = df.loc[2:]  # remove the unneeded first and second rows from the set
for num, col in enumerate(df.columns):
    print(f'{num}: {col}')  # This loop fixed one of the issues. Suggested by Datanovice.
Output:
0: Client code
1: Client's date of birth
2: Sex
3: Previous symptom recurrence
4: None
5: Has the client attended Primary Care Psychology in the past?
6: None
7: Ethnicity
8: None
9: Did the parent/ guardian/ carer require help completing the scales due to literacy difficulties?
10: Did the parent/ guardian/ carer require help completing the scales due to perceived complexity of questionnaires?
11: Did the client require help completing the scales due to literacy difficulties?
12: Did the client require help completing the scales due to perceived complexity of questionnaires?
13: Accommodation status
14: None
15: Relationship with main carer
16: None
17: Any long term stressors
18: Referral source
19: Referral date
20: Referral reason
21: Actual presenting difficulty (post formulation)
22: Date first seen
23: Discharge date
24: Reason for terminating treatment
25: None
26: Type of intervention
27: Total number of sessions offered (including DNA’s CNA’s)
28: No. of sessions: attended (by type of intervention)
29: No. of sessions: did not attend (by type of intervention)
30: No. of sessions: could not attend (by type of intervention)
31
I need to be able to remove any column that has 'None' in the header before re-exporting the data to another Excel worksheet, which can then be updated with new data as new client records are submitted.
Any advice would be much appreciated.
So you have an Excel sheet that has some columns without data, and xlwings will set all cells without data as NaN/None by default.
What you can do is keep only the columns whose name is not None with:
cols = [x for x in df.columns if x is not None]
df = df[cols]
Then df will only keep the relevant columns.
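To try this without the workbook, here is a minimal sketch on a small made-up DataFrame. The values and the shortened header list are hypothetical stand-ins for the transposed sheet, and the object-dtype index is an assumption meant to mimic headers taken from a row of cell values. It shows the list comprehension from above alongside a boolean-mask variant using Index.notna(), which also catches headers that came through as NaN rather than None.

```python
import pandas as pd

# Hypothetical stand-in for the transposed worksheet: one response row,
# with None headers where the sheet had blank question cells.
headers = pd.Index(
    ["Client code", "Sex", None, "Referral source", None], dtype=object
)
df = pd.DataFrame([["D10A0021", "F", None, "GP", None]], columns=headers)

# Option 1: keep only the labels that are not None (the approach above).
cols = [c for c in df.columns if c is not None]
filtered = df[cols]

# Option 2: boolean mask over the column index; notna() treats both
# None and NaN as missing.
masked = df.loc[:, df.columns.notna()]

print(list(filtered.columns))
print(list(masked.columns))
```

Either result can then be written out or appended to as usual; the mask version may be the safer choice if some blank headers arrive as NaN.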