Pandas - Compute z-score for all columns

I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:

ID      Age    BMI     Risk Factor
PT 6    48     19.3    4
PT 8    43     20.9    NaN
PT 2    39     18.1    3
PT 9    41     19.5    NaN

Some of my columns contain NaN values, which I do not want to include in the z-score calculations, so I intend to use a solution offered for this question: how to zscore normalize pandas column with nans?

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0) 

I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using

df2.to_excel("Z-Scores.xlsx") 

So basically, how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.

Slavatron asked Jul 15 '14 15:07

1 Answer

Build a list from the columns and remove the column you don't want to calculate the Z score for:

In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]

Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN

In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df

Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  \
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315

   Factor_zscore
0              1
1            NaN
2             -1
3            NaN
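The loop above adds the z-score columns to the original frame. Since the question asks for a separate dataframe, here is a minimal sketch of a vectorized variant that builds a new df2 instead. It is not part of the original answer. The sample frame is rebuilt from the table in the question on the assumption that "PT 6" etc. are the IDs and "Risk Factor" is a single column, and the "_zscore" suffix is borrowed from the answer's naming.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (assumed column layout).
df = pd.DataFrame({
    "ID": ["PT 6", "PT 8", "PT 2", "PT 9"],
    "Age": [48, 43, 39, 41],
    "BMI": [19.3, 20.9, 18.1, 19.5],
    "Risk Factor": [4, np.nan, 3, np.nan],
})

# Keep only the numeric columns; the string ID column drops out automatically.
numeric = df.select_dtypes(include="number")

# mean() and std() skip NaN by default (skipna=True), so missing values
# do not influence the statistics and simply remain NaN in the result.
zscores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Assemble a new frame: ID first, then the z-scored columns.
df2 = pd.concat([df[["ID"]], zscores.add_suffix("_zscore")], axis=1)

# Writing to Excel needs an engine such as openpyxl to be installed:
# df2.to_excel("Z-Scores.xlsx", index=False)
```

No index handling is needed here: both pieces keep the default row labels 0 to 3, so pd.concat lines the rows up on its own, and the subtraction and division are matched by column name.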
EdChum answered Sep 22 '22 06:09