
How can I normalize the data in a range of columns in my pandas dataframe

Tags:

python

pandas

Suppose I have a pandas data frame surveyData:

I want to normalize the data in each column by performing:

surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min()) 

This would work fine if my data table only contained the columns I wanted to normalize. However, some of the preceding columns contain string data, like:

Name  State  Gender  Age  Income  Height
Sam   CA     M       13   10000   70
Bob   AZ     M       21   25000   55
Tom   FL     M       30   100000  45

I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State, and Gender columns.

Jeremy asked Feb 18 '15 05:02


People also ask

How do I normalize all columns in Pandas?

There are two common approaches. Standardization subtracts the column mean and then divides by the column standard deviation. Min-max scaling rescales each column so that its minimum is 0 and its maximum is 1.

How do you normalize a range in Python?

You can normalize data to the 0 to 1 range by using the formula (data - np.min(data)) / (np.max(data) - np.min(data)).
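As a minimal sketch of that formula, assuming NumPy is available: the helper name `min_max_scale` and the sample `ages` array are illustrative only, and the guard for a constant input is an added assumption rather than part of the formula.

```python
import numpy as np


def min_max_scale(data):
    """Rescale a 1-D array-like to the [0, 1] range (hypothetical helper)."""
    arr = np.asarray(data, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Assumption: for constant input, return zeros instead of dividing by zero.
        return np.zeros_like(arr)
    return (arr - arr.min()) / span


# Sample values taken from the Age column in the question.
ages = np.array([13, 21, 30])
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest value maps to 0 and the largest to 1; everything else lands proportionally in between.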


2 Answers

You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:

# Assuming the same data as in your example
cols_to_norm = ['Age', 'Income', 'Height']
# Min-max scale each selected column and write the result back in place
surveyData[cols_to_norm] = surveyData[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

This applies the normalization only to the columns you want and assigns the result back to those columns. Alternatively, you could assign the result to new columns and keep the originals.
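One way the "keep the originals" variant might look, as a sketch rather than a definitive implementation. It rebuilds the sample table from the question, uses the question's mean-centered formula, and picks the numeric columns with `select_dtypes` instead of listing them by hand. The `_norm` suffix is an arbitrary naming choice made for this example.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
surveyData = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})

# Select every numeric column automatically; string columns are skipped.
cols_to_norm = surveyData.select_dtypes(include="number").columns

# Apply the formula from the question to the numeric columns only.
numeric = surveyData[cols_to_norm]
normalized = (numeric - numeric.mean()) / (numeric.max() - numeric.min())

# Keep the original columns and append the normalized ones with a suffix.
surveyData = surveyData.join(normalized.add_suffix("_norm"))
print(surveyData)
```

Selecting by dtype is convenient when there are many numeric columns, but it will also pick up numeric ID or code columns, so an explicit list is safer when the table contains those.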

cwharland answered Sep 19 '22 09:09


I think it's better to use sklearn.preprocessing in this case, since it offers many more scaling options. With StandardScaler, your case would look like this:

from sklearn.preprocessing import StandardScaler

cols_to_norm = ['Age', 'Income', 'Height']
# Standardize each selected column to zero mean and unit variance
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
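To illustrate the other scaling options, here is a hedged sketch that assumes scikit-learn is installed and reuses the sample table from the question. It compares MinMaxScaler with StandardScaler and writes the results to separate frames (`min_max` and `standardized`, names chosen for this example) so the original data stays untouched. Note that StandardScaler divides by the population standard deviation, which corresponds to `std(ddof=0)` in pandas rather than the pandas default of `ddof=1`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data reconstructed from the table in the question.
surveyData = pd.DataFrame({
    "Name": ["Sam", "Bob", "Tom"],
    "State": ["CA", "AZ", "FL"],
    "Gender": ["M", "M", "M"],
    "Age": [13, 21, 30],
    "Income": [10000, 25000, 100000],
    "Height": [70, 55, 45],
})
cols_to_norm = ["Age", "Income", "Height"]
subset = surveyData[cols_to_norm]

# Min-max scaling: each column is mapped onto the [0, 1] range.
min_max = pd.DataFrame(
    MinMaxScaler().fit_transform(subset),
    columns=cols_to_norm,
    index=surveyData.index,
)

# Standardization: zero mean and unit variance per column.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(subset),
    columns=cols_to_norm,
    index=surveyData.index,
)

# The same standardization in plain pandas, using the population std (ddof=0).
manual = (subset - subset.mean()) / subset.std(ddof=0)
print(min_max)
print(standardized)
```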
Yaron answered Sep 21 '22 09:09