
Python pandas: Best way to normalize data? [duplicate]

I have a large pandas dataframe with about 80 columns. Each column reports daily traffic statistics for one website (the columns are the websites).

As I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.

Date        A      B      ...
10/10/2010  100.0  402.0  ...
11/10/2010  250.0  800.0  ...
12/10/2010  800.0  2000.0 ...
13/10/2010  400.0  1800.0 ...

That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry that I cannot provide the full data.

Rnaldinho asked Oct 22 '16 21:10



1 Answer

First, turn your Date column into an index.

dates = df.pop('Date')
df.index = dates
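
If the dates should behave as real timestamps rather than strings, the same step can be sketched as below. This is only an illustration on the sample rows from the question; the day/month/year format is an assumption inferred from the value "13/10/2010".

```python
import pandas as pd

# Sample rows from the question (only column A, for brevity)
df = pd.DataFrame({
    "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
    "A": [100.0, 250.0, 800.0, 400.0],
})

# Assumption: dates are day/month/year, as "13/10/2010" suggests
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Move the parsed dates into the index so only numeric columns remain
df = df.set_index("Date")

print(df.index)
```

With the dates in the index, arithmetic such as `df - df.mean()` only touches the traffic columns.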

Then use either z-score normalization:

df1 = (df - df.mean()) / df.std()

or min-max scaling:

df2 = (df - df.min()) / (df.max() - df.min())
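
As a rough end-to-end sketch, both formulas can be applied to the four sample rows from the question. Multiplying the min-max result by 100 gives the 0 to 100 range the question asks for; the variable names `z`, `minmax` and `minmax_100` are just illustrative.

```python
import pandas as pd

# Sample rows copied from the question; the real frame has about 80 columns
df = pd.DataFrame({
    "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
    "A": [100.0, 250.0, 800.0, 400.0],
    "B": [402.0, 800.0, 2000.0, 1800.0],
})
df = df.set_index("Date")

# z-score: each column ends up with mean 0 and sample standard deviation 1
z = (df - df.mean()) / df.std()

# min-max: each column is mapped onto [0, 1] ...
minmax = (df - df.min()) / (df.max() - df.min())

# ... and can then be stretched to [0, 100]
minmax_100 = minmax * 100

print(z.round(2))
print(minmax_100.round(1))
```

Both operations work column by column, because pandas aligns the Series returned by `df.mean()`, `df.min()` and so on with the column labels of `df`.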

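I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers: a single extreme value sets the minimum or maximum and squeezes every other value into a narrow band.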
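
To make the outlier effect concrete, here is a small sketch on a hypothetical series (the numbers are made up for illustration, not taken from the question's data): five ordinary values and one extreme spike.

```python
import pandas as pd

# Hypothetical traffic values: steady days plus one extreme spike
s = pd.Series([100.0, 110.0, 90.0, 105.0, 95.0, 10000.0])

# min-max: the spike becomes 1.0 and defines the whole range
minmax = (s - s.min()) / (s.max() - s.min())

# z-score: uses the mean and sample standard deviation of all values
zscore = (s - s.mean()) / s.std()

print(minmax.round(4).tolist())
print(zscore.round(2).tolist())
```

Under min-max scaling the ordinary days all land within a tiny sliver near 0, because the range is determined by the two most extreme values alone. The z-score is not bounded by the spike in the same way, but it is not immune either: the spike also inflates the mean and standard deviation, so it is worth inspecting columns with strong outliers before settling on one method.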

User191919 answered Sep 18 '22 23:09