I have a large pandas dataframe with about 80 columns. Each of the 80 columns in the dataframe reports daily traffic statistics for a website (the columns are the websites).
I don't want to work with the raw traffic statistics; I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.
Date A B ...
10/10/2010 100.0 402.0 ...
11/10/2010 250.0 800.0 ...
12/10/2010 800.0 2000.0 ...
13/10/2010 400.0 1800.0 ...
That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry that I cannot provide the full data.
First, turn your Date column into an index.
dates = df.pop('Date')
df.index = dates
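For reference, here is a minimal, self-contained sketch of this step using only the four sample rows from the question. Parsing the dates as day/month/year and using `set_index` instead of `pop` are my own choices for the illustration, not something the question specifies.

```python
import pandas as pd

# Sample rows copied from the question (only columns A and B are shown there).
df = pd.DataFrame({
    "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
    "A": [100.0, 250.0, 800.0, 400.0],
    "B": [402.0, 800.0, 2000.0, 1800.0],
})

# Assumption: the dates are day/month/year ("13/10/2010" cannot be month-first).
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Move the dates into the index so that only numeric columns remain.
df = df.set_index("Date")
print(df)
```

With the dates in the index, arithmetic on `df` only touches the traffic columns.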
Then either use z-score normalization:
df1 = (df - df.mean())/df.std()
or min-max scaling:
df2 = (df-df.min())/(df.max()-df.min())
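I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers: a single extreme value defines the range of a column and squeezes all of its other values toward one end of the scale.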
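Since the question asks for an example and mentions a 0 to 100 range, here is a small sketch that applies both formulas to the four sample rows from the question. The variable names `z_scores` and `scaled_0_100` are made up for this illustration, and multiplying the min-max result by 100 is just one straightforward way to get the 0 to 100 range.

```python
import pandas as pd

# Sample rows from the question, with the dates already used as the index.
df = pd.DataFrame(
    {
        "A": [100.0, 250.0, 800.0, 400.0],
        "B": [402.0, 800.0, 2000.0, 1800.0],
    },
    index=pd.to_datetime(
        ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
        format="%d/%m/%Y",
    ),
)

# z-score: each column ends up with mean 0 and (sample) standard deviation 1.
z_scores = (df - df.mean()) / df.std()

# min-max scaling, stretched to 0..100 instead of 0..1.
scaled_0_100 = 100 * (df - df.min()) / (df.max() - df.min())

print(z_scores)
print(scaled_0_100)
```

Both operations work column by column, so they should carry over to all 80 columns unchanged. Keep in mind that z-scores are not confined to a fixed range, so if you strictly need values between 0 and 100, only the min-max version guarantees that. A column with a single constant value would produce NaN in both cases because of the division by zero.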