Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Normalize data in pandas

People also ask

How do you normalize data in Python?

Using MinMaxScaler() to Normalize Data in Python This is a more popular choice for normalizing datasets. You can see that the values in the output are between (0 and 1). MinMaxScaler also gives you the option to select feature range. By default, the range is set to (0,1).

What does normalize true do in Pandas?

With normalize set to True , returns the relative frequency by dividing all values by the sum of values. Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique apparitions of values, divide the index in the specified number of half-open bins.

How do I standardize a column of Pandas DataFrame?

Method 1: Implementation in pandas [Z-Score] In this method, we are going to standardize the first column of the data set using pandas built-in functions mean() and std() which will give the mean and standard deviations of the column data.


In [92]: df
Out[92]:
           a         b          c         d
A  -0.488816  0.863769   4.325608 -4.721202
B -11.937097  2.993993 -12.916784 -1.086236
C  -5.569493  4.672679  -2.168464 -9.315900
D   8.892368  0.932785   4.535396  0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
          a         b         c         d
A  0.085789 -0.394348  0.337016 -0.109935
B -0.463830  0.164926 -0.650963  0.256714
C -0.158129  0.605652 -0.035090 -0.573389
D  0.536170 -0.376229  0.349037  0.426611

In [95]: df_norm.mean()
Out[95]:
a   -2.081668e-17
b    4.857226e-17
c    1.734723e-17
d   -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a    1
b    1
c    1
d    1

If you don't mind importing the sklearn library, I would recommend the method talked on this blog.

import pandas as pd
from sklearn import preprocessing

data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
cols = data.columns
df = pd.DataFrame(data)
df

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
df_normalized

You can use apply for this, and it's a bit neater:

import numpy as np
import pandas as pd

np.random.seed(1)

df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)

          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565

df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448

Also, it works nicely with groupby, if you select the relevant columns:

df['grp'] = ['A', 'A', 'B', 'B']

          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B


df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5

Slightly modified from: Python Pandas Dataframe: Normalize data between 0.01 and 0.99? but from some of the comments thought it was relevant (sorry if considered a repost though...)

I wanted customized normalization in that regular percentile of datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define it other than my sample, or a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assumes your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So i built a custom function (used extra steps in the code here to make it as readable as possible):

def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):    
    if low=='min':
        low=min(s)
    elif low=='abs':
        low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
    if hi=='max':
        hi=max(s)
    elif hi=='abs':
        hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))

    if center=='mid':
        center=(max(s)+min(s))/2
    elif center=='avg':
        center=mean(s)
    elif center=='median':
        center=median(s)

    s2=[x-center for x in s]
    hi=hi-center
    low=low-center
    center=0.

    r=[]

    for x in s2:
        if x<low:
            r.append(0.)
        elif x>hi:
            r.append(1.)
        else:
            if x>=center:
                r.append((x-center)/(hi-center)*0.5+0.5)
            else:
                r.append((x-low)/(center-low)*0.5+0.)

    if insideout==True:
        ir=[(1.-abs(z-0.5)*2.) for z in r]
        r=ir

    rr =[x-(x-0.5)*shrinkfactor for x in r]    
    return rr

This will take in a pandas series, or even just a list and normalize it to your specified low, center, and high points. also there is a shrink factor! to allow you to scale down the data away from endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib:Single pcolormesh with more than one colormap using Matplotlib) So you can likely see how the code works, but basically say you have values [-5,1,10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, our "10" is treated as a 7 effectively) with a midpoint of 2, but shrink it to fit a 256 RGB colormap:

#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:

#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]

So now "2" which is closest to the center, defined as "1" is the highest value.

Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.