How to normalize data with NaN values in Python

The data I am using has some null values, and I want to impute them using KNN imputation. To impute effectively, I want to normalize the data first.

from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit_transform(data[num_cols])  # num_cols: the columns with numeric values

Error: Input contains NaN, infinity or a value too large for dtype('float64').

So how do I normalize data that contains NaN values?

Jayashree Gowda asked Jan 16 '18 07:01

3 Answers

I would suggest not using sklearn's Normalizer, as it does not handle NaNs. You can simply use the code below to normalize your data.


The above method ignores NaNs while normalizing the data.
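
The code for this answer is missing from this copy. As a hedged sketch of the general idea (not necessarily the answerer's exact code), pandas reductions such as mean() and std() skip NaN by default, so a column-wise normalization can be written directly on the DataFrame. The column names and values below are made up for illustration, and z-scoring is only one possible choice of normalization. Note that pandas' std() uses the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are invented for this example.
data = pd.DataFrame({"height": [150.0, np.nan, 170.0, 190.0],
                     "weight": [50.0, 60.0, np.nan, 70.0]})
num_cols = ["height", "weight"]

# mean() and std() skip NaN by default (skipna=True), so missing entries
# do not affect the statistics and simply stay NaN in the result.
scaled = (data[num_cols] - data[num_cols].mean()) / data[num_cols].std()
```

The missing entries are left in place, so the result can be passed on to an imputation step afterwards.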

Sociopath answered Oct 20 '22 23:10


sklearn.preprocessing.Normalizer is not about 0 mean, 1 stdev normalization, which is what the other answers to date assume. Normalizer() scales each row to unit norm, e.g. to improve clustering or, as in the original question, imputation. You can read about the differences here and here. For scaling rows, you could try something like this:

import numpy as np
from numpy import nan  # the array literal below uses the bare name nan

A = np.array([[  7,     4,   5,  7000],
              [  1,   900,   9,   nan],
              [  5, -1000, nan,   100],
              [nan,   nan,   3,  1000]])

#Compute NaN-norms
L1_norm = np.nansum(np.abs(A), axis=1)
L2_norm = np.sqrt(np.nansum(A**2, axis=1))
max_norm = np.nanmax(np.abs(A), axis=1)

#Normalize rows
A_L1 =  A / L1_norm[:,np.newaxis] # A.values if Dataframe
A_L2 =  A / L2_norm[:,np.newaxis]
A_max = A / max_norm[:,np.newaxis]

#Check that it worked
L1_norm_after = np.nansum(np.abs(A_L1), axis=1)
L2_norm_after = np.sqrt(np.nansum(A_L2**2, axis=1))
max_norm_after = np.nanmax(np.abs(A_max), axis=1)

 In[182]: L1_norm_after
Out[182]: array([1., 1., 1., 1.])

 In[183]: L2_norm_after
Out[183]: array([1., 1., 1., 1.])

 In[184]: max_norm_after
Out[184]: array([1., 1., 1., 1.])

If Google brought you here (like me) and you want to normalize columns to 0 mean and 1 stdev using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).

from numpy import nan, nanmean
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

A = [[  7,     4,   5,  7000],
     [  1,   900,   9,   nan],
     [  5, -1000, nan,   100],
     [nan,   nan,   3,  1000]]

scaler.fit(A)  # fit first; mean_ and transform() are only available afterwards


In [45]: scaler.mean_
Out[45]: array([4.33333333,  -32.,    5.66666667, 2700.])

In [46]: scaler.transform(A)
Out[46]: array([[ 1.06904497,  0.04638641, -0.26726124,  1.40399977],
                [-1.33630621,  1.20089267,  1.33630621,         nan],
                [ 0.26726124, -1.24727908,         nan, -0.84893009],
                [        nan,         nan, -1.06904497, -0.55506968]])

In [54]: nanmean(scaler.transform(A), axis=0)
Out[54]: array([ 1.48029737e-16,  0.00000000e+00, -1.48029737e-16,  0.00000000e+00])
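
To connect this back to the original question (normalize, then impute with KNN), the scaler can be chained with an imputer. The following is a sketch under assumptions: it uses sklearn.impute.KNNImputer, which requires scikit-learn 0.22 or newer, and n_neighbors=2 is an arbitrary value picked for this tiny array.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

A = np.array([[7, 4, 5, 7000],
              [1, 900, 9, np.nan],
              [5, -1000, np.nan, 100],
              [np.nan, np.nan, 3, 1000]], dtype=float)

# StandardScaler ignores NaN when fitting and keeps it when transforming;
# KNNImputer then fills the gaps using distances in the scaled space.
pipeline = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
A_imputed = pipeline.fit_transform(A)  # standardized units, no NaN left

# Map the imputed data back to the original units if needed
A_original_units = pipeline.named_steps["standardscaler"].inverse_transform(A_imputed)
```

Scaling before imputing matters here because the fourth column is several orders of magnitude larger than the others and would otherwise dominate the neighbor distances.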

Tapio answered Oct 21 '22 00:10


This method normalizes all the columns to [0, 1], and NaN values remain NaN.

def norm_to_zero_one(df):
    return (df - df.min()) * 1.0 / (df.max() - df.min())

For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})
df = df.apply(norm_to_zero_one)
     A         B
0  0.0  0.000000
1  0.5       NaN
2  NaN  1.000000
3  1.0  0.444444

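Inside apply, the function receives one column at a time, so df.max() and df.min() return the max and min of that column. Both skip NaN values by default, which is why the missing entries do not affect the result.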
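
One edge case with this formula is a column whose observed values are all equal: max minus min is 0, so the division produces NaN (or inf) for every row. A minimal sketch of a guarded variant follows. The name norm_to_zero_one_safe is hypothetical, and mapping a constant column to 0.0 is an assumption, not something the answer specifies.

```python
import numpy as np
import pandas as pd

def norm_to_zero_one_safe(col):
    """Min-max scale one column (hypothetical variant of norm_to_zero_one)."""
    col_min = col.min()            # skips NaN by default
    span = col.max() - col_min
    if pd.isna(span) or span == 0:
        # Constant or all-NaN column: observed values become 0.0, NaN stays NaN
        return (col - col_min).astype(float)
    return (col - col_min) / span

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'C': [5.0, 5.0, np.nan, 5.0]})
out = df.apply(norm_to_zero_one_safe)
```

Column A is scaled as before, while the constant column C becomes 0.0 wherever it had a value and keeps its NaN.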

jz0410 answered Oct 20 '22 23:10
