The data I am using has some null values, and I want to impute them using KNN imputation. To impute effectively, I want to normalize the data first.
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit_transform(data[num_cols])  # columns with numeric values
Error: Input contains NaN, infinity or a value too large for dtype('float64').
So how do I normalize data that contains NaN values?
Deleting the row with missing data: if a row has missing data, you can delete the entire row with all the features in that row. In pandas' dropna, axis=0 drops rows containing NaN values and axis=1 drops columns containing NaN values. For example:
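A minimal sketch, assuming a pandas DataFrame named df (the variable name and values are only illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})

df_rows = df.dropna(axis=0)  # keep only rows without any NaN
df_cols = df.dropna(axis=1)  # keep only columns without any NaN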
You can normalize data to the 0-1 range by using the formula (data - np.min(data)) / (np.max(data) - np.min(data)).
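Note that with NaNs present, np.min and np.max themselves return NaN, so a NaN-aware variant with np.nanmin / np.nanmax is needed; a minimal sketch with made-up values:

import numpy as np

data = np.array([7., 4., np.nan, 5., 7000.])
scaled = (data - np.nanmin(data)) / (np.nanmax(data) - np.nanmin(data))
# NaN entries stay NaN; the other values land in [0, 1]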
Using MinMaxScaler() to normalize data in Python: this is a popular choice for normalizing datasets. The values in the output are between 0 and 1. MinMaxScaler also lets you select the feature range; by default, the range is set to (0, 1).
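A minimal sketch, assuming scikit-learn >= 0.20, where MinMaxScaler ignores NaNs when fitting and leaves them as NaN in the output (the DataFrame here is illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})

scaler = MinMaxScaler(feature_range=(0, 1))  # (0, 1) is the default range
scaled = scaler.fit_transform(df)            # NaNs are skipped when fitting and kept in the output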
I would suggest not using normalize in sklearn, as it does not deal with NaNs. You can simply use the code below to normalize your data.
df['col'] = (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())
The method above ignores NaNs while normalizing the data.
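To apply the same min-max formula to every numeric column at once, a minimal sketch (num_cols is assumed to be the list of numeric column names from the question):

for col in num_cols:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())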
sklearn.preprocessing.Normalizer is not about 0 mean, 1 stdev normalization like the other answers to date. Normalizer() is about scaling rows to unit norm, e.g. to improve clustering or for the original question's imputation. You can read about the differences here and here. For scaling rows you could try something like this:
import numpy as np
from numpy import nan

A = np.array([[  7,     4,    5, 7000],
              [  1,   900,    9,  nan],
              [  5, -1000,  nan,  100],
              [nan,   nan,    3, 1000]])
#Compute NaN-norms
L1_norm = np.nansum(np.abs(A), axis=1)
L2_norm = np.sqrt(np.nansum(A**2, axis=1))
max_norm = np.nanmax(np.abs(A), axis=1)
#Normalize rows
A_L1 = A / L1_norm[:,np.newaxis] # use A.values if A is a DataFrame
A_L2 = A / L2_norm[:,np.newaxis]
A_max = A / max_norm[:,np.newaxis]
#Check that it worked
L1_norm_after = np.nansum(np.abs(A_L1), axis=1)
L2_norm_after = np.sqrt(np.nansum(A_L2**2, axis=1))
max_norm_after = np.nanmax(np.abs(A_max), axis=1)
In[182]: L1_norm_after
Out[182]: array([1., 1., 1., 1.])
In[183]: L2_norm_after
Out[183]: array([1., 1., 1., 1.])
In[184]: max_norm_after
Out[184]: array([1., 1., 1., 1.])
If Google brought you here (like me) and you want to normalize columns to 0 mean, 1 stdev using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).
from numpy import nan, nanmean
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
A = [[  7,     4,    5, 7000],
     [  1,   900,    9,  nan],
     [  5, -1000,  nan,  100],
     [nan,   nan,    3, 1000]]
scaler.fit(A)
In [45]: scaler.mean_
Out[45]: array([4.33333333, -32., 5.66666667, 2700.])
In [46]: scaler.transform(A)
Out[46]: array([[ 1.06904497, 0.04638641, -0.26726124, 1.40399977],
[-1.33630621, 1.20089267, 1.33630621, nan],
[ 0.26726124, -1.24727908, nan, -0.84893009],
[ nan, nan, -1.06904497, -0.55506968]])
In [54]: nanmean(scaler.transform(A), axis=0)
Out[54]: array([ 1.48029737e-16,  0.00000000e+00, -1.48029737e-16,  0.00000000e+00])
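Since the original goal was KNN imputation, the scaled data can then be handed to sklearn.impute.KNNImputer; a minimal sketch, assuming sklearn >= 0.22 (where KNNImputer was introduced) and the scaler fitted above:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)          # n_neighbors chosen arbitrarily here
A_scaled = scaler.transform(A)               # NaNs are preserved by StandardScaler
A_imputed = imputer.fit_transform(A_scaled)  # NaNs filled in from the nearest rows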
This method normalizes all the columns to [0, 1], and NaN values remain NaN:
def norm_to_zero_one(df):
return (df - df.min()) * 1.0 / (df.max() - df.min())
For example:
[In]
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})
df = df.apply(norm_to_zero_one)
[Out]
A B
0 0.0 0.000000
1 0.5 NaN
2 NaN 1.000000
3 1.0 0.444444
df.max() and df.min() return the max and min of each column.