The data I am using has some null values, and I want to impute them using KNN imputation. To impute effectively, I want to normalize the data first.
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit_transform(data[num_cols])  # columns with numeric values
Error: Input contains NaN, infinity or a value too large for dtype('float64').
So how do I normalize data that contains NaN values?
Deleting the row with missing data: if a row has missing data, you can delete the entire row with all the features in that row. In pandas' dropna, axis=0 drops rows containing NaN values and axis=1 drops columns containing NaN values. For example:
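A minimal sketch, assuming a pandas DataFrame named df (the variable name and values are only illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})

df_rows = df.dropna(axis=0)  # keep only rows without any NaN
df_cols = df.dropna(axis=1)  # keep only columns without any NaN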
You can normalize data to the 0-1 range by using the formula (data - np.min(data)) / (np.max(data) - np.min(data)).
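Note that with NaNs present, np.min and np.max themselves return NaN, so a NaN-aware variant with np.nanmin / np.nanmax is needed; a minimal sketch with made-up values:

import numpy as np

data = np.array([7., 4., np.nan, 5., 7000.])
scaled = (data - np.nanmin(data)) / (np.nanmax(data) - np.nanmin(data))
# NaN entries stay NaN; the other values land in [0, 1]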
Using MinMaxScaler() to normalize data in Python: this is a popular choice for normalizing datasets. The values in the output are between 0 and 1. MinMaxScaler also lets you select the feature range; by default, the range is set to (0, 1).
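A minimal sketch, assuming scikit-learn >= 0.20, where MinMaxScaler ignores NaNs when fitting and leaves them as NaN in the output (the DataFrame here is illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})

scaler = MinMaxScaler(feature_range=(0, 1))  # (0, 1) is the default range
scaled = scaler.fit_transform(df)            # NaNs are skipped when fitting and kept in the output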
I would suggest not using normalize in sklearn, as it does not deal with NaNs. You can simply use the code below to normalize your data.
df['col'] = (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())
The method above ignores NaNs while normalizing the data.
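To apply the same min-max formula to every numeric column at once, a minimal sketch (num_cols is assumed to be the list of numeric column names from the question):

for col in num_cols:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())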
sklearn.preprocessing.Normalizer is not about 0 mean, 1 stdev normalization like the other answers to date. Normalizer() is about scaling rows to unit norm, e.g. to improve clustering or for the original question's imputation. You can read about the differences here and here. For scaling rows you could try something like this:
import numpy as np
from numpy import nan

A = np.array([[  7,     4,    5, 7000],
              [  1,   900,    9,  nan],
              [  5, -1000,  nan,  100],
              [nan,   nan,    3, 1000]])
#Compute NaN-norms
L1_norm = np.nansum(np.abs(A), axis=1)
L2_norm = np.sqrt(np.nansum(A**2, axis=1))
max_norm = np.nanmax(np.abs(A), axis=1)
#Normalize rows
A_L1 = A / L1_norm[:,np.newaxis] # use A.values if A is a DataFrame
A_L2 = A / L2_norm[:,np.newaxis]
A_max = A / max_norm[:,np.newaxis]
#Check that it worked
L1_norm_after = np.nansum(np.abs(A_L1), axis=1)
L2_norm_after = np.sqrt(np.nansum(A_L2**2, axis=1))
max_norm_after = np.nanmax(np.abs(A_max), axis=1)
In[182]: L1_norm_after
Out[182]: array([1., 1., 1., 1.])
In[183]: L2_norm_after
Out[183]: array([1., 1., 1., 1.])
In[184]: max_norm_after
Out[184]: array([1., 1., 1., 1.])
If Google brought you here (like me) and you want to normalize columns to 0 mean, 1 stdev using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).
from numpy import nan, nanmean
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
A = [[  7,     4,    5, 7000],
     [  1,   900,    9,  nan],
     [  5, -1000,  nan,  100],
     [nan,   nan,    3, 1000]]
scaler.fit(A)
In [45]: scaler.mean_
Out[45]: array([4.33333333, -32., 5.66666667, 2700.])
In [46]: scaler.transform(A)
Out[46]: array([[ 1.06904497, 0.04638641, -0.26726124, 1.40399977],
[-1.33630621, 1.20089267, 1.33630621, nan],
[ 0.26726124, -1.24727908, nan, -0.84893009],
[ nan, nan, -1.06904497, -0.55506968]])
In [54]: nanmean(scaler.transform(A), axis=0)
Out[54]: array([ 1.48029737e-16,  0.00000000e+00, -1.48029737e-16,  0.00000000e+00])
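Since the original goal was KNN imputation, the scaled data can then be handed to sklearn.impute.KNNImputer; a minimal sketch, assuming sklearn >= 0.22 (where KNNImputer was introduced) and the scaler fitted above:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)          # n_neighbors chosen arbitrarily here
A_scaled = scaler.transform(A)               # NaNs are preserved by StandardScaler
A_imputed = imputer.fit_transform(A_scaled)  # NaNs filled in from the nearest rows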
This method normalizes all the columns to [0, 1], and NaN values remain NaN:
def norm_to_zero_one(df):
return (df - df.min()) * 1.0 / (df.max() - df.min())
For example:
[In]
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})
df = df.apply(norm_to_zero_one)
[Out]
A B
0 0.0 0.000000
1 0.5 NaN
2 NaN 1.000000
3 1.0 0.444444
df.max() and df.min() return the max and min of each column.