For a list of numbers ranging from x to y that may contain NaN, how can I normalise between 0 and 1, ignoring the NaN values (they stay as NaN)?

Typically I would use MinMaxScaler (ref page) from sklearn.preprocessing, but this cannot handle NaN and recommends imputing the values based on mean or median etc.; it doesn't offer the option to ignore all the NaN values.
The formula for normalizing data to the 0-1 range is: subtract the minimum value of the dataset from the value, then divide by the difference between the maximum and minimum values of the dataset, i.e. x_scaled = (x - min) / (max - min).

When you normalize data that sits on different scales, all values are transformed onto the same scale/range, here between 0 and 1: the lowest value in the data becomes 0, the highest value becomes 1, and every other value falls somewhere in between.
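A minimal sketch of that formula in NumPy (an illustrative array with no NaN handling yet):

import numpy as np

x = np.array([3.0, 4.0, 6.0, 5.0])
# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.         0.33333333 1.         0.66666667]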
If Google brought you here (like me) and you want to normalize columns to mean 0 and standard deviation 1 using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).
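A minimal sketch of that, assuming a sklearn version where the scaler ignores NaN when fitting and passes it through on transform (the answer above reports this works from 0.20.2):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
# NaN rows are skipped when computing the mean/std and stay NaN in the output
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)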
Use np.nanmax and np.nanmin instead of np.max and np.min; they ignore the NaN entries, and the rest of the min-max formula works unchanged (the NaN values stay NaN).
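For example, a short sketch of that on a plain NumPy array (a hypothetical values array standing in for the OP's list):

import numpy as np

values = np.array([3.0, np.nan, 5.0, 6.0, 4.0])
# np.nanmin/np.nanmax skip NaN when computing the range,
# and NaN entries remain NaN after the element-wise arithmetic
scaled = (values - np.nanmin(values)) / (np.nanmax(values) - np.nanmin(values))
print(scaled)  # [0.         nan 0.66666667 1.         0.33333333]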
Consider the pd.Series s:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([3, 4, 5, 6, np.nan], 100))
s.hist()
Option 1
Min Max Scaling
# pandas' min()/max() skip NaN by default, and NaN entries stay NaN after the arithmetic
new = s.sub(s.min()).div(s.max() - s.min())
new.hist()
NOT WHAT OP ASKED FOR
I put these in because I wanted to
Option 2
sigmoid
sigmoid = lambda x: 1 / (1 + np.exp(-x))
new = sigmoid(s.sub(s.mean()))
new.hist()
Option 3
tanh (hyperbolic tangent)
new = np.tanh(s.sub(s.mean())).add(1).div(2)
new.hist()
Here's a different approach, and one that I believe answers the OP correctly. The only difference is that it works on a DataFrame instead of a list; you can easily put your list in a DataFrame, as done below. The other options didn't work for me because I needed to store the MinMaxScaler in order to reverse the transform after a prediction was made. So instead of passing the entire column to the MinMaxScaler, you can filter out the NaNs for both the target and the input.
Solution Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
d = pd.DataFrame({'A': [0, 1, 2, 3, np.nan, 3, 2]})

# Fit and transform only the non-NaN rows; the NaN row is left untouched
null_index = d['A'].isnull()
d.loc[~null_index, ['A']] = scaler.fit_transform(d.loc[~null_index, ['A']])
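Because the fitted scaler is kept, reversing the scaling later works the same way; a minimal sketch, assuming d still holds the scaled values and the same null_index mask:

# invert the scaling on the non-NaN rows using the stored scaler
d.loc[~null_index, ['A']] = scaler.inverse_transform(d.loc[~null_index, ['A']])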