Normalise between 0 and 1 ignoring NaN


For a list of numbers ranging from x to y that may contain NaN, how can I normalise it between 0 and 1, leaving the NaN values as NaN?

Typically I would use MinMaxScaler (ref page) from sklearn.preprocessing, but it cannot handle NaN: it recommends imputing the values based on the mean, median, etc., and doesn't offer the option to simply ignore all the NaN values.

Asked Sep 28 '16 by JakeCowton

People also ask

How do you normalize data to the 0 to 1 range?

To normalize a value to the 0 to 1 range, subtract the minimum value of the dataset from it, then divide the result by the difference between the maximum and minimum values of the dataset.
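
For example, applied to a small made-up list (the numbers are just for illustration):

import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])
normalised = (values - values.min()) / (values.max() - values.min())
print(normalised)  # [0.         0.33333333 0.66666667 1.        ]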

What happens when you normalize data that is on different scales?

When you normalize data that is on different scales, all of the values are transformed to the same scale/range, for example between 0 and 1. The lowest value in the data becomes 0, the highest value becomes 1, and every other value falls somewhere in between.

How to normalize columns to 0 mean and 1 stdev?

If Google brought you here (like me) and you want to normalize columns to 0 mean and 1 stdev using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; it didn't work on some older versions).
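
A minimal sketch of that approach, assuming a made-up single-column DataFrame (the column name 'A' and the values are just for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0]})
scaler = StandardScaler()
# NaN rows are ignored when fitting and come back as NaN after transform
df[['A']] = scaler.fit_transform(df[['A']])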

How can I ignore NaN values in sklearn?

MinMaxScaler from sklearn.preprocessing cannot handle NaN; it recommends imputing the values based on the mean, median, etc. and doesn't offer the option to ignore the NaN values. Instead, use np.nanmax and np.nanmin in place of np.max and np.min; the rest should work fine, and the NaN values stay as NaN.
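
A minimal sketch of that idea, applied directly with NumPy to a made-up list containing NaN:

import numpy as np

x = np.array([3.0, 4.0, np.nan, 6.0, 5.0])
# NaN entries stay NaN; everything else is scaled to [0, 1]
normalised = (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))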


2 Answers

Consider the pd.Series s:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([3, 4, 5, 6, np.nan], 100))
s.hist()

[histogram of s]


Option 1
Min Max Scaling

Because pandas skips NaN by default in min() and max() (and NaN propagates through sub and div), the NaN entries simply stay NaN.

new = s.sub(s.min()).div(s.max() - s.min())
new.hist()

[histogram of new]


NOT WHAT OP ASKED FOR
I put these in because I wanted to

Option 2
sigmoid

sigmoid = lambda x: 1 / (1 + np.exp(-x))
new = sigmoid(s.sub(s.mean()))
new.hist()

[histogram of new]


Option 3
tanh (hyperbolic tangent)

new = np.tanh(s.sub(s.mean())).add(1).div(2)
new.hist()

[histogram of new]

Answered by piRSquared


Here's a different approach, and one that I believe answers the OP correctly; the only difference is that it works on a DataFrame instead of a list (you can easily put your list in a DataFrame, as done below). The other options didn't work for me because I needed to store the MinMaxScaler in order to reverse the transform after a prediction was made. So instead of passing the entire column to the MinMaxScaler, you can filter out the NaNs for both the target and the input.

Solution Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

d = pd.DataFrame({'A': [0, 1, 2, 3, np.nan, 3, 2]})

null_index = d['A'].isnull()

d.loc[~null_index, ['A']] = scaler.fit_transform(d.loc[~null_index, ['A']])
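
Since the point of keeping the scaler is to reverse the transform later (e.g. after a model prediction), the same mask can be reused for that step; a brief sketch under the same setup:

# Undo the scaling on the non-NaN rows only; NaN rows are left as they are
d.loc[~null_index, ['A']] = scaler.inverse_transform(d.loc[~null_index, ['A']])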
Answered by Chris Farr