How to scale dataframes consistently with sklearn's MinMaxScaler()

I have three data frames that are each scaled individually with MinMaxScaler().

from sklearn.preprocessing import MinMaxScaler

def scale_dataframe(values_to_be_scaled):
    values = values_to_be_scaled.astype('float64')
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)

    return scaled

scaled_values = []
for i in range(num_df):
    scaled_values.append(scale_dataframe(df[i].values))

The problem I am having is that each dataframe gets scaled according to its own individual set of column min and max values. I need all of my dataframes to be scaled with the same set of column min and max values, as if they were computed over the combined data. Is there a way to accomplish this with MinMaxScaler()? One option would be to make one large dataframe, then scale it before partitioning, but this would not be ideal.
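
For concreteness, here is a minimal sketch of that workaround (stand-in data for illustration; `df` and `num_df` mirror the names in my loop above):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for the actual dataframes.
df = [pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]}),
      pd.DataFrame({'a': [10.0, 20.0], 'b': [30.0, 40.0]})]
num_df = len(df)

# Fit one scaler on the concatenated data, then transform each frame
# separately, so every frame shares the global per-column min and max.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(pd.concat(df).values)
scaled_values = [scaler.transform(df[i].values) for i in range(num_df)]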

People also ask

How do you do min-max scaling?

A min-max scaling is typically done via the following equation: X_sc = (X - X_min) / (X_max - X_min). One family of algorithms that is scale-invariant encompasses tree-based learning algorithms.
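
To illustrate the formula, a small hand-rolled example (not part of the original snippet):

import numpy as np

X = np.array([1.0, 5.0, 10.0])
X_sc = (X - X.min()) / (X.max() - X.min())   # the min-max formula by hand
print(X_sc)                                  # [0.         0.44444444 1.        ]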

Does MinMaxScaler normalize data?

You can normalize your dataset using the scikit-learn object MinMaxScaler. Good practice usage with the MinMaxScaler and other scaling techniques is as follows: fit the scaler using available training data, then apply it to transform the training data and any new data.
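
A minimal sketch of that practice (illustrative data, not from the original snippet):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [10.0]])   # data the scaler is allowed to see
X_test = np.array([[2.0], [12.0]])           # unseen data

scaler = MinMaxScaler()
scaler.fit(X_train)                          # learn min/max from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # values may fall outside [0, 1]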

Which is better MinMaxScaler or StandardScaler?

StandardScaler is useful for features that follow a normal distribution. MinMaxScaler may be used when the upper and lower boundaries are well known from domain knowledge (e.g. pixel intensities that go from 0 to 255 in the RGB color range).
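
For a concrete feel of the difference, a small illustrative sketch (not from the original snippet):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[0.0], [127.5], [255.0]])           # e.g. pixel intensities

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # mapped onto [0, 1]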


1 Answer

Check out the excellent sklearn docs for MinMaxScaler.

As you can see, there is support for partial_fit()! This enables online/minibatch scaling: the scaler's per-column min and max statistics are updated incrementally with each batch, and you control the minibatches.

Example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

a = np.array([[1,2,3]])
b = np.array([[10,20,30]])
c = np.array([[5, 10, 15]])

""" Scale on all datasets together in one batch """
offline_scaler = MinMaxScaler()
offline_scaler.fit(np.vstack((a, b, c)))                # fit on whole data at once
a_offline_scaled = offline_scaler.transform(a)
b_offline_scaled = offline_scaler.transform(b)
c_offline_scaled = offline_scaler.transform(c)
print('Offline scaled')
print(a_offline_scaled)
print(b_offline_scaled)
print(c_offline_scaled)

""" Scale on all datasets together in minibatches """
online_scaler = MinMaxScaler()
online_scaler.partial_fit(a)                            # partial fit 1
online_scaler.partial_fit(b)                            # partial fit 2
online_scaler.partial_fit(c)                            # partial fit 3
a_online_scaled = online_scaler.transform(a)
b_online_scaled = online_scaler.transform(b)
c_online_scaled = online_scaler.transform(c)
print('Online scaled')
print(a_online_scaled)
print(b_online_scaled)
print(c_online_scaled)

Output:

Offline scaled
[[ 0.  0.  0.]]
[[ 1.  1.  1.]]
[[ 0.44444444  0.44444444  0.44444444]]
Online scaled
[[ 0.  0.  0.]]
[[ 1.  1.  1.]]
[[ 0.44444444  0.44444444  0.44444444]]
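
Mapped back onto the question's setup, one partial_fit() pass per dataframe gives every frame the same global scale. A sketch, assuming the question's df list and num_df are in scope:

from sklearn.preprocessing import MinMaxScaler

shared_scaler = MinMaxScaler(feature_range=(0, 1))

# First pass: accumulate global per-column min/max across all dataframes.
for i in range(num_df):
    shared_scaler.partial_fit(df[i].values.astype('float64'))

# Second pass: transform each dataframe with the shared statistics.
scaled_values = [shared_scaler.transform(df[i].values.astype('float64'))
                 for i in range(num_df)]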