How to normalize the Train and Test data using MinMaxScaler sklearn

I have a doubt that I have not been able to resolve by searching. Suppose I prepare my data like this:

import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

After this I train and test the model (A and B as features, C as the label) and get some accuracy score. My doubt is what happens when I have to predict the label for a new set of data, say:

df = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

When I normalize these columns, the values of A and B are scaled according to the new data, not the data the model was trained on. After the data preparation step below,

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])

the new values of A and B are scaled with respect to the min and max of the new data, whereas the training values were scaled with respect to the min and max of the training data.

How can the data preparation be valid when the two sets are scaled relative to different numbers? I don't understand how the prediction can be correct here.

asked May 28 '18 11:05

Tia

2 Answers

You should fit the `MinMaxScaler` on the training data only, and then apply the fitted scaler to the test data before the prediction.

In summary:

Step 1: fit the scaler on the TRAINING data
Step 2: use the scaler to transform the TRAINING data
Step 3: use the transformed training data to fit the predictive model
Step 4: use the scaler to transform the TEST data
Step 5: predict using the trained model (step 3) and the transformed TEST data (step 4).
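To see what Steps 1 and 4 mean numerically, here is a minimal sketch using the data from the question. It assumes the default `MinMaxScaler` settings; the names `train`, `new`, and `manual` are illustrative and not part of the original post. With those defaults, `transform` reuses the min and max learned from the training data, so new values outside the training range end up outside [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Training features and new (unseen) features from the question
train = pd.DataFrame({'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
                      'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19]})
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Step 1: the scaler learns min and max from the training data only
scaler = MinMaxScaler().fit(train)

# Step 4: transform() applies (x - min_train) / (max_train - min_train)
scaled = scaler.transform(new)

# The same computation written out by hand, for comparison
manual = (new - train.min()) / (train.max() - train.min())

print(scaler.data_min_, scaler.data_max_)  # training min and max per column
print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

Because the model was trained on features scaled with the training min and max, new data has to pass through the same mapping, even if some scaled values are greater than 1 or less than 0.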

Example using your data:

import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression  # placeholder model (an assumption); any estimator works

min_max_scaler = preprocessing.MinMaxScaler()
#training data
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
#fit and transform the training data and use them for the model training
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

#fit the model on the scaled features (A, B) and the label (C)
model = LogisticRegression()
model.fit(df[['A','B']], df['C'])

#after the model training on the transformed training data define the testing data df_test
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

#before the prediction of the test data, ONLY APPLY the scaler on them
df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])

#test the model
y_predicted_from_model = model.predict(df_test[['A','B']])

Example using iris data:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = datasets.load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = SVC()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
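As a possible alternative to calling the scaler by hand, scikit-learn's `make_pipeline` can chain the scaler and the model so that `fit` only ever sees the training data and `predict` reuses the fitted scaler. This is a sketch rather than part of the original example; the names `pipe` and `accuracy` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() fits the scaler on X_train, scales X_train, then fits the SVC
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X_train, y_train)

# predict() and score() only transform X_test with the already fitted scaler
y_pred = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

With this approach there is no separate scaler object on which `fit_transform` could be called by mistake on the test data.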

Hope this helps.

See also my post here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79

answered Oct 08 '22 23:10

seralouk

The best way is to fit the MinMaxScaler on the training data, save it, and load the same scaler whenever it is required.

Saving model:

import pickle
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
min_max_scaler = MinMaxScaler()
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
with open("scaler.pkl", 'wb') as f:
    pickle.dump(min_max_scaler, f)

Loading saved model:

with open("scaler.pkl", 'rb') as f:
    scalerObj = pickle.load(f)
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
df_test[['A','B']] = scalerObj.transform(df_test[['A','B']])
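As a possible alternative to `pickle`, `joblib` (installed as a dependency of scikit-learn) can persist the fitted scaler in the same way. The sketch below is an illustration and not part of the original answer; the temporary directory, the file name `scaler.joblib`, and the variable names are assumptions. It also checks that the reloaded scaler carries the training min and max and produces the same output as the original.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
                      'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19]})
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Fit once on the training data
scaler = MinMaxScaler().fit(train)

# Save and reload the fitted scaler (hypothetical file name, temporary location)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")
    joblib.dump(scaler, path)
    loaded = joblib.load(path)

# The reloaded scaler keeps the training statistics and gives identical output
same_stats = (np.array_equal(loaded.data_min_, scaler.data_min_)
              and np.array_equal(loaded.data_max_, scaler.data_max_))
same_output = np.allclose(loaded.transform(new), scaler.transform(new))
print(same_stats, same_output)
```

As with `pickle`, a file saved this way should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote it.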

answered Oct 09 '22 00:10

vipin bansal

Tags: python, machine-learning, scikit-learn, normalization, sklearn-pandas