Scikit-learn: Predicting new raw and unscaled instance using models trained with scaled data

I have produced different classifier models using scikit-learn and this has been smooth sailing. Because the features have different units (I got the data from different sensors, labeled with their corresponding categories), I opted to scale them using the StandardScaler module.

The resulting accuracy scores of the different classifiers were fine. However, when I try to use a model to predict a raw (i.e., unscaled) instance of sensor values, the models output the wrong classification.

Should this really be the case because of the scaling done to the training data? If so, is there an easy way to scale the raw values too? I would like to use model persistence for this via joblib, and it would be appreciated if there is a way to make this as modular as possible. That is, I would rather not record the mean and standard deviation of each feature by hand every time the training data changes.
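
Roughly, the workflow looks like the following minimal sketch (synthetic data and an SVM stand in for the actual sensor data and classifiers):

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# stand-in for the sensor data: features on very different scales
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[:, 0] *= 1000.0  # simulate one sensor reporting in much larger units

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

clf = SVC()
clf.fit(X_scaled, y)

print(clf.predict(X_scaled[:1]))  # scaled instance: what the model expects
print(clf.predict(X[:1]))         # raw instance: often misclassified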

asked Feb 08 '16 by jc1012


1 Answer

Should this really be the case because of the scaling done to the training data?

Yes, this is expected behavior. You trained your model on scaled data, so it will only work correctly on inputs that are scaled the same way.

If so, is there an easy way to scale the raw values too?

Yes, just save your scaler.

# Training
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data and scale it
...
# do some training, then save the classifier, and save the scaler too!
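
For example, persisting both objects with joblib (the filenames and the clf variable are placeholders for your trained classifier):

import joblib

joblib.dump(clf, "classifier.joblib")  # the trained classifier
joblib.dump(scaler, "scaler.joblib")   # the fitted scaler (it holds the per-feature statistics)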

then

# Testing / prediction time
import joblib

scaler = joblib.load("scaler.joblib")  # load the scaler saved during training (placeholder filename)
scaled_instances = scaler.transform(raw_instances)  # apply the same scaling to the raw values
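
If you want this even more modular, a scikit-learn Pipeline bundles the scaler and the classifier into a single object, so only one thing has to be fitted, persisted, and loaded (a sketch, again with an SVM as a placeholder classifier):

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

model = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
model.fit(X_train, y_train)          # the scaler is fitted and applied internally
joblib.dump(model, "model.joblib")   # placeholder filename

# later: raw instances can be passed directly, the pipeline scales them first
model = joblib.load("model.joblib")
predictions = model.predict(raw_instances)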

That is, I would rather not record the mean and standard deviation of each feature by hand every time the training data changes

This is exactly what has to happen, although not by hand, since that is what the scaler computes for you. Essentially, "under the hood" the fitted scaler stores the mean and standard deviation of each feature and reuses them whenever you call transform.
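
You can inspect these stored statistics on a fitted StandardScaler through its mean_ and scale_ attributes, for example:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])  # toy data

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature means, here [  2. 400.]
print(scaler.scale_)  # per-feature standard deviations used for scaling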

answered Sep 19 '22 by lejlot