I'm trying to learn scikit-learn and machine learning by using the Boston Housing Data Set.
# I split the initial dataset ('housing_X' and 'housing_y')
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd,X_train,y_train)
Based on this new model clf_sgd, I am trying to predict the y based on the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled)
print (y_new)
However, the result looks quite odd to me (1.34032174 instead of 20-30, the range of the house prices):
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I have not been able to figure out how to do it. Any tip is welcome. Thank you very much.
The sklearn predict method predicts an output. Scikit-learn provides a set of tools for training and evaluating machine learning models, and it also has tools to predict an output value once the model is trained (for ML techniques that actually make predictions).
fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data. Here, the scaler learns the mean and variance of each feature of the training set. These learned parameters are then used to scale the test data with transform().
When fit_transform() is called on a feature F, it calculates the mean (μ) and standard deviation (σ) of F and, in the same step, transforms the data points of F.
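As a rough illustration of the fit/transform split described above, here is a minimal sketch on made-up data. The arrays F_train and F_test are hypothetical and stand in for a single feature F; they are not taken from the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data for one feature F (values chosen only for illustration).
F_train = np.array([[1.0], [2.0], [3.0], [4.0]])
F_test = np.array([[2.5], [5.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the
# training data and scales the training data in the same call.
F_train_scaled = scaler.fit_transform(F_train)

# transform reuses the parameters learned from the training data;
# nothing is re-estimated from the test data.
F_test_scaled = scaler.transform(F_test)

print(scaler.mean_)    # mean learned from F_train
print(scaler.scale_)   # standard deviation learned from F_train
print(F_test_scaled)
```

The test value 2.5 maps to 0 because it equals the training mean, which shows that the test data is scaled with the training parameters.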
You can use inverse_transform using your scalery object:
y_new_inverse = scalery.inverse_transform(y_new)
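A fuller sketch of the same idea, in case it helps. It uses synthetic data in place of the housing set (X and y below are made up), and it assumes a recent scikit-learn, where scalers and predict expect 2D input, so the 1D target is reshaped before scaling and before inverse_transform.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: "prices" centred around 25.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 25.0 + 5.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Scalers expect 2D arrays, so the target is reshaped to one column.
scalerX = StandardScaler().fit(X)
scalery = StandardScaler().fit(y.reshape(-1, 1))

X_scaled = scalerX.transform(X)
y_scaled = scalery.transform(y.reshape(-1, 1)).ravel()

clf_sgd = SGDRegressor(random_state=42).fit(X_scaled, y_scaled)

# Predict for the first instance (kept 2D via slicing), then map the
# scaled prediction back to the original units of y.
y_new = clf_sgd.predict(X_scaled[:1])
y_new_inverse = scalery.inverse_transform(y_new.reshape(-1, 1)).ravel()

print(y_new)          # prediction in scaled units
print(y_new_inverse)  # prediction in the original units of y
```

The scaled prediction is a small number around zero, while the inverse-transformed value lands back in the range of the original target.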
Bit late to the game: just don't scale your y. By scaling y you actually lose your units. The regression or loss optimization is actually determined by the relative differences between the features. BTW, for house prices (or any other monetary value) it is common practice to take the logarithm. Then you obviously need to apply numpy.exp() to get back to the actual dollars/euros/yen...
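A minimal sketch of the logarithm approach, assuming strictly positive prices. The data is made up, and LinearRegression is used here only to keep the example short; it is not the model from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, strictly positive "prices" that are linear in log space.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))
price = np.exp(3.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1])

# Train on log(price) instead of a standardized price.
model = LinearRegression().fit(X, np.log(price))

# Predictions come out in log units; numpy.exp() converts them
# back to the original currency units.
log_pred = model.predict(X[:1])
price_pred = np.exp(log_pred)

print(price_pred, price[:1])
```

Because the target is never standardized, no scaler object has to be kept around for y; the only inverse step is the exponential.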